How we load test to ensure application performance

Our framework for designing, running, and iterating tests


As engineers, we know performance isn't just a feature; it's fundamental to our customer experience. Recently, my teammate, Jonny, and I undertook an extensive load testing effort to proactively find and fix scalability issues before they reach customers in production.

While load testing is just one piece of our overall performance strategy, it's a crucial one. Monitoring and optimization help us address current issues, while load testing allows us to peer into the future and ask the critical question: "What happens as our usage grows?"

We answer this question by simulating increased usage patterns in controlled environments to identify bottlenecks, stress points, and performance degradation that might emerge under higher loads. This proactive approach helps us ensure that the Omni app continues to deliver the experience our customers expect as their user base grows and usage patterns evolve.

In this post, I'll walk you through our journey: how we scoped what to test, designed the load profile, chose our tooling, set up the environment, and ran iterative testing cycles.

I’ll also share some of our key learnings, such as:

  • Trying before buying helps focus on the issues that matter most

  • Testing often reveals unexpected bottlenecks

  • Focus on issues that may degrade the customer experience first, optimize second

  • Collaboration is key to success, especially for something deep in the weeds

  • Ensure bottlenecks are the result of genuine application challenges, not test artifacts

What our load testing does and doesn’t cover #

Before diving too far into our approach, it's worth clarifying what parts of our system the load tests do and don’t cover. In a typical web application stack, when a customer loads a website in their browser, these are some of the factors you should consider 👇

Components that can impact performance #

  • Client-side factors: The browser, user's machine (memory, CPU), & client-side rendering of JS/HTML/CSS

  • Network considerations: Connection speed between the customer's machine & our backend or content delivery network (CDN), & CDN delivery times

  • Backend execution: Everything from code execution to communication between components, internal databases, reads/writes to disk or cache

  • External dependencies: In our case, the backend also needs to communicate with the customer's database

Our load tests primarily focused on the backend components: API endpoints, database queries, and server resource utilization, which includes communication with the customer database. We started by emulating the HTTP requests browsers send to our backend since this is excellent for identifying bottlenecks in request handling, data processing, and database performance under increased concurrent usage.

However, these tests won't catch client-side rendering inefficiencies, real-world network variability, or the full complexity of third-party integrations. That's why load testing is just one component of our broader performance strategy, alongside things like user monitoring and targeted performance profiling.

How we designed our load tests #

Step one: Setting our test goals #

First off, we had to define what we wanted to learn from our load testing efforts. For our initial testing, we didn't set a specific response time goal; we decided to define throughput rates (requests per second) and ensure we had sufficient monitoring to identify the bottlenecks with each run.

Our testing was designed to answer critical questions, including:

  • When (and where) will our system hit performance limits? Identifying the exact components that would become bottlenecks first as load increased would allow us to prioritize optimization efforts effectively.

  • Can we handle projected growth? By analyzing current production patterns and anticipated increases, we could validate whether our system would remain responsive as our customer base and usage expanded.

Of course, response time matters tremendously—after all, it's what our customers actually experience. Our approach was to keep fixing bottlenecks as long as the solutions were relatively inexpensive. Once we reached a point where optimizations became costly or complex, we needed to discuss whether the current response times were sufficient for our customers’ needs.

This pragmatic approach allowed us to make incremental improvements and reassess after each test run, which gave us the flexibility to pivot in whichever direction the tests pointed us.

Step two: Building the load profile #

Next up, we set out to build a realistic load profile to uncover potential bottlenecks. This step is especially critical because a poorly designed test scenario can miss critical bottlenecks or optimize for situations that are unlikely to occur in production. It’s part art and part science, so we recommend involving multiple people to continually challenge assumptions.

We focused on our embedded analytics dashboards as our primary test target. These dashboards are used by our customers' customers and see high traffic volumes—ideal for scale testing. Plus, optimizations we made for embedded dashboards would benefit customers’ internal BI dashboards, too.

Dashboards aren't the only part of the application that can run into performance problems, but most other issues are easier to spot without load testing, and we have other systems in place to identify and address them. For this exercise, we specifically set out to identify bottlenecks that would emerge under higher concurrent usage, because those are problems that might not be visible until they impact customers in production.

Example challenge: How do we handle the client database?

One key challenge was deciding how to handle the client database during testing. Since these databases are outside our control but integral to our workflow, we needed a strategy that would:

  1. Exercise our entire execution path, including connection establishment & result caching

  2. Minimize the impact on the client database during testing so it didn’t become the bottleneck 

  3. Provide consistent results that weren't skewed by external factors

Our solution was to replace real queries with simple "SELECT 1" statements that would follow the same execution path but place minimal load on the client database. This allowed us to isolate and identify bottlenecks in our own system while still testing the full workflow. 💡 Remember this point later; it was a decision we had to revisit!

To create the most realistic test conditions, we carefully defined which dashboard features to test or exclude and which parameters to vary across different test runs.

Addressing dashboard tile count variance

The number of tiles per dashboard can significantly impact load times, even under minimal traffic, and our customers' dashboards have great variance in tile count. Rather than testing with a fixed number of tiles, we conducted tests ranging from 1 to 10 tiles per dashboard to understand how this parameter affected response times under increasing load. 
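To make that sweep repeatable, the combinations can be generated up front. Here's a minimal sketch of the idea; the dashboard IDs and target rates below are placeholders, not our actual test matrix:

```javascript
// Illustrative sketch: generate the tile-count / throughput combinations for a sweep.
// Dashboard IDs and target rates are placeholders, not our real values.
const TILE_COUNTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const TARGET_RPS = [10, 25, 50];

function buildTestMatrix() {
  const cases = [];
  for (const tiles of TILE_COUNTS) {
    for (const rps of TARGET_RPS) {
      cases.push({
        // one pre-built test dashboard per tile count (hypothetical naming)
        dashboardId: `load-test-dashboard-${tiles}-tiles`,
        targetRequestsPerSecond: rps,
      });
    }
  }
  return cases;
}

console.log(`${buildTestMatrix().length} runs to execute and compare`); // 30
```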

Disabling cache reading

We also created a more challenging testing environment by deviating from typical usage patterns: we disabled reading from the cache on subsequent requests. While cache reading is an excellent optimization in production, our load tests would repeatedly use the same queries, unlike real-world scenarios where different users make varied requests. This helped us simulate diverse user activity and prevent our results from being artificially improved by unrealistic cache hit rates.

Measuring throughput and response times

When designing our load profile, we established a critical principle: measure and analyze both throughput and response time together. We recognized that response time alone isn't an effective performance measure without considering the request volume. A system responding in 100ms at 10 requests per second might jump to 750ms at 50 requests per second. Without specifying both metrics together, comparisons between test runs become meaningless. This dual-metric approach became essential for validating improvements as we optimized.
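As a small illustration of that principle, a per-run summary should always report the two numbers as a pair. Here's a sketch, assuming a hypothetical array of request records:

```javascript
// Sketch: summarize a run by throughput AND response time together.
// `requests` is a hypothetical array of { startMs, durationMs } records.
function summarizeRun(requests) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const p95Ms = durations[Math.min(durations.length - 1, Math.floor(durations.length * 0.95))];
  const starts = requests.map((r) => r.startMs);
  const windowSec = Math.max(1, (Math.max(...starts) - Math.min(...starts)) / 1000);
  return {
    requestsPerSecond: +(requests.length / windowSec).toFixed(1),
    p95Ms, // quoting either number without the other is meaningless
  };
}
```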

Step three: Tool selection #

Our journey began with the tool k6 because our backend team had already used it successfully in other projects. Most modern load testing tools would have sufficed for the load patterns we wanted to emulate, but k6 offered some specific advantages that made it particularly well-suited for us:

  • Familiarity with JavaScript: k6 allows scripting in JavaScript, which is already part of our tech stack.

  • Cloud-based infrastructure: Grafana's cloud offering for k6 meant we didn't have to set up and maintain our own infrastructure for load generation or reporting. This allowed us to quickly generate high load levels and visualize the results without the overhead of building or managing testing infrastructure.

  • Browser-based testing: As we progressed, we expanded to using the k6 browser to build a more complete performance picture by measuring how long it took for panels to actually render in the browser, not just server response times. (Note: browser testing would require a full blog to do it justice, so we won’t discuss it more here. We may write a follow-up in the future 😀)

K6 is by no means the only load generation tool with these benefits, but it was the most convenient for us. Given the complexity of the overall project, it felt great to have at least one easy choice 😅
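To give a concrete flavor, here's a minimal k6 script in the spirit of our early runs. The endpoint, rate, and thresholds are illustrative assumptions, not our actual configuration:

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    dashboard_load: {
      executor: 'constant-arrival-rate', // drive a fixed throughput (requests per second)
      rate: 50,                          // illustrative target, not our real number
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 200,
    },
  },
  thresholds: {
    // Evaluate response time *at* the driven throughput, per the dual-metric principle above
    http_req_duration: ['p(95)<2000'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  // Hypothetical embedded-dashboard endpoint; the real request shape differs
  const res = http.get(`${__ENV.BASE_URL}/embed/dashboard/load-test?tiles=${__ENV.TILE_COUNT || 5}`);
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```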

Step four: Setting up the test environment #

Next, we needed to decide where to run our tests, which would significantly impact the validity and usefulness of our results.

We chose to create a separate load testing environment that closely mirrored our production setup. While we could have opted for a scaled-down environment and extrapolated results, we wanted to eliminate any artificial bottlenecks that might arise if different components weren't properly scaled relative to each other. This approach gave us confidence that the performance limitations would likely occur on production under similar circumstances and were thus worth fixing.

Our infrastructure is built entirely in AWS using CDK (Cloud Development Kit)—giving tremendous flexibility. Thanks to our cloud engineer's excellent work, we had a system that was easy to alter and configure as needed.

One important deviation from our production configuration: we disabled the rate limiter that normally protects against DDoS attacks. Since our load test traffic would be coming from a limited number of IPs, it would have been flagged as suspicious and throttled, undermining our ability to generate high load levels.

As our testing progressed, we sometimes deliberately scaled up specific components that we suspected might be bottlenecks. This allowed us to distinguish between code or configuration issues to fix versus situations where simply adding more resources would solve the limitation. 

For example, if the entire database was struggling rather than just a single inefficient query, we knew we'd reached a point where upsizing the database instance made sense. Upscaling comes with an additional cost, so optimizing vs. beefing up the hardware was a key consideration.
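As an illustration, here's a hedged CDK sketch (in JavaScript) of how a component's size can be parameterized so a suspected bottleneck can be upsized for a single run. The engine, version, instance family, and context key are placeholders, not our actual stack:

```javascript
const { Stack } = require('aws-cdk-lib');
const ec2 = require('aws-cdk-lib/aws-ec2');
const rds = require('aws-cdk-lib/aws-rds');

class LoadTestStack extends Stack {
  constructor(scope, id, props) {
    super(scope, id, props);
    const vpc = new ec2.Vpc(this, 'LoadTestVpc');

    // Instance size comes from context so a suspected bottleneck can be
    // upsized for a single test run without code changes:
    //   cdk deploy -c dbSize=xlarge
    const dbSize = this.node.tryGetContext('dbSize') || 'large';
    new rds.DatabaseInstance(this, 'LoadTestDb', {
      engine: rds.DatabaseInstanceEngine.postgres({
        version: rds.PostgresEngineVersion.VER_16,
      }),
      instanceType: new ec2.InstanceType(`r6g.${dbSize}`),
      vpc,
    });
  }
}

module.exports = { LoadTestStack };
```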

This production-like environment gave us the confidence that our optimizations would translate directly to real-world improvements once deployed.

Running the load tests #

That was a lot of setup before even running our first test! At last, we were ready to start our testing cycles. This is where the real fun started.

Proper test monitoring #

Effective monitoring during load tests is essential—if all we could see was the response time, it would be hard to know where to improve. We built our monitoring approach on three levels:

  1. Application-level metrics: Our awesome backend team had already added granular timing spans to indicate which portions of our system are working hardest during tests

  2. Infrastructure monitoring: We leveraged our existing production monitoring for CPU/memory usage and added AWS Database Insights, which proved invaluable for identifying blocking queries and database performance issues

  3. End-to-end timing: Measurement of complete request cycles from initiation to completion
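To give a flavor of that third level, here's a minimal k6 sketch using custom Trend metrics. The metric names and endpoint are illustrative, not our real instrumentation:

```javascript
import http from 'k6/http';
import { Trend } from 'k6/metrics';

// Custom timings, broken out so bottlenecks show up per phase rather than
// as a single end-to-end number. Metric names here are illustrative.
const ttfb = new Trend('dashboard_time_to_first_byte', true);
const fullLoad = new Trend('dashboard_end_to_end', true);

export default function () {
  const started = Date.now();
  const res = http.get(`${__ENV.BASE_URL}/embed/dashboard/load-test`); // hypothetical endpoint
  ttfb.add(res.timings.waiting);      // server think time
  fullLoad.add(Date.now() - started); // complete request cycle as seen by the client
}
```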

Grafana's out-of-the-box comparison view was particularly helpful for comparing performance between test runs. Being able to overlay results from multiple runs on a single chart made it easy to visualize whether our changes were actually improving performance vs. simply shifting bottlenecks elsewhere.

[Image: Grafana comparison view overlaying results from multiple test runs]

Keeping it iterative #

After each run, we analyzed the basic stats—response time and throughput—to check progress. Then, we'd dive into the metrics to find the current bottleneck:

  1. Run the test

  2. Identify the slowest component using our monitoring

  3. Implement a fix for that component

  4. Run the test again to confirm improvement

  5. Repeat until we hit our targets or run into diminishing returns

The key was changing only one thing at a time between test runs. This could feel frustrating, but it's actually the fastest path to improvement. The few times we made multiple simultaneous changes, we had to redo them as separate tests anyway to understand their individual impacts.

Most performance optimizations involve trade-offs. For example, adding a database index might speed up reads but slow down writes. By isolating each change, we could accurately measure both the benefits and potential downsides, ensuring we only implemented optimizations that provided genuine value without introducing new problems elsewhere.

Results & learnings #

Early wins: Database query optimization #

Our initial bottlenecks were often missing indexes—a common but impactful issue in database performance. Under light load, the impact of a missing index was minimal, which is why we didn’t notice this before. But under heavier load, the database quickly became our first bottleneck. And our second. And maybe even our third and fourth! AWS Database Insights was extremely helpful here, allowing us to identify problematic queries and their execution plans.

[Image: AWS Database Insights view used to identify problematic queries]

With this added info, we were able to use a rapid experimentation approach to:

  1. Identify a problem query through AWS Database Insights

  2. Look at its explain plan, which typically pointed to a missing index

  3. Apply the change directly to the database

  4. Immediately run another test to verify the impact & recheck insights to check the query impact

  5. Only then, codify the change properly if it proved effective

This "try before you buy" approach saved us considerable time and focused our energy on changes that actually mattered. While we experimented with various database configuration changes, many didn't yield significant improvements, so we avoided the overhead of implementing them permanently. 

Bottleneck example: The job state storage saga #

One particularly interesting bottleneck involved our job state storage system. This is a perfect example of how performance optimization is rarely a "one and done" fix—it's an ever-evolving, back-and-forth process.

In a dashboard request, we create multiple jobs in our backend. Each dashboard tile needs to be planned and, depending on whether we have cached results, potentially executed. This approach lets us process each tile in parallel and stream the fastest results back to the browser as soon as they’re ready, rather than forcing the customer to wait for the longest tile to complete before seeing anything.

[Image: diagram of dashboard tile jobs being planned and executed in parallel]
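Conceptually (and only conceptually, since our backend isn't JavaScript), the flow looks roughly like this; the helper and timings are invented for illustration:

```javascript
// Conceptual sketch only: our backend is JVM-based, not JavaScript.
// Each tile is planned and, if there is no cached result, executed as its own
// job; results stream back as soon as each tile finishes.
async function renderDashboard(tiles, stream) {
  await Promise.all(
    tiles.map(async (tile) => {
      const plan = { tileId: tile.id, sql: tile.query };        // planning job
      const result = tile.cached ?? (await runQuery(plan.sql)); // execution job, skipped on a cache hit
      stream(tile.id, result);                                  // fastest tiles reach the browser first
    })
  );
}

// Hypothetical stand-in for real query execution with variable latency.
const runQuery = (sql) =>
  new Promise((resolve) => setTimeout(() => resolve(`rows for: ${sql}`), Math.random() * 1000));
```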

These jobs need coordination, and we use a state store in our database to pass information between stages. Any time a job is picked up, errors out, or completes, the store must be updated. As you can imagine, both read and write activity on this store are high during a load test with hundreds of thousands of tiles rendering!

Here's how that particular optimization journey unfolded:

  1. Initial discovery: We identified job state storage as a significant bottleneck during a load test.

  2. Database optimization: Our first instinct was to optimize the database where the store is kept. However, even after optimization, we faced an interesting discrepancy: our database reported that queries were completing quickly, but our application still showed slow response times.

  3. Adding metrics: Our next move was to add additional instrumentation to our backend to further segment time spent interacting with the job state store.

  4. Connection library issues: This led us to investigate the connection path between our application and the database. We discovered that our database connection library (r2dbc) was the bottleneck. So, we swapped it out for JDBC with minimal code changes, which resolved the immediate issue. 

This example illustrates an important principle in performance testing: sometimes, the bottleneck isn't where you initially think it is.

The database queries were fast, but the connection library couldn't keep up under load—something we might never have discovered without systematic load testing. Adding metrics to get us granular data helped us pinpoint exactly where to investigate so we could continually identify and resolve system constraints.

Another key lesson we learned was to differentiate between components that are simply slow versus those that degrade under load.

For example, our login flow could arguably be faster, but its performance remained consistent regardless of load on the system. While we'll certainly make it faster in the future, it wasn't causing a worse user experience under increased load. Thus, we prioritized fixing the components that degraded under pressure.

This approach—focusing on bottlenecks that specifically emerge or worsen under load—gave us the biggest bang for our optimization buck. It meant we were addressing issues that would actually impact user experience as our system grew, rather than getting distracted by optimizing components that, while not optimal, would scale adequately.

Everything is better with a buddy #

Throughout this process, we found it invaluable to bounce ideas off teammates.

When you're deep in performance analysis, it's easy to develop tunnel vision or miss obvious solutions. A fresh perspective often leads to breakthrough moments, and even just explaining a problem out loud (the classic "rubber duck debugging" approach) can reveal insights you might otherwise miss.

We made a point of communicating our findings and experiments in a public Slack channel rather than in direct messages. This allowed others in the company who weren't actively involved in the load testing effort to chime in with suggestions or just keep up with what we were changing. This open approach helped us solve problems faster and also spread performance knowledge throughout the organization.

Don't let your load generator (or test design) be your bottleneck! #

It was not always smooth sailing. We ran into a few interesting situations where our recipe of test, fix, retest, and celebrate didn’t work. We always began by assuming the bottleneck was in our system—but sometimes it was in how we were testing it.

When your load generator can't keep up

As we ramped up our testing with increasingly complex filter parameters, we encountered a puzzling situation: a drop in throughput, but no corresponding increase in response times. This is counterintuitive—typically, when your system is struggling, response times increase at the same time as throughput drops.

After trying a variety of Hail Mary fixes that did nothing to move the needle, we took a step back and realized we needed to look at the load generator itself. 

To verify this theory, we added detailed metrics collection to the k6 script. We modified our test scripts to store the time spent on various operations: setting up requests, waiting for responses, and processing response data. We made these metrics quite granular so we could pinpoint exactly where the slowdown was occurring within the testing tool itself.
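The instrumentation looked something like the sketch below; the metric names are illustrative, but the idea is to time the script's own work separately from the server-side timings k6 already records:

```javascript
import http from 'k6/http';
import { check } from 'k6';
import { Trend } from 'k6/metrics';

// Illustrative custom metrics that time the *test script's* own work,
// separately from the server-side timings k6 already records.
const setupTime = new Trend('script_request_setup', true);
const processTime = new Trend('script_response_processing', true);

export default function () {
  let t0 = Date.now();
  // Stand-in for the real request setup, which built long, filter-laden URLs
  const url = `${__ENV.BASE_URL}/embed/dashboard/load-test?filters=${encodeURIComponent('{"region":"emea"}')}`;
  setupTime.add(Date.now() - t0);

  const res = http.get(url); // k6 tracks http_req_* timings for this part

  t0 = Date.now();
  check(res, { 'status is 200': (r) => r.status === 200 }); // stand-in for response processing
  processTime.add(Date.now() - t0);
}
```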

Our investigation revealed an unexpected issue: URL generation using Node's URL library was surprisingly CPU-intensive on k6's cloud infrastructure. The complex URLs we were constructing for our dashboard requests, particularly those with extensive filter parameters, were taxing the load generators themselves! What should have been a sub-second operation was taking upwards of 15-20 seconds in some instances. This could definitely account for the drop in throughput we were encountering.

The solution was surprisingly simple: we switched from using Node's URL library to basic string concatenation for constructing our URLs. This change dramatically reduced CPU usage on the load generation machines, allowing them to generate higher throughput and reveal the true performance characteristics of our application.
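In spirit, the change looked roughly like this. The real filter parameters are far more complex, and in k6 scripts a URL implementation typically comes from a polyfill (shown here via the k6 jslib):

```javascript
import { URL } from 'https://jslib.k6.io/url/1.0.0/index.js';

// Before: building the request URL through the URL / searchParams API,
// which turned out to be surprisingly CPU-hungry on the load generators.
export function buildUrlBefore(base, filters) {
  const url = new URL(`${base}/embed/dashboard/load-test`);
  for (const [key, value] of Object.entries(filters)) {
    url.searchParams.append(key, value);
  }
  return url.toString();
}

// After: plain string concatenation, producing the same URL for a fraction of the CPU.
export function buildUrlAfter(base, filters) {
  const query = Object.entries(filters)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&');
  return `${base}/embed/dashboard/load-test?${query}`;
}
```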

Test design artifacts vs. real-world bottlenecks

We also encountered another category of misleading bottlenecks: those created by our test design rather than reflecting real-world usage patterns.

💡 Remember our clever "SELECT 1" approach to minimize impact on client databases? We discovered that this created an artificial bottleneck at extremely high concurrency. Because every test query was hitting the exact same cache entry with an update attempt, we experienced cache contention that wouldn't occur in real-world scenarios.

In production, customers would typically be running different queries, not thousands of identical ones. To create a more realistic test, we modified our approach to use "SELECT X" where X was a varying number. This simple change distributed the load across cache entries and eliminated the artificial bottleneck.
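Here's a hedged sketch of the idea, using k6's virtual-user and iteration counters to derive the varying literal; exactly where the stub query gets injected depends on the setup, so treat this as illustrative:

```javascript
// Hedged sketch: vary the stub query per virtual user and iteration so each
// request lands on a different cache entry, the way varied real queries would.
export function stubQueryFor(vu, iter) {
  const x = (vu * 10000 + iter) % 9973; // keep the query trivial, just not identical
  return `SELECT ${x}`;
}

// In a k6 script, __VU and __ITER identify the current virtual user and iteration:
// const query = stubQueryFor(__VU, __ITER);
```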

We found similar effects when testing with identical users and dashboards. Updating view counts for the same dashboard thousands of times created contention that wouldn't reflect real usage patterns.

Question everything

These experiences taught us a valuable lesson: always question whether bottlenecks are genuine application issues or test artifacts. By instrumenting both our application and our testing tools, we were able to distinguish between:

  • Real application bottlenecks that need fixing

  • Load generation limitations that need testing adjustments

  • Test design artifacts that don't reflect real-world usage

This awareness helped us focus our optimization efforts on issues that would actually impact our customers, rather than chasing phantoms created by our testing methodology.

Next steps #

Thanks for following along on our recent load testing journey. With all the work we've done, we've expanded our runway to better prepare for increasing usage and ensure our customers continue to have a great experience. We've cleared several performance hurdles and gained valuable insights into how our system behaves under pressure.

But as with any engineering endeavor, our work is far from done. Here's what we're planning next 🛣️

Regular regression testing #

We're setting up automated, scheduled load tests to catch performance regressions before they reach our customers. This "early warning system" will help maintain our performance gains and ensure new features don't inadvertently impact system responsiveness.
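One hedged way to wire that up is to let k6 thresholds act as the pass/fail gate: k6 exits with a non-zero code when a threshold fails, so a scheduled CI job turns red before a regression reaches customers. The limits below are placeholders:

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    nightly_regression: {
      executor: 'constant-arrival-rate',
      rate: 25,             // hold a known throughput so runs stay comparable over time
      timeUnit: '1s',
      duration: '15m',
      preAllocatedVUs: 100,
    },
  },
  thresholds: {
    // k6 fails the run when these are breached, which a scheduled CI job can
    // surface as a red build. Limits here are placeholders.
    http_req_duration: [{ threshold: 'p(95)<2000', abortOnFail: true }],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  http.get(`${__ENV.BASE_URL}/embed/dashboard/load-test`); // same hypothetical endpoint as earlier sketches
}
```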

Bringing load testing in-house #

While Grafana's k6 cloud offering served us well initially, we're planning to run k6 within our own infrastructure using Docker on AWS EKS. This gives us more control and better cost efficiency for our ongoing testing needs.

Integrating reporting into Omni  #

We're working to leverage Omni's own visualization capabilities for our performance reporting. We love using Omni, and we were able to quickly test this in Omni at a recent offsite, so we're planning to do a lot more here soon.

Expanding our load profile #

Our next challenge is broadening our test scenarios to better reflect overall production usage. By testing more features and diverse user workflows, we'll build confidence that our entire system—not just embedded dashboards—performs well under load.

The journey above is just the beginning. By being proactive about performance, we're building a more scalable, responsive system that can grow alongside our customers' needs.

If you have questions about our approach or want to share your own experiences with load and scale testing, feel free to reach out! And if you’d like to see what else our engineering team is working on, check out our weekly product development demos for a peek under the Omni hood 👀