Making load tests life-like

Any test case is an approximation of what real users will do in a system, based on hypotheses about user behavior. That is especially true of load tests, where the objective is to test the system's scalability and performance under "real life" conditions. The quality of the test results will depend heavily on how closely the test manages to simulate real conditions.

If you have a site that allows navigating via categories or using a full-text search, how you scale your back-end systems is likely to depend very much on how much category-navigation vs. searching activity you expect. And if the usage assumptions turn out to be far from reality, you risk showing up at a search party with a system optimized for navigation, or vice versa.

In monolithic systems, making such wrong assumptions about user behavior is less tragic than in distributed (microservice) systems: after all, scaling out a single back-end service that serves both navigation and full-text search will probably help the performance of both, but scaling out a separate full-text search service will do nothing to help category navigation. And while it's safe to say that autoscaling services help adapt to changing load patterns, there are enough scalability bottlenecks in any non-trivial system (hidden and visible) to bring even an autoscaling system to a grinding halt once they are reached.

So let's concentrate on the following two questions to make load tests as life-like as possible:

  1. How can we measure how closely a load test resembles real-life use?
  2. How can we adapt load tests to make them as life-like as possible?

For question 1, we will concentrate on systems that are already in production and in active use. For new systems, there is not really an alternative to making educated guesses about user behavior, but for existing systems we can definitely compare real behavior and simulated behavior.

In many situations, replaying actual user traffic would be the best way to test a system (and obviously the most life-like, since it is actual real-life usage!). However, there are difficulties in using this approach outside of small, component-focused tests. Replaying actual traffic against a full-text search engine works well: no data is modified as part of a search, and searching is usually not contextual, meaning that simulated users don't need to perform certain steps in a certain order or maintain data in a session. But in a more complex system, traffic replays run into several issues:

  • Traffic that modifies data, which is an extremely difficult proposition when testing in a production environment
  • Stale data in traffic captures, which will cause tests to fail when referencing data that was already deleted
  • Dynamic data, such as cookies and session IDs, that needs to be recalculated and replaced in the replayed traffic (see the sketch below)
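
To illustrate the last point: a replay tool has to strip or recalculate such dynamic values before re-sending each captured request. Below is a minimal sketch of that rewriting step, assuming captured requests are available as plain dictionaries; the `/login` endpoint, credentials, and header names are placeholders for illustration.

```python
import requests

def establish_session(base_url: str) -> requests.Session:
    """Log in once so the replay uses fresh cookies instead of captured ones.
    The /login endpoint and credentials are placeholders."""
    session = requests.Session()
    session.post(f"{base_url}/login", data={"user": "loadtest", "password": "secret"})
    return session

def replay_request(session: requests.Session, base_url: str, captured: dict):
    """Replay one captured request, dropping stale dynamic headers.
    `captured` is assumed to look like:
    {"method": "GET", "path": "/search?q=shoes", "headers": {...}, "body": None}"""
    headers = {
        key: value
        for key, value in captured.get("headers", {}).items()
        # stale values that must not be replayed verbatim
        if key.lower() not in ("cookie", "authorization", "x-session-id", "host", "content-length")
    }
    return session.request(
        captured["method"],
        base_url + captured["path"],
        headers=headers,
        data=captured.get("body"),
    )
```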

Many testers will choose instead to set up "synthetic" test scenarios, in which simulated users follow a predefined flow. This is where the question "how life-like is the test?" becomes important. Fortunately, the quality of such test scenarios can be measured by comparing them with traffic logs from real usage (the same type of data that would also be used to replay traffic). It's just a matter of observing both real users using the system and simulated users doing the same, and comparing the data.
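
As a concrete example, here is a minimal sketch of such a synthetic scenario using Locust, one popular open-source load testing tool; the target host, endpoints, wait times, and task weights are assumptions for illustration, and the weights are exactly the kind of hypothesis that the comparison below is meant to validate.

```python
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    # Placeholder target system for the test environment.
    host = "https://shop.example.com"

    # Pause 1-5 seconds between actions, roughly mimicking human think time.
    wait_time = between(1, 5)

    # The 3:1 task weights encode the assumed ratio of category navigation
    # to full-text search activity.
    @task(3)
    def browse_category(self):
        self.client.get("/categories/shoes")

    @task(1)
    def search(self):
        self.client.get("/search", params={"q": "red sneakers"})
```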

Using log analysis tools, you can fairly easily compare the two traffic patterns along several dimensions (a sketch follows the list):

  1. URL patterns and actions (cluster URLs and count them, compare their share of total traffic)
  2. "Burstiness" of traffic (how high is the standard deviation of request counts over time?)
  3. Variety of data used, e.g. how many different user IDs or products are used?
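
The sketch below shows one way these three dimensions could be computed, assuming the access log (real or simulated) has already been parsed into (timestamp, URL path, user ID) tuples and that numeric path segments are the only dynamic parts of the URLs.

```python
import re
from collections import Counter
from statistics import pstdev

def url_pattern(path: str) -> str:
    """Cluster URLs by replacing numeric IDs with a placeholder,
    e.g. /products/123 -> /products/{id}; adjust to your URL scheme."""
    return re.sub(r"/\d+", "/{id}", path.split("?")[0])

def traffic_profile(entries, bucket_seconds=60):
    """`entries` is a list of (timestamp_seconds, url_path, user_id) tuples."""
    # 1. Share of total traffic per URL pattern
    patterns = Counter(url_pattern(path) for _, path, _ in entries)
    total = sum(patterns.values())
    shares = {pattern: count / total for pattern, count in patterns.items()}

    # 2. "Burstiness": standard deviation of request counts per time bucket
    buckets = Counter(int(ts // bucket_seconds) for ts, _, _ in entries)
    burstiness = pstdev(buckets.values()) if len(buckets) > 1 else 0.0

    # 3. Variety of data: how many distinct users show up?
    distinct_users = len({user for _, _, user in entries})
    return shares, burstiness, distinct_users

def share_difference(real_shares, simulated_shares):
    """Total absolute difference in traffic share per URL pattern (0 = identical)."""
    patterns = set(real_shares) | set(simulated_shares)
    return sum(abs(real_shares.get(p, 0.0) - simulated_shares.get(p, 0.0)) for p in patterns)
```

Running `traffic_profile` over both logs and comparing the results (for example with `share_difference`) gives a simple, repeatable measure of how life-like the test currently is.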

Using this data, you can incrementally change the load test to approximate reality more closely. Obviously, you will only iterate up to the point where the traffic share of most functionality is reasonably close to real usage and the load test leaves no major gaps.

It's easy to see how machine learning could help with this task in the future: a robot could adjust the parameters of load test scenarios with the goal of minimizing the difference between the simulated traffic pattern and real usage.
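
Even without machine learning, the idea can be sketched as a simple feedback loop that nudges each scenario weight toward the share observed in real traffic. The sketch reuses the `traffic_profile` and `share_difference` helpers from the previous example and assumes a hypothetical `run_load_test(weights)` function that executes the scenario and returns its traffic log.

```python
def tune_weights(real_shares, weights, run_load_test, iterations=10, step=0.5):
    """Iteratively move scenario task weights toward the real traffic shares.
    `weights` maps URL patterns to task weights; `run_load_test` is a
    hypothetical callable that runs the scenario and returns log entries."""
    for _ in range(iterations):
        simulated_entries = run_load_test(weights)
        simulated_shares, _, _ = traffic_profile(simulated_entries)
        if share_difference(real_shares, simulated_shares) < 0.05:
            break  # close enough to real usage
        for pattern, real_share in real_shares.items():
            simulated = simulated_shares.get(pattern, 0.0)
            # Nudge the weight in proportion to the observed gap.
            weights[pattern] = max(0.0, weights.get(pattern, 0.0) + step * (real_share - simulated))
    return weights
```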

Once a realistic distribution of traffic has been achieved in the load test, what remains is to scale the whole load test up - just add more clients triggering the same kinds of actions, and observe where the system breaks as you multiply the number of simulated users. Discover scalability limits, address them, rinse and repeat.
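
Assuming the Locust scenario from the earlier sketch is saved as `scenario.py`, scaling up could look like the loop below, which reruns the test headless with ever more simulated users; the user counts, spawn rate, and run time are arbitrary placeholders.

```python
import subprocess

# Re-run the same scenario with more and more simulated users and keep
# Locust's statistics as CSV files (results_<users>_stats.csv etc.).
for users in (100, 200, 400, 800, 1600):
    result = subprocess.run([
        "locust", "-f", "scenario.py", "--headless",
        "--users", str(users), "--spawn-rate", "50",
        "--run-time", "5m", "--csv", f"results_{users}",
    ])
    if result.returncode != 0:
        # By default Locust exits non-zero when requests failed - a first
        # hint that a scalability limit was reached at this load level.
        print(f"Errors observed at {users} simulated users")
        break
```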

See "10 steps for fixing scalability issues in large-scale web applications" for an overview of methods to address scalability issues.