Slow Cooker: Load testing for tough software

Linkerd, our service mesh for cloud-native applications, needs to handle very high volumes of production traffic over extended periods of time. In this post, we’ll describe the load testing strategies and tools we use to ensure Linkerd can meet this goal. We’ll review some of the problems we faced when trying to use popular load testers. Finally, we’ll introduce slow_cooker, an open source load tester written in Go, which is designed for long-running load tests and lifecycle issue identification.

As a service mesh, Linkerd acts as a transparent proxy, taking requests destined for a particular service and adding connection pooling, failure handling, retries, latency-aware load balancing, and more. In order to be a viable production system, Linkerd needs to handle very high request loads over long periods of time and a variety of conditions. Happily, Linkerd is built on top of netty and Finagle, some of the most widely tested and production-vetted network code known to man. But code is one thing; performance in the real world is another.

To assess production behavior, Linkerd needs to be vetted by extensive and rigorous load testing. Furthremore, since Linkerd is a part of underlying infrastructure, Linkerd instances are rarely stopped or restarted—a single Linkerd instance may see billions of requests over a variety of client and service behaviors. This means we must also test for lifecycle issues. For high-throughput network servers like Linkerd, lifecycle issues include memory leaks, socket leaks, bad GC pauses, and periodic network or disk saturation. While these things happen infrequently, if they aren’t handled properly, the results can be catastrophic.

Who tests the testers?

Early on in Linkerd development, we used popular load testers like ApacheBench and hey. (Of course, these are HTTP-specific, and Linkerd proxies a variety of protocols, including Thrift, gRPC, and Mux—but we needed to start somewhere.)

Unfortunately, we quickly found that while these tools were great for getting a quick read on performance, they weren’t great for identifying the lifecycle issues we wanted to capture. These tools would provide a single end-of-run summary, which could mask real issues. They also relied on means and standard deviations, which we knew was a problematic way to characterize system performance.

For capturing lifecycle issues, we wanted both better metrics and to the ability to see how Linkerd was performing over long tests runs of hours or days rather than minutes.

Slow cooking for tender code

Since we couldn’t find a tool that did what we needed, we built one: slow_cooker. slow_cooker is a load tester designed explicitly for long-running load tests to identify lifecycle issues. We use slow_cooker extensively to find performance issues and test changes in our products. It features incremental progress reports, change detection, and comprehensive metrics.

Today, we’re open sourcing slow_cooker for others to use and contribute to. You can check out the source on GitHub or try out the recently released 1.0 version.

Let’s take a look at some of slow_cooker’s features.

(For the sake of simplicity, we’ll show the output of slow_cooker when we change performance characteristics of the downstream services. In practice, of course, we use slow_cooker primarily to identify problems with Linkerd, not the services it’s talking to.)

Incremental latency reports

slow_cooker has an incremental reporting approach, motivated by our focus on finding lifecycle issues over a long period of time. Too much can get lost when looking at an aggregate report over a very large amount of data—especially for transient issues like GC pressure or network saturation. With incremental reports, we can see throughput and latency trends or changes in a running system.

In the example below, we show slow_cooker output from load testing Linkerd. Our test scenario has Linkerd load balancing across 3 nginx backends, each serving static content. The latencies given are in milliseconds, and we report the min, p50, p95, p99, p999, and max latencies seen during this 10 second interval.

$ ./slow_cooker_linux_amd64 -url http://target:4140 -qps 50 -concurrency 10 http://perf-target-2:8080
    # sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
    #                      good/b/f t     good%   min [p50 p95 p99  p999]  max change
    2016-10-12T20:34:20Z   4990/0/0 5000  99% 10s   0 [  1   3   4    9 ]    9
    2016-10-12T20:34:30Z   5020/0/0 5000 100% 10s   0 [  1   3   6   11 ]   11
    2016-10-12T20:34:40Z   5020/0/0 5000 100% 10s   0 [  1   3   7   10 ]   10
    2016-10-12T20:34:50Z   5020/0/0 5000 100% 10s   0 [  1   3   5    8 ]    8
    2016-10-12T20:35:00Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
    2016-10-12T20:35:11Z   5020/0/0 5000 100% 10s   0 [  1   3   5   11 ]   11
    2016-10-12T20:35:21Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
    2016-10-12T20:36:11Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
    2016-10-12T20:36:21Z   5020/0/0 5000 100% 10s   0 [  1   3   6    9 ]    9
    2016-10-12T20:35:31Z   5019/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
    2016-10-12T20:35:41Z   5020/0/0 5000 100% 10s   0 [  1   3   6   10 ]   10
    2016-10-12T20:35:51Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
    2016-10-12T20:36:01Z   5020/0/0 5000 100% 10s   0 [  1   3   5   10 ]   10

In this report, good% measures throughput: how close we’re getting to the requested RPS (requests per second).

This report looks good—the system is fast and response times are stable. When things go bad, however, we need that fact to come across clearly. We designed slow_cooker’s output to make it easy to visually scan for issues and outliers by using vertical alignment and a change indicator helps us to spot outliers in latency. In the example below, we have a backend server suffering from a catastrophic slow down:

$ ./slow_cooker_linux_amd64 -totalRequests 100000 -qps 5 -concurrency 100 http://perf-target-1:8080
    # sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
    #                      good/b/f t     good%   min [p50 p95 p99  p999]  max change
    2016-11-14T20:58:13Z   4900/0/0 5000  98% 10s   0 [  1   2   6    8 ]    8 +
    2016-11-14T20:58:23Z   5026/0/0 5000 100% 10s   0 [  1   2   3    4 ]    4
    2016-11-14T20:58:33Z   5017/0/0 5000 100% 10s   0 [  1   2   3    4 ]    4
    2016-11-14T20:58:43Z   1709/0/0 5000  34% 10s   0 [  1 6987 6987 6987 ] 6985 +++
    2016-11-14T20:58:53Z   5020/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3 --
    2016-11-14T20:59:03Z   5018/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3 --
    2016-11-14T20:59:13Z   5010/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3 --
    2016-11-14T20:59:23Z   4985/0/0 5000  99% 10s   0 [  1   2   2    3 ]    3 --
    2016-11-14T20:59:33Z   5015/0/0 5000 100% 10s   0 [  1   2   3    4 ]    4 --
    2016-11-14T20:59:43Z   5000/0/0 5000 100% 10s   0 [  1   2   3    5 ]    5
    2016-11-14T20:59:53Z   5000/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3
    FROM    TO #REQUESTS
       0     2 49159
       2     8 4433
       8    32 8
      32    64 0
      64   128 0
     128   256 0
     256   512 0
     512  1024 0
    1024  4096 0
    4096 16384 100

As you can see, the system is fast and responsive except for a hiccup at 2016-11-14T20:58:43Z. During this hiccup, our throughput dropped to 34% and then returned to normal. As a service owner, you’d want to look into your logs or performance metrics and investigate the root cause.

Lifecycle issue example: GC pause

In order to demonstrate how incremental reporting can provide benefits over a single final report, let’s do a simulation of a backend service having GC trouble. In this example, we’ll test directly against a single nginx process serving static content, and in a loop we’ll continually pause and then unpause the process at 5 second intervals (using kill -STOP $PID and kill -CONT $pid).

For comparison, let’s start with a ApacheBench’s report:

$ ab -n 100000 -c 10 http://perf-target-1:8080/
    This is ApacheBench, Version 2.3
    Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Licensed to The Apache Software Foundation, http://www.apache.org/

    Benchmarking perf-target-1 (be patient)
    Completed 10000 requests
    Completed 20000 requests
    Completed 30000 requests
    Completed 40000 requests
    Completed 50000 requests
    Completed 60000 requests
    Completed 70000 requests
    Completed 80000 requests
    Completed 90000 requests
    Completed 100000 requests
    Finished 100000 requests

    Server Software:        nginx/1.9.12
    Server Hostname:        perf-target-1
    Server Port:            8080

    Document Path:          /
    Document Length:        612 bytes

    Concurrency Level:      10
    Time taken for tests:   15.776 seconds
    Complete requests:      100000
    Failed requests:        0
    Total transferred:      84500000 bytes
    HTML transferred:       61200000 bytes
    Requests per second:    6338.89 [#/sec] (mean)
    Time per request:       1.578 [ms] (mean)
    Time per request:       0.158 [ms] (mean, across all concurrent requests)
    Transfer rate:          5230.83 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.2      0       3
    Processing:     0    1  64.3      0    5003
    Waiting:        0    1  64.3      0    5003
    Total:          0    2  64.3      1    5003

    Percentage of the requests served within a certain time (ms)
      50%      1
      66%      1
      75%      1
      80%      1
      90%      1
      95%      1
      98%      1
      99%      2
     100%   5003 (longest request)

Here we see mean latency is 1.5ms, but some outliers have high latency. It would be easy to misread this report as healthy even though the backend service is unresponsive for fully half of the test run. If your target SLA is 1 second, then your service is out of SLA for more than half of the test run—but you might never suspect that from this report!

With slow_cooker’s incremental results, however, we can see that there’s a consistent throughput bottleneck that needs deeper investigation. Also, it becomes much more clear that the 99.9th percentile is consistently high; this is not just a few outliers, but a persistent and ongoing problem:

$ ./slow_cooker_linux_amd64 -totalRequests 20000 -qps 50 -concurrency 10 http://perf-target-2:8080
    # sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
    #                      good/b/f t    good%    min [p50 p95 p99  p999]  max change
    2016-12-07T19:05:37Z   2510/0/0 5000  50% 10s   0 [  0   0   2 4995 ] 4994 +
    2016-12-07T19:05:47Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4999 ] 4997 +
    2016-12-07T19:05:57Z   2519/0/0 5000  50% 10s   0 [  0   0   1 5003 ] 5000 +
    2016-12-07T19:06:07Z   2521/0/0 5000  50% 10s   0 [  0   0   1 4983 ] 4983 +
    2016-12-07T19:06:17Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4987 ] 4986
    2016-12-07T19:06:27Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4991 ] 4988
    2016-12-07T19:06:37Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4995 ] 4992
    2016-12-07T19:06:47Z   2520/0/0 5000  50% 10s   0 [  0   0   2 4995 ] 4994
    FROM    TO #REQUESTS
       0     2 19996
       2     8 74
       8    32 0
      32    64 0
      64   128 0
     128   256 0
     256   512 0
     512  1024 0
    1024  4096 0
    4096 16384 80

Percentile-based latency reporting

As we see from the ApacheBench example above, some load testing tools will only output average and standard deviation. However, these metrics are usually inappropriate for system latencies. Latency does not follow a standard distribution, and often has very long tails. With slow_cooker, we discard mean and stddev entirely, showing instead the minimum, maximum, and a handful of higher-order percentiles (50th, 95th, 99th, and 99.9th). This approach has seen increased adoption in modern software systems, where a single request can result in dozens or even hundreds of queries to other systems. In these situations, metrics like the 95th and 99th percentiles represent the dominant latency for end users.

Conclusion

Although writing a load generator is not ultimately particularly difficult, especially with modern, concurrent, network-oriented languages like Go, the details of reporting and measuring can make a significant difference in the utility of the tool.

Today, we use slow_cooker extensively to test Linkerd as well as other projects in the ecosystem (e.g. nginx). We currently run 24x7 tests against Linkerd in the context of complex multi-service software. slow_cooker has helped us not only keep buggy code from being deployed, but it has also identified performance problems in existing code. Usage of slow_cooker has become so pervasive at Buoyant that we refer to load testing a piece of software as “slow cooking” it.

You can get started using slow_cooker today by visiting the Github releases page. Download the tool and fire it at your favorite backend to start vetting it for performance issues. We hope you’ll find it as useful in your setup as we have in our tests of Linkerd.

Further reading

  1. Throughput-Delay Curves
  2. Frequency Trails: What the Mean Really Means
  3. Simulating Byzantine Failure with SIGSTOP
  4. Everything You Know About Latency Is Wrong
  5. How NOT to measure latency