Debugging HTTP applications with per-route metrics

This demo is of a Ruby application that helps you manage your bookshelf. It consists of multiple microservices and uses JSON over HTTP to communicate with the other services. There are three services:

webapp: the frontend
authors: an API to manage the authors in the system
books: an API to manage the books in the system

For demo purposes, the app comes with a simple traffic generator. The overall topology looks like this:

Prerequisites

To use this guide, you’ll need to have Linkerd installed on your cluster. Follow the Installing Linkerd Guide if you haven’t already done this.

Install the app

To get started, let’s install the books app onto your cluster. In your local terminal, run:

kubectl create ns booksapp && \
  curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/booksapp.yml \
  | kubectl -n booksapp apply -f -

This command creates a namespace for the demo, downloads its Kubernetes resource manifest and uses kubectl to apply it to your cluster. The app comprises the Kubernetes deployments and services that run in the booksapp namespace.

Downloading a bunch of containers for the first time takes a little while. Kubernetes can tell you when all the services are running and ready for traffic. Wait for that to happen by running:

kubectl -n booksapp rollout status deploy webapp

You can also take a quick look at all the components that were added to your cluster by running:

kubectl -n booksapp get all

Once the rollout has completed successfully, you can access the app itself by port-forwarding webapp locally:

kubectl -n booksapp port-forward svc/webapp 7000 >/dev/null &

(We redirect to /dev/null just so you don’t get flooded with “Handling connection” messages for the rest of the exercise.)

Open http://localhost:7000/ in your browser to see the frontend.

Unfortunately, there is an error in the app: if you click Add Book, it will fail 50% of the time. This is a classic case of non-obvious, intermittent failure—the type that drives service owners mad because it is so difficult to debug. Kubernetes itself cannot detect or surface this error. From Kubernetes’s perspective, it looks like everything’s fine, but you know the application is returning errors.

Add Linkerd to the service

Now we need to add the Linkerd data plane proxies to the service. The easiest option is to do something like this:

kubectl get -n booksapp deploy -o yaml \
  | linkerd inject - \
  | kubectl apply -f -

This command retrieves the manifest of all deployments in the booksapp namespace, runs them through linkerd inject, and then re-applies with kubectl apply. The linkerd inject command annotates each resource to specify that they should have the Linkerd data plane proxies added, and Kubernetes does this when the manifest is reapplied to the cluster. Best of all, since Kubernetes does a rolling deploy, the application stays running the entire time. (See Automatic Proxy Injection for more details on how this works.)

Debugging

Let’s use Linkerd to discover the root cause of this app’s failures. We can use the stat-inbound command to see the success rate of the webapp deployment:

linkerd viz -n booksapp stat-inbound deploy/webapp
NAME    SERVER          ROUTE      TYPE  SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99
webapp  [default]:4191  [default]        100.00%  0.30          4ms          9ms         10ms
webapp  [default]:4191  probe            100.00%  0.60          0ms          1ms          1ms
webapp  [default]:7000  probe            100.00%  0.30          2ms          2ms          2ms
webapp  [default]:7000  [default]         75.66%  8.22         18ms         65ms         93ms

This shows us inbound traffic statistics. In other words, we see that the webapp is receiving 8.22 requests per second on port 7000 and that only 75.66% of those requests are successful.

To dig into this further and find the root cause, we can look at the webapp’s outbound traffic. This will tell us about the requests that the webapp makes to other services.

linkerd viz -n booksapp stat-outbound deploy/webapp
NAME    SERVICE       ROUTE      TYPE       BACKEND       SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
webapp  books:7002    [default]                            77.36%  7.95         25ms         48ms        176ms     0.00%    0.00%
                      └──────────────────►  books:7002     77.36%  7.95         15ms         44ms         64ms     0.00%
webapp  authors:7001  [default]                           100.00%  3.53         26ms         72ms        415ms     0.00%    0.00%
                      └──────────────────►  authors:7001  100.00%  3.53         16ms         52ms         91ms     0.00%

We see that webapp sends traffic to both the books service and the authors service and that the problem seems to be with the traffic to the books service.

HTTPRoute

We know that the webapp component is getting failures from the books component, but it would be great to narrow this down further and get per route metrics. To do this, we take advantage of the Gateway API and define a set of HTTPRoute resources, each attached to the books Service by specifying it as their parent_ref.

kubectl apply -f - <<EOF
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: books-list
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      group: core
      kind: Service
      port: 7002
  rules:
    - matches:
        - path:
            type: Exact
            value: "/books.json"
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: books-create
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      group: core
      kind: Service
      port: 7002
  rules:
    - matches:
        - path:
            type: Exact
            value: "/books.json"
          method: POST
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: books-delete
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      group: core
      kind: Service
      port: 7002
  rules:
    - matches:
        - path:
            type: RegularExpression
            value: "/books/\\\d+.json"
          method: DELETE
EOF

We can then check that these HTTPRoutes have been accepted by their parent Service by checking their status subresource:

kubectl -n booksapp get httproutes.gateway.networking.k8s.io \
  -ojsonpath='{.items[*].status.parents[*].conditions[*]}' | jq .

Notice that the Accepted and ResolvedRefs conditions are True.

{
  "lastTransitionTime": "2024-08-03T01:38:25Z",
  "message": "",
  "reason": "Accepted",
  "status": "True",
  "type": "Accepted"
}
{
  "lastTransitionTime": "2024-08-03T01:38:25Z",
  "message": "",
  "reason": "ResolvedRefs",
  "status": "True",
  "type": "ResolvedRefs"
}
[...]

With those HTTPRoutes in place, we can look at the outbound stats again:

linkerd viz -n booksapp stat-outbound deploy/webapp
NAME    SERVICE       ROUTE         TYPE       BACKEND       SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
webapp  authors:7001  [default]                              100.00%  2.80         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  authors:7001  100.00%  2.80         16ms         45ms         49ms     0.00%
webapp  books:7002    books-list    HTTPRoute                100.00%  1.43         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  1.43         12ms         24ms         25ms     0.00%
webapp  books:7002    books-create  HTTPRoute                 54.27%  2.73         27ms        207ms        441ms     0.00%    0.00%
                      └─────────────────────►  books:7002     54.27%  2.73         14ms        152ms        230ms     0.00%
webapp  books:7002    books-delete  HTTPRoute                100.00%  0.72         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  0.72         12ms         24ms         25ms     0.00%

This tells us that it is requests to the books-create HTTPRoute which have been failing.

Retries

As it can take a while to update code and roll out a new version, let’s tell Linkerd that it can retry requests to the failing endpoint. This will increase request latencies, as requests will be retried multiple times, but not require rolling out a new version. Add a retry annotation to the books-create HTTPRoute which tells Linkerd to retry on 5xx responses:

kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/http=5xx

We can then see the effect of these retries:

linkerd viz -n booksapp stat-outbound deploy/webapp
NAME    SERVICE       ROUTE         TYPE       BACKEND       SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
webapp  books:7002    books-create  HTTPRoute                 73.17%  2.05         98ms        460ms        492ms     0.00%   34.22%
                      └─────────────────────►  books:7002     48.13%  3.12         29ms         93ms         99ms     0.00%
webapp  books:7002    books-list    HTTPRoute                100.00%  1.50         25ms         48ms         49ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  1.50         12ms         24ms         25ms     0.00%
webapp  books:7002    books-delete  HTTPRoute                100.00%  0.73         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  0.73         12ms         24ms         25ms     0.00%
webapp  authors:7001  [default]                              100.00%  2.98         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  authors:7001  100.00%  2.98         16ms         44ms         49ms     0.00%

Notice that while the success rate of individual requests to the books backend on the books-create route only have a success rate of about 50%, the overall success rate on that route has been raised to 73% due to retries. We can also see that 34.22% of the requests on this route are retries and that the improved success rate has come at the expense of additional RPS to the backend and increased overall latency.

By default, Linkerd will only attempt 1 retry per failure. We can improve success rate further by increasing this limit to allow more than 1 retry per request:

kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/limit=3

Looking at the stats again:

linkerd viz -n booksapp stat-outbound deploy/webapp
NAME    SERVICE       ROUTE         TYPE       BACKEND       SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
webapp  books:7002    books-delete  HTTPRoute                100.00%  0.75         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  0.75         12ms         24ms         25ms     0.00%
webapp  authors:7001  [default]                              100.00%  2.92         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  authors:7001  100.00%  2.92         18ms         46ms         49ms     0.00%
webapp  books:7002    books-create  HTTPRoute                 92.78%  1.62        111ms        461ms        492ms     0.00%   47.28%
                      └─────────────────────►  books:7002     48.91%  3.07         42ms        179ms        236ms     0.00%
webapp  books:7002    books-list    HTTPRoute                100.00%  1.45         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  1.45         12ms         24ms         25ms     0.00%

We see that these additional retries have increased the overall success rate on this route to 92.78%.

Timeouts

Linkerd can limit how long to wait before failing outgoing requests to another service. For the purposes of this demo, let’s set a 15ms timeout for calls to the books-create route:

kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
timeout.linkerd.io/request=15ms

(You may need to adjust the timeout value depending on your cluster – 15ms should definitely show some timeouts, but feel free to raise it if you’re getting so many that it’s hard to see what’s going on!)

We can see the effects of this timeout by running:

linkerd viz -n booksapp stat-outbound deploy/webapp
NAME    SERVICE       ROUTE         TYPE       BACKEND       SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
webapp  authors:7001  [default]                              100.00%  2.85         26ms         49ms        370ms     0.00%    0.00%
                      └─────────────────────►  authors:7001  100.00%  2.85         19ms         49ms         86ms     0.00%
webapp  books:7002    books-create  HTTPRoute                 78.90%  1.82         45ms        449ms        490ms    21.10%   47.34%
                      └─────────────────────►  books:7002     41.55%  3.45         24ms        134ms        227ms    11.11%
webapp  books:7002    books-list    HTTPRoute                100.00%  1.40         25ms         47ms         49ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  1.40         12ms         24ms         25ms     0.00%
webapp  books:7002    books-delete  HTTPRoute                100.00%  0.70         25ms         48ms         50ms     0.00%    0.00%
                      └─────────────────────►  books:7002    100.00%  0.70         12ms         24ms         25ms     0.00%

We see that 21.10% of the requests are hitting this timeout.

Clean Up

To remove the books app and the booksapp namespace from your cluster, run:

kubectl delete ns booksapp