
Debugging HTTP applications with per-route metrics

This demo uses a Ruby application that helps you manage your bookshelf. It consists of multiple microservices that communicate with each other using JSON over HTTP. There are three services:

  • webapp: the frontend

  • authors: an API to manage the authors in the system

  • books: an API to manage the books in the system

For demo purposes, the app comes with a simple traffic generator. The overall topology looks like this:

Topology

Prerequisites

To use this guide, you’ll need to have Linkerd installed on your cluster. Follow the Installing Linkerd Guide if you haven’t already done this.

Install the app

To get started, let’s install the books app onto your cluster. In your local terminal, run:

kubectl create ns booksapp && \
  curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/booksapp.yml \
  | kubectl -n booksapp apply -f -

This command creates a namespace for the demo, downloads its Kubernetes resource manifest and uses kubectl to apply it to your cluster. The app comprises the Kubernetes deployments and services that run in the booksapp namespace.

Downloading the container images for the first time can take a little while. Kubernetes can tell you when all the services are running and ready for traffic. Wait for that to happen by running:

kubectl -n booksapp rollout status deploy webapp

You can also take a quick look at all the components that were added to your cluster by running:

kubectl -n booksapp get all

Once the rollout has completed successfully, you can access the app itself by port-forwarding webapp locally:

kubectl -n booksapp port-forward svc/webapp 7000 >/dev/null &

(We redirect to /dev/null just so you don’t get flooded with “Handling connection” messages for the rest of the exercise.)

Open http://localhost:7000/ in your browser to see the frontend.

Frontend

Unfortunately, there is an error in the app: if you click Add Book, it will fail 50% of the time. This is a classic case of non-obvious, intermittent failure—the type that drives service owners mad because it is so difficult to debug. Kubernetes itself cannot detect or surface this error. From Kubernetes’s perspective, it looks like everything’s fine, but you know the application is returning errors.

Failure

Add Linkerd to the service

Now we need to add the Linkerd data plane proxies to the service. The easiest option is to do something like this:

kubectl get -n booksapp deploy -o yaml \
  | linkerd inject - \
  | kubectl apply -f -

This command retrieves the manifest of all deployments in the booksapp namespace, runs them through linkerd inject, and then re-applies with kubectl apply. The linkerd inject command annotates each resource to specify that they should have the Linkerd data plane proxies added, and Kubernetes does this when the manifest is reapplied to the cluster. Best of all, since Kubernetes does a rolling deploy, the application stays running the entire time. (See Automatic Proxy Injection for more details on how this works.)
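Concretely, linkerd inject doesn't modify your containers itself; it adds an annotation to each workload's pod template, and the proxy-injector admission webhook does the rest on the next rollout. The resulting pod template looks roughly like this (unrelated fields elided):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Added by `linkerd inject`; tells the proxy-injector admission
        # webhook to add the Linkerd proxy container to new pods.
        linkerd.io/inject: enabled
```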

Debugging

Let’s use Linkerd to discover the root cause of this app’s failures. Linkerd’s proxy exposes rich metrics about the traffic it processes, including HTTP response codes. The metric we’re interested in is outbound_http_route_backend_response_statuses_total, which will help us identify where HTTP errors are occurring. We can use the linkerd diagnostics proxy-metrics command to get proxy metrics. Pick one of your webapp pods and run the following command to get the metrics for HTTP 500 responses:

linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_backend_response_statuses_total \
| grep http_status=\"500\"

This should return a metric that looks something like:

outbound_http_route_backend_response_statuses_total{
  parent_group="core",
  parent_kind="Service",
  parent_namespace="booksapp",
  parent_name="books",
  parent_port="7002",
  parent_section_name="",
  route_group="",
  route_kind="default",
  route_namespace="",
  route_name="http",
  backend_group="core",
  backend_kind="Service",
  backend_namespace="booksapp",
  backend_name="books",
  backend_port="7002",
  backend_section_name="",
  http_status="500",
  error=""
} 207

This counter tells us that the webapp pod received a total of 207 HTTP 500 responses from the books Service on port 7002.
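Raw counter values are most useful when related to the total number of responses. As a rough, self-contained sketch (the metrics.txt sample below is a stand-in for real linkerd diagnostics proxy-metrics output, with the label sets heavily abbreviated), you can total the counters with awk:

```shell
# Stand-in for saved `linkerd diagnostics proxy-metrics` output; real
# lines carry many more labels than shown here.
cat > metrics.txt <<'EOF'
outbound_http_route_backend_response_statuses_total{backend_name="books",http_status="500"} 207
outbound_http_route_backend_response_statuses_total{backend_name="books",http_status="200"} 1042
EOF

# Sum the 500s against all responses and print an error ratio.
awk '/outbound_http_route_backend_response_statuses_total/ {
  if ($0 ~ /http_status="500"/) errs += $NF
  total += $NF
}
END { printf "500s: %d (%.1f%% of %d responses)\n", errs, 100*errs/total, total }' metrics.txt
# → 500s: 207 (16.6% of 1249 responses)
```

Remember these are cumulative counters since proxy start; to get a rate, sample twice and diff, or scrape them into Prometheus.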

HTTPRoute

We know that the webapp component is getting 500s from the books component, but it would be great to narrow this down further and get per-route metrics. To do this, we take advantage of the Gateway API and define a set of HTTPRoute resources, each attached to the books Service by specifying it in their parentRefs.

kubectl apply -f - <<EOF
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: books-list
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      group: core
      kind: Service
      port: 7002
  rules:
    - matches:
        - path:
            type: Exact
            value: "/books.json"
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: books-create
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      group: core
      kind: Service
      port: 7002
  rules:
    - matches:
        - path:
            type: Exact
            value: "/books.json"
          method: POST
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: books-delete
  namespace: booksapp
spec:
  parentRefs:
    - name: books
      group: core
      kind: Service
      port: 7002
  rules:
    - matches:
        - path:
            type: RegularExpression
            value: "/books/\\\d+.json"
          method: DELETE
EOF

We can then check that these HTTPRoutes have been accepted by their parent Service by checking their status subresource:

kubectl -n booksapp get httproutes.gateway.networking.k8s.io \
  -ojsonpath='{.items[*].status.parents[*].conditions[*]}' | jq .

Notice that the Accepted and ResolvedRefs conditions are True.

{
  "lastTransitionTime": "2024-08-03T01:38:25Z",
  "message": "",
  "reason": "Accepted",
  "status": "True",
  "type": "Accepted"
}
{
  "lastTransitionTime": "2024-08-03T01:38:25Z",
  "message": "",
  "reason": "ResolvedRefs",
  "status": "True",
  "type": "ResolvedRefs"
}
[...]

With those HTTPRoutes in place, we can look at the outbound_http_route_backend_response_statuses_total metric again, and see that the route labels have been populated:

linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_backend_response_statuses_total \
| grep http_status=\"500\"
outbound_http_route_backend_response_statuses_total{
  parent_group="core",
  parent_kind="Service",
  parent_namespace="booksapp",
  parent_name="books",
  parent_port="7002",
  parent_section_name="",
  route_group="gateway.networking.k8s.io",
  route_kind="HTTPRoute",
  route_namespace="booksapp",
  route_name="books-create",
  backend_group="core",
  backend_kind="Service",
  backend_namespace="booksapp",
  backend_name="books",
  backend_port="7002",
  backend_section_name="",
  http_status="500",
  error=""
} 212

This tells us that it is requests to the books-create HTTPRoute that have been failing.

Retries

As it can take a while to update code and roll out a new version, let’s tell Linkerd that it can retry requests to the failing endpoint. This will increase request latencies, since failed requests will be retried, but it doesn’t require rolling out a new version. Add a retry annotation to the books-create HTTPRoute that tells Linkerd to retry on 5xx responses:

kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/http=5xx

We can then see the effect of these retries by looking at Linkerd’s retry metrics:

linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_backend_response_statuses_total \
| grep retry
outbound_http_route_retry_limit_exceeded_total{...} 222
outbound_http_route_retry_overflow_total{...} 0
outbound_http_route_retry_requests_total{...} 469
outbound_http_route_retry_successes_total{...} 247

This tells us that Linkerd made a total of 469 retry requests, of which 247 were successful. The remaining 222 failed and could not be retried again, since we didn’t raise the retry limit from its default of 1.
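The arithmetic behind these counters is worth spelling out: every retried request either eventually succeeds or exhausts the retry limit, so the two outcome counters should sum to the request counter. A quick sanity check with the sample numbers above (your values will differ):

```shell
# Counter values from the sample output above.
retries=469; successes=247; limit_exceeded=222

# Every retried request lands in one of the two outcome buckets.
echo $(( successes + limit_exceeded ))   # → 469, matching retry_requests_total

# Roughly half of the retries rescued an otherwise-failed request.
echo "$(( 100 * successes / retries ))% of retries succeeded"   # → 52% of retries succeeded
```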

We can improve this further by raising the limit to allow more than one retry per request:

kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/limit=3

Over time you will see outbound_http_route_retry_requests_total and outbound_http_route_retry_successes_total increase at a much higher rate than outbound_http_route_retry_limit_exceeded_total.

Timeouts

Linkerd can limit how long to wait before failing outgoing requests to another service. For the purposes of this demo, let’s set a 15ms timeout for calls to the books-create route:

kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
timeout.linkerd.io/request=15ms

(You may need to adjust the timeout value depending on your cluster – 15ms should definitely show some timeouts, but feel free to raise it if you’re getting so many that it’s hard to see what’s going on!)

We can see the effects of this timeout by running:

linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
| grep outbound_http_route_request_statuses_total | grep books-create
outbound_http_route_request_statuses_total{
  [...]
  route_name="books-create",
  http_status="",
  error="REQUEST_TIMEOUT"
} 151
outbound_http_route_request_statuses_total{
  [...]
  route_name="books-create",
  http_status="201",
  error=""
} 5548
outbound_http_route_request_statuses_total{
  [...]
  route_name="books-create",
  http_status="500",
  error=""
} 3194

Clean Up

To remove the books app and the booksapp namespace from your cluster, run:

kubectl delete ns booksapp