Workshop recap: Dynamic Request Routing and Circuit Breaking

This blog post is based on a workshop that I recently delivered at Buoyant’s Service Mesh Academy. If this seems interesting, check out the full recording!

Linkerd 2.13 adds two long-requested features to Linkerd: dynamic request routing and circuit breaking.

  • Dynamic request routing permits HTTP routing based on headers, HTTP method, etc.

  • Circuit breaking is a resilience feature that allows Linkerd to stop sending requests to endpoints that fail too much.

While Linkerd 2.12 has been able to do some dynamic request routing, Linkerd 2.13 expands quite a bit on the feature. Circuit breaking is completely new in Linkerd 2.13.

Dynamic Request Routing

In Linkerd 2.11 and earlier, the only mechanism for any sort of dynamic routing used the SMI TrafficSplit resource and the linkerd-smi extension to provide coarse-grained routing based on the service name and the desired percentage of traffic to be split. For example:

  • Progressive delivery: 1% of the requests to the foo workload are sent to a new version (foo-new), while the remaining 99% continue to be routed to the original version. If all goes well, the percentages are shifted over time until all the requests are going to foo-new.

  • Multi-cluster/failover: All of the requests to the foo workload get routed to a different cluster via a mirrored foo-west Service.

Linkerd 2.12 introduced support for basic header-based routing, using the HTTPRoute CRD from the Gateway API. This allowed for routing based on the value of a header, but it didn’t support weighted routing at the same time.

Dynamic request routing in Linkerd 2.13 brings these two worlds together using the HTTPRoute CRD, and expands it further by supporting weighted routing based on request headers, verbs, or other attributes of the request (though not the request body). This is much more powerful than what was possible with 2.12 and earlier. For example:

  • Progressive delivery is possible without using the linkerd-smi extension at all.

  • Progressive delivery can be combined with header-based routing, for example per-user canaries: use a header to select a particular group of users, then canary only that group of users using a new version of a workload. This enables early rollout of a new feature only for a specific group of users, while most users continue to use the stable version.

  • A/B testing anywhere in the call graph: Since dynamic request routing permits separating traffic based on headers or verbs, it’s possible to split users into multiple groups and route each group to a distinct version of a workload. This allows for experimentation and comparison of different implementations or features.

The Gateway API is a Kubernetes SIG-Networking project started in 2020, primarily to address the challenges related to the proliferation of annotations in use on the Ingress resource. In 2022, the Gateway API project began the GAMMA (Gateway API for Mesh Management and Administration) initiative to explore how to use the Gateway API for mesh networking. Linkerd is an active participant in both efforts: the power and flexibility of the Gateway API makes it easier to expand Linkerd’s capabilities while maintaining its overall best-in-class operational simplicity.

One important caveat, though: the Gateway API was originally designed to manage ingress traffic (traffic from outside the cluster coming in), so its conformance tests are not yet well-suited to service meshes, and Linkerd can’t yet be fully conformant with the Gateway API. For this reason, Linkerd uses the HTTPRoute resource in the policy.linkerd.io API group, rather than the official Gateway API group. There’s work actively underway to improve this situation.

Dynamic Request Routing Examples

First, a simple canary example. This example does a 50/50 split for requests to the color Service, routing half to the endpoints behind the actual color Service and half to those behind the color2 Service.

apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
  name: color-canary
  namespace: faces
spec:
  parentRefs:
    - name: color
      kind: Service
      group: core
      port: 80 # Match port numbers with what’s in the Service resource
  rules:
    - backendRefs: # Both backends live in the same rule so the weights split traffic between them
        - name: color
          port: 80
          weight: 50 # Adjust the weights to control balancing
        - name: color2
          port: 80
          weight: 50

I’m being careful here about the distinction between a Service and the endpoints behind the Service, because the HTTPRoute acts on requests sent to a particular Service, routing them to endpoints behind a Service. This is why having color in the parentRefs stanza and also in one of the backendRefs stanzas works without creating a loop.
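
As a canary progresses, only the weights need to change. The sketch below shows what a later stage of the same rollout might look like; the 10/90 split is an arbitrary illustration rather than a recommended schedule, and everything else mirrors the route above.

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
  name: color-canary
  namespace: faces
spec:
  parentRefs:
    - name: color
      kind: Service
      group: core
      port: 80
  rules:
    - backendRefs:
        - name: color
          port: 80
          weight: 10 # illustrative: the original version now gets about 10% of requests
        - name: color2
          port: 80
          weight: 90 # illustrative: the new version gets about 90%
```

Weights are relative to one another within a single rule, so the two values do not have to add up to 100.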

Next, an example of A/B testing. Requests sent to the smiley Service with the header

X-Faces-User: testuser

get routed to endpoints behind the smiley2 Service, while other requests continue on to endpoints behind the smiley Service.

apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
  name: smiley-a-b
  namespace: faces
spec:
  parentRefs:
    - name: smiley
      kind: Service
      group: core
      port: 80
  rules:
    - matches:
        - headers:
            - name: "x-faces-user" # X-Faces-User: testuser goes to smiley2
              value: "testuser"
      backendRefs:
        - name: smiley2
          port: 80
    - backendRefs:
        - name: smiley
          port: 80

One critical point about the A/B test: Linkerd can do dynamic request routing anywhere, but of course if you want to route on a header, you need to make sure that header is present at the place you want to use it for routing! This may mean that you need to be careful to propagate headers through the various workloads of your application.
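
The per-user canary described earlier combines the two examples above: a header match selects the group of users, and weights split only that group's traffic. The following is a sketch under assumptions, not a tested configuration: the route name smiley-user-canary and the 90/10 weights are hypothetical, while the Services and header are reused from the A/B example.

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
  name: smiley-user-canary # hypothetical name
  namespace: faces
spec:
  parentRefs:
    - name: smiley
      kind: Service
      group: core
      port: 80
  rules:
    # Requests from the selected user group are canaried between smiley and smiley2
    - matches:
        - headers:
            - name: "x-faces-user"
              value: "testuser"
      backendRefs:
        - name: smiley
          port: 80
          weight: 90 # illustrative weights
        - name: smiley2
          port: 80
          weight: 10
    # Everyone else stays on the stable version
    - backendRefs:
        - name: smiley
          port: 80
```

The second rule has no matches stanza, so it acts as the catch-all for requests that don't carry the header.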

You can find more details about dynamic request routing in its documentation, at https://linkerd.io/2/tasks/configuring-dynamic-request-routing/.

Circuit Breaking

Circuit breaking is new to Linkerd 2.13, but it’s been long requested by users. It’s a mechanism to try to avoid overwhelming a failing workload endpoint with additional traffic:

  • A workload endpoint starts to fail.
  • Linkerd detects failures from the endpoint and temporarily stops routing requests to that endpoint (opening the breaker).
  • After a little while, a test request is sent.
  • If the test succeeds, the circuit breaker is closed again, allowing requests to resume being delivered.

In Linkerd 2.13, circuit breaking is a little limited:

  • Circuit breakers can only be opened when a certain number of consecutive failures occur.
  • “Failure” means an HTTP 5xx response; Linkerd doesn’t currently support response classification for circuit breakers.
  • Circuit breakers are configured through annotations on a Service, with all the relevant annotations containing the term “failure-accrual” in their names (from the internal name for circuit breaking in the code).

Circuit breakers in Linkerd are expected to gain functionality rapidly, so keep an eye out as new releases happen (and the annotation approach should be supplanted with Gateway API CRDs).

Circuit Breaking Example

To break the circuit after four consecutive request failures, apply these annotations to a Service:

balancer.linkerd.io/failure-accrual: consecutive
balancer.linkerd.io/failure-accrual-consecutive-max-failures: "4"

The failure-accrual: consecutive annotation switches on circuit breaking, and sets it to the “consecutive failure” mode (which is the only supported mode in 2.13).

All configuration for the “consecutive failure” mode of circuit breaking uses annotations that start with failure-accrual-consecutive-; the failure-accrual-consecutive-max-failures annotation sets the number of consecutive failures after which the circuit breaker will open.

Try reenabling traffic after 30 seconds:

balancer.linkerd.io/failure-accrual-consecutive-min-penalty: 30s

(This is for the first attempt. After that, the delay grows exponentially.)

Don’t ever wait more than 120 seconds between retries:

balancer.linkerd.io/failure-accrual-consecutive-max-penalty: 120s
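
Put together, the annotations above might sit on a Service like the sketch below. The Service name, namespace, selector, and ports are assumptions chosen for illustration; only the four annotations come from the text. Kubernetes requires annotation values to be strings, so the values are quoted, including the failure count.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: smiley # hypothetical Service
  namespace: faces
  annotations:
    # Enable circuit breaking in "consecutive failure" mode
    balancer.linkerd.io/failure-accrual: "consecutive"
    # Open the breaker after four consecutive failures
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "4"
    # First retry after 30 seconds; later delays grow, capped at 120 seconds
    balancer.linkerd.io/failure-accrual-consecutive-min-penalty: "30s"
    balancer.linkerd.io/failure-accrual-consecutive-max-penalty: "120s"
spec:
  selector:
    service: smiley # assumed Pod label
  ports:
    - port: 80
      targetPort: 8000 # assumed container port
```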

More information on circuit breaking is available in its documentation, at https://linkerd.io/2/tasks/circuit-breakers/.

Gotchas

The biggest gotcha of them all is that in Linkerd 2.13, ServiceProfiles do not compose with dynamic request routing and circuit breaking.

Getting specific, this means that when a ServiceProfile defines routes, it takes precedence over any HTTPRoutes with conflicting routes, and it also takes precedence over circuit breakers associated with the workloads referenced in the ServiceProfile. This is expected to be the case for the foreseeable future, to minimize surprises when upgrading from a version of Linkerd without the new features.

The challenge here, of course, is that there are still several things that require ServiceProfiles in Linkerd 2.13 (for example, retries and timeouts). The Linkerd team is actively working to quickly make all of this better, with a particular short-term focus on rapidly bringing HTTPRoutes to feature parity with ServiceProfiles.

Debugging Dynamic Request Routing and Circuit Breaking

The most typical failure you’ll see with these new features is that you enable one and it doesn’t seem to take effect. There are some simple rules of thumb for debugging:

  • First, check for ServiceProfiles. Remember that conflicting ServiceProfiles will always disable HTTPRoutes or circuit breakers.

  • Second, you may need to restart Pods after removing conflicting ServiceProfiles. This is because the Linkerd proxy needs to determine whether it is running in 2.12 mode or 2.13 mode, and in some situations it’s still possible for it not to shift between modes smoothly.

  • Finally, there’s a new linkerd diagnostics policy command, which will dump a large amount of internal Linkerd state describing what exactly the control plane is doing with routing. It’s extremely verbose, but can show you an enormous amount of information that can help with debugging problems.

Dynamic Request Routing and Circuit Breaking

Taken together, dynamic request routing and circuit breaking are two important new additions to Linkerd 2.13. They are still a bit limited in 2.13, but keep an eye out: we have big plans for these features as Linkerd’s development continues.


If you want more on this topic, check out the Circuit Breaking and Dynamic Request Routing Deep Dive Service Mesh Academy workshop for hands-on exploration of everything I’ve talked about here! And, as always, feedback is welcome: you can find me as @flynn on the Linkerd Slack.
