Automatic Multicluster Failover

The Linkerd Failover extension is a controller that automatically shifts traffic from a primary service to one or more fallback services whenever the primary becomes unavailable. This adds resiliency when a service is replicated across multiple clusters: if the local service is unavailable, the failover controller can shift its traffic to the replica in the backup cluster.

Let’s see a simple example of how to use this extension by installing the Emojivoto application on two Kubernetes clusters and simulating a failure in one cluster. We will see the failover controller shift traffic to the other cluster to ensure the service remains available.
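
To make the behavior described above concrete, here is a minimal sketch that models the weight shift as a shell function. It is a simplified model of the outcome this guide demonstrates, not the controller's actual code. The function name failover_weights is hypothetical, web-svc and web-svc-east are the service names used later in this guide, and giving every fallback a weight of 1 when several are listed is an assumption.

```shell
# Simplified model of the weight shift shown in this guide; NOT the controller's code.
# Usage: failover_weights <primary_ready:true|false> <primary> <fallback>...
failover_weights() {
  primary_ready=$1
  primary=$2
  shift 2
  if [ "$primary_ready" = "true" ]; then
    # Primary is available: it receives all traffic.
    echo "$primary=1"
    for svc in "$@"; do echo "$svc=0"; done
  else
    # Primary is unavailable: traffic moves to the fallbacks.
    # Assumption: every listed fallback receives weight 1.
    echo "$primary=0"
    for svc in "$@"; do echo "$svc=1"; done
  fi
}

failover_weights true  web-svc web-svc-east
failover_weights false web-svc web-svc-east
```

The two calls print the healthy state and the failed-over state in turn, which correspond to the TrafficSplit weights inspected in the testing section.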

Linkerd Production Tip

This page contains best-effort instructions by the open source community. Production users with mission-critical applications should familiarize themselves with Linkerd production resources and/or connect with a commercial Linkerd provider.

Prerequisites

You will need two clusters, each with Linkerd installed, that are linked together with the multicluster extension. Follow the steps in the multicluster guide to generate a shared trust root, install Linkerd, Linkerd Viz, and Linkerd Multicluster, and link the clusters together. For the remainder of this guide, we will assume the cluster context names are “east” and “west”. Please substitute your own cluster context names where appropriate.
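
Before continuing, it may be worth confirming that both clusters pass Linkerd's built-in health checks. The sketch below wraps the linkerd check and linkerd multicluster check commands in a small helper. The function name verify_multicluster is hypothetical, and the context names are the “east” and “west” assumed by this guide.

```shell
# Run Linkerd's built-in health checks against each cluster context.
# Stops at the first context that fails a check.
verify_multicluster() {
  for ctx in "$@"; do
    echo "== checking context: $ctx =="
    linkerd --context="$ctx" check || return 1
    linkerd --context="$ctx" multicluster check || return 1
  done
}

# Example: verify_multicluster east west
```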

Installing the Failover Extension

Failovers are described using SMI TrafficSplit resources, so we will install both the Linkerd SMI extension and the Linkerd Failover extension. These can be installed in both clusters, but since this example only initiates failover from the “west” cluster, we’ll install them only in that cluster:

    # Install linkerd-smi in west cluster
    > helm --kube-context=west repo add linkerd-smi https://linkerd.github.io/linkerd-smi
    > helm --kube-context=west repo up
    > helm --kube-context=west install linkerd-smi -n linkerd-smi --create-namespace linkerd-smi/linkerd-smi

    # Install linkerd-failover in west cluster
    > helm --kube-context=west repo add linkerd-edge https://helm.linkerd.io/edge
    > helm --kube-context=west repo up
    > helm --kube-context=west install linkerd-failover -n linkerd-failover --create-namespace --devel linkerd-edge/linkerd-failover
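
As an optional sanity check after the install, the following sketch confirms that the TrafficSplit resource type is registered and that the pods in both extension namespaces are ready. The helper name verify_failover_install and the 120 second timeout are assumptions chosen for illustration. The namespaces are the ones passed to helm above.

```shell
# Sanity-check the SMI and Failover extensions in a given context.
# Usage: verify_failover_install <context>
verify_failover_install() {
  ctx=$1
  # The TrafficSplit CRD is expected to exist once linkerd-smi is installed.
  kubectl --context="$ctx" get crd trafficsplits.split.smi-spec.io || return 1
  # Wait for the pods in both extension namespaces to become Ready.
  # The 120s timeout is an arbitrary choice for this sketch.
  for ns in linkerd-smi linkerd-failover; do
    kubectl --context="$ctx" -n "$ns" wait pod --all \
      --for=condition=Ready --timeout=120s || return 1
  done
}

# Example: verify_failover_install west
```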

Installing and Exporting Emojivoto

We’ll now install the Emojivoto example application into both clusters:

    > linkerd --context=west inject https://run.linkerd.io/emojivoto.yml | kubectl --context=west apply -f -
    > linkerd --context=east inject https://run.linkerd.io/emojivoto.yml | kubectl --context=east apply -f -

Next we’ll “export” the web-svc in the east cluster by setting the mirror.linkerd.io/exported=true label. This will instruct the multicluster extension to create a mirror service called web-svc-east in the west cluster, making the east Emojivoto application available in the west cluster:

    > kubectl --context=east -n emojivoto label svc/web-svc mirror.linkerd.io/exported=true
    > kubectl --context=west -n emojivoto get svc
    NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
    emoji-svc      ClusterIP   10.96.41.137    <none>        8080/TCP,8801/TCP   13m
    voting-svc     ClusterIP   10.96.247.68    <none>        8080/TCP,8801/TCP   13m
    web-svc        ClusterIP   10.96.222.169   <none>        80/TCP              13m
    web-svc-east   ClusterIP   10.96.244.245   <none>        80/TCP              92s
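
The mirror service may take a few moments to appear after the label is applied. If you are scripting these steps, a small polling helper such as the sketch below can wait for it. The function name wait_for_service and its default of 30 attempts spaced 2 seconds apart are assumptions for illustration.

```shell
# Poll until a Service exists, or give up after a number of attempts.
# Usage: wait_for_service <context> <namespace> <service> [attempts] [delay_seconds]
wait_for_service() {
  ctx=$1 ns=$2 svc=$3 attempts=${4:-30} delay=${5:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if kubectl --context="$ctx" -n "$ns" get "svc/$svc" >/dev/null 2>&1; then
      echo "found $svc in $ctx/$ns"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $svc in $ctx/$ns" >&2
  return 1
}

# Example: wait_for_service west emojivoto web-svc-east
```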

Creating the Failover TrafficSplit

To tell the failover controller how to fail over traffic, we need to create a TrafficSplit resource in the west cluster with the failover.linkerd.io/controlled-by: linkerd-failover label. The failover.linkerd.io/primary-service annotation indicates that the web-svc backend is the primary; all other backends are treated as fallbacks:

    kubectl --context=west apply -f - <<EOF
    apiVersion: split.smi-spec.io/v1alpha2
    kind: TrafficSplit
    metadata:
      name: web-svc-failover
      namespace: emojivoto
      labels:
        failover.linkerd.io/controlled-by: linkerd-failover
      annotations:
        failover.linkerd.io/primary-service: web-svc
    spec:
      service: web-svc
      backends:
      - service: web-svc
        weight: 1
      - service: web-svc-east
        weight: 0
    EOF

This TrafficSplit indicates that the local (west) web-svc should be used as the primary, but traffic should be shifted to the remote (east) web-svc-east if the primary becomes unavailable.
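
Because every backend other than the primary is treated as a fallback, the same manifest shape extends to any number of fallback services. The sketch below generates such a manifest from a primary and a list of fallbacks. The function name render_failover_split and the "<primary>-failover" naming pattern are assumptions that mirror the manifest in this guide. The label, annotation, and API version are copied from it.

```shell
# Print a failover TrafficSplit manifest for one primary and any number of fallbacks.
# Usage: render_failover_split <namespace> <primary> <fallback>...
render_failover_split() {
  ns=$1
  primary=$2
  shift 2
  cat <<EOF
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: ${primary}-failover
  namespace: ${ns}
  labels:
    failover.linkerd.io/controlled-by: linkerd-failover
  annotations:
    failover.linkerd.io/primary-service: ${primary}
spec:
  service: ${primary}
  backends:
  - service: ${primary}
    weight: 1
EOF
  # Each fallback starts with weight 0 so the primary takes all traffic.
  for fallback in "$@"; do
    printf '  - service: %s\n    weight: 0\n' "$fallback"
  done
}

# Preview the manifest used in this guide. To apply it, pipe the output
# to: kubectl --context=west apply -f -
render_failover_split emojivoto web-svc web-svc-east
```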

Testing the Failover

We can use the linkerd viz stat command to see that the vote-bot traffic generator in the west cluster is sending traffic to the local primary service, web-svc:

    > linkerd --context=west viz stat -n emojivoto svc --from deploy/vote-bot
    NAME           MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
    web-svc             -    96.67%   2.0rps           2ms           3ms           5ms          1
    web-svc-east        -         -        -             -             -             -          -

Now we’ll simulate the local service becoming unavailable by scaling it down:

    > kubectl --context=west -n emojivoto scale deploy/web --replicas=0

We can immediately see that the TrafficSplit has been adjusted to send traffic to the backup. Notice that the web-svc backend now has weight 0 and the web-svc-east backend now has weight 1:

    > kubectl --context=west -n emojivoto get ts/web-svc-failover -o yaml
    apiVersion: split.smi-spec.io/v1alpha2
    kind: TrafficSplit
    metadata:
      annotations:
        failover.linkerd.io/primary-service: web-svc
      creationTimestamp: "2022-03-22T23:47:11Z"
      generation: 4
      labels:
        failover.linkerd.io/controlled-by: linkerd-failover
      name: web-svc-failover
      namespace: emojivoto
      resourceVersion: "10817806"
      uid: 77039fb3-5e39-48ad-b7f7-638d187d7a28
    spec:
      backends:
      - service: web-svc
        weight: 0
      - service: web-svc-east
        weight: 1
      service: web-svc
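
If you only care about the weights, the full YAML can be trimmed to one line per backend with kubectl's JSONPath output. This is a small sketch; the helper name ts_weights is hypothetical.

```shell
# Print each backend of a TrafficSplit as "service=weight", one per line.
# Usage: ts_weights <context> <namespace> <trafficsplit>
ts_weights() {
  kubectl --context="$1" -n "$2" get "ts/$3" \
    -o jsonpath='{range .spec.backends[*]}{.service}{"="}{.weight}{"\n"}{end}'
}

# Example: ts_weights west emojivoto web-svc-failover
```

Run against the TrafficSplit above while the primary is scaled down, this should list web-svc with weight 0 and web-svc-east with weight 1.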

We can also confirm that this traffic is going to the fallback using the viz stat command:

    > linkerd --context=west viz stat -n emojivoto svc --from deploy/vote-bot
    NAME           MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
    web-svc             -         -        -             -             -             -          -
    web-svc-east        -    93.04%   1.9rps          25ms          30ms          30ms          1

Finally, we can restore the primary by scaling its deployment back up and observe the traffic shift back to it:

    > kubectl --context=west -n emojivoto scale deploy/web --replicas=1
    deployment.apps/web scaled
    > linkerd --context=west viz stat -n emojivoto svc --from deploy/vote-bot
    NAME           MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
    web-svc             -    89.29%   1.9rps           2ms           4ms           5ms          1
    web-svc-east        -         -        -             -             -             -          -
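
The shift back to the primary is not necessarily instantaneous, so a scripted test may want to wait for it rather than check once. The sketch below polls the TrafficSplit until a chosen backend has a non-zero weight. The helper name wait_for_backend_active and its retry defaults are assumptions for illustration.

```shell
# Poll until the given backend of a TrafficSplit has a non-zero weight.
# Usage: wait_for_backend_active <context> <namespace> <trafficsplit> <backend> [attempts] [delay_seconds]
wait_for_backend_active() {
  ctx=$1 ns=$2 split=$3 backend=$4 attempts=${5:-30} delay=${6:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Select the weight of the named backend with a JSONPath filter.
    weight=$(kubectl --context="$ctx" -n "$ns" get "ts/$split" \
      -o jsonpath="{.spec.backends[?(@.service==\"$backend\")].weight}" 2>/dev/null)
    case "$weight" in
      ''|0) ;;  # missing or zero: not receiving traffic yet
      *) echo "$backend weight=$weight"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $backend to receive traffic" >&2
  return 1
}

# Example: wait_for_backend_active west emojivoto web-svc-failover web-svc
```

The same helper can be pointed at web-svc-east right after scaling the primary down to wait for the failover itself.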