Retries and Timeouts
Timeouts and automatic retries are two of the most powerful mechanisms a service mesh has for gracefully handling partial or transient application failures.
- Timeouts allow Linkerd to cancel a request that is exceeding a time limit.
- Retries allow Linkerd to automatically retry failed requests, potentially sending them to a different endpoint.
Timeouts and retries are configured with a set of annotations, e.g. retry.linkerd.io/http and timeout.linkerd.io/request. These annotations are placed on HTTPRoute or GRPCRoute resources to configure behavior on HTTP or gRPC requests that match those resources. Alternatively, they can be placed on Service resources to configure retries and timeouts for all traffic to that service.
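As a minimal sketch of the two placements described above, the manifests below annotate an HTTPRoute and, separately, a Service. The resource names, namespace, ports, and selector (books, authors, booksapp, 7002, 7001) are hypothetical, the annotation values are illustrative only, and the HTTPRoute apiVersion is an assumption that should be matched to the Gateway API version installed in your cluster.

```yaml
# Route-level policy: applies only to requests that match this HTTPRoute.
apiVersion: gateway.networking.k8s.io/v1beta1   # assumption: adjust to your cluster
kind: HTTPRoute
metadata:
  name: books-route            # hypothetical name
  namespace: booksapp          # hypothetical namespace
  annotations:
    retry.linkerd.io/http: 5xx
    timeout.linkerd.io/request: 2s
spec:
  parentRefs:
    - name: books              # hypothetical Service the route attaches to
      kind: Service
      group: core
      port: 7002
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /books
---
# Service-level policy: applies to all traffic sent to this Service.
apiVersion: v1
kind: Service
metadata:
  name: authors                # hypothetical name
  namespace: booksapp
  annotations:
    timeout.linkerd.io/request: 5s
spec:
  selector:
    app: authors
  ports:
    - port: 7001
      targetPort: 7001
```

The two resources are shown side by side only to contrast the placements; they target different Services so that neither example depends on how route-level and Service-level annotations interact.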
As of Linkerd 2.16, timeouts and retries compose: requests that time out are eligible to be retried.
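A rough sketch of how the two might be combined on one route follows. It assumes that retry.linkerd.io/timeout bounds a single attempt and that timeout.linkerd.io/request bounds the request as a whole; confirm the exact semantics in the retries and timeout reference pages. The resource names, port, and durations are hypothetical.

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1   # assumption: adjust to your cluster
kind: HTTPRoute
metadata:
  name: books-retry-timeout    # hypothetical name
  namespace: booksapp          # hypothetical namespace
  annotations:
    # Retry on 5xx responses, at most twice.
    retry.linkerd.io/http: 5xx
    retry.linkerd.io/limit: "2"
    # Assumed per-attempt bound: an attempt exceeding this may be retried.
    retry.linkerd.io/timeout: 300ms
    # Assumed overall bound on the request.
    timeout.linkerd.io/request: 1s
spec:
  parentRefs:
    - name: books              # hypothetical Service
      kind: Service
      group: core
      port: 7002
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
```

Annotation values are strings, so the numeric retry limit is quoted.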
Using retries safely
Retries are an opt-in behavior that requires some thought and planning. Misuse can be dangerous. First, automatically retrying a request that changes system state each time it is called can be disastrous. Thus, retries should only be used on idempotent methods, i.e. methods that have the same effect even if called multiple times.
Second, retries by definition will increase the load on your system. A set of services that have requests being constantly retried could potentially get taken down by the retries instead of being allowed time to recover.
The exact configuration of retry behavior to improve overall reliability without significantly increasing risk will require some care on the part of the user.
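One way to act on both cautions is sketched below: retries are confined to a route that matches only GET requests, which are conventionally idempotent, and the number of retries is capped to bound the extra load. The names, port, path, and limit are hypothetical, and the sketch assumes GET handlers in the target service really are idempotent. How requests that match no route are treated may depend on your Linkerd version, so check that before attaching a method-restricted route to a Service that has no other routes.

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1   # assumption: adjust to your cluster
kind: HTTPRoute
metadata:
  name: books-get-only         # hypothetical name
  namespace: booksapp          # hypothetical namespace
  annotations:
    retry.linkerd.io/http: 5xx
    # A small cap keeps worst-case load at (1 + limit) attempts per request.
    retry.linkerd.io/limit: "1"
spec:
  parentRefs:
    - name: books              # hypothetical Service
      kind: Service
      group: core
      port: 7002
  rules:
    - matches:
        # Only GET requests under /books match, so POST, PUT, and DELETE
        # requests never pick up this route's retry annotations.
        - path:
            type: PathPrefix
            value: /books
          method: GET
```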
Per-request policies
In addition to the annotation approach outlined above, retries and timeouts can be set on a per-request basis by setting specific HTTP headers.
In order to enable this per-request policy, Linkerd must be installed with the --set policyController.additionalArgs="--allow-l5d-request-headers" flag or the corresponding Helm value.
Warning
skip-inbound-ports
to instruct Linkerd to skip handling inbound traffic to the
pod), untrusted clients will be able to specify Linkerd retry and timeout policy
on their requests.
Once per-request policy is enabled, you can set timeout and retry policy on individual requests by setting these headers:
- l5d-retry-http: Overrides the retry.linkerd.io/http annotation
- l5d-retry-grpc: Overrides the retry.linkerd.io/grpc annotation
- l5d-retry-limit: Overrides the retry.linkerd.io/limit annotation
- l5d-retry-timeout: Overrides the retry.linkerd.io/timeout annotation
- l5d-timeout: Overrides the timeout.linkerd.io/request annotation
- l5d-response-timeout: Overrides the timeout.linkerd.io/response annotation
Further reading
- Retries reference
- Timeout reference
- The Debugging HTTP applications with per-route metrics guide contains examples of retry and timeout annotations.