Beyond linkerd-viz: Linkerd Metrics with OpenTelemetry


TL;DR

Linkerd, the enterprise-grade service mesh that minimizes overhead, now integrates with OpenTelemetry, often also simply called OTel. That’s pretty cool because it allows you to collect and export Linkerd’s metrics to your favorite observability tools. This integration improves your ability to monitor and troubleshoot applications effectively. Sounds interesting? Read on.

Before we dive into this topic, I want to be sure you have a basic understanding of Kubernetes. If you’re new to it, that’s ok! I’d recommend exploring the official Kubernetes tutorials and/or experimenting with kind (Kubernetes in Docker) using this simple guide.

What’s “Linkerd”?

Ok, let’s start with the basics. Linkerd is a free and open source service mesh whose data-plane proxy is written in Rust—a modern programming language known for its memory safety. Simply put, a service mesh injects a proxy as a sidecar next to each container (in any relevant Kubernetes Pod), providing encryption (mutual TLS, or mTLS) and improved observability and reliability. A common use case is load balancing (that’s how Linkerd got on my radar). If your Kubernetes app communicates via gRPC, requests aren’t automatically distributed among the gRPC server pods: Kubernetes’ built-in Service load balancing works per connection, and gRPC multiplexes many requests over a single long-lived HTTP/2 connection. Linkerd closes that gap effectively, but that’s not the topic of this blog post. If you’re interested in learning more about that, check out the gRPC Load Balancing on Kubernetes without Tears blog post. Now back to OpenTelemetry.

Linkerd’s observability extension - linkerd-viz

Linkerd has the linkerd-viz extension with a dashboard showing valuable metrics such as success rates, request/response errors, latency, and traffic volumes. A Prometheus instance included with the linkerd-viz extension scrapes this data from the Linkerd control plane and from the Linkerd proxies running inside your pods.
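
Under the hood, each Linkerd proxy exposes these metrics in Prometheus text format on its admin port (4191 by default). If you just want to peek at the raw data for one meshed pod, something like the following should work (the pod and namespace names are placeholders):

# Forward the admin port of one of your meshed pods
kubectl port-forward -n <YOUR_NAMESPACE> <POD_NAME> 4191:4191

# In another terminal, fetch the raw Prometheus metrics
curl -s http://localhost:4191/metrics | head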

However, the Prometheus instance supplied by linkerd-viz isn’t intended for production use; by default it keeps only a few hours of data and has no persistent storage. Also, while Prometheus is effective for scraping and storing telemetry data, many organizations prefer to centralize their data in Application Performance Monitoring (APM) platforms like Datadog, New Relic, or Coralogix. Instead of keeping their observability data in Prometheus, they delegate that responsibility to a third-party solution or vendor, and to achieve that, they often use OpenTelemetry.

Cool! So, what’s OpenTelemetry?

OpenTelemetry is another CNCF project. It’s an open source observability framework that provides standardized protocols and tools for collecting and routing telemetry data, including logs, metrics, and traces. While there are many projects used for storing or viewing telemetry, OTel has become the de facto standard for collecting telemetry data. To learn more about the project, check out the A Complete Introductory Guide to OpenTelemetry blog post.

An OpenTelemetry Collector configuration is built from four kinds of components (see the minimal example after this list):

  • Receivers collect data from your application.
  • Processors transform received data according to rules you define.
  • Exporters send processed data to your APM.
  • Finally, pipelines define the overall flow from receivers to processors to exporters.
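
To make that concrete, here’s a minimal, hypothetical collector config showing how the pieces fit together: an OTLP receiver, a batch processor, and the debug exporter, wired into a metrics pipeline (all three components ship with the core collector).

receivers:
  otlp:                # accept OTLP data over gRPC on the default port (4317)
    protocols:
      grpc: {}
processors:
  batch: {}            # group telemetry into batches before exporting
exporters:
  debug: {}            # print what flows through the pipeline to stdout
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]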

What makes OpenTelemetry so powerful is the vast number of integrations it has. The core distribution of the “otel-collector” ships with a handful of basic receivers, processors, and exporters, but you can also switch to the contrib distribution (opentelemetry-collector-contrib), which comes with a ton of additional integrations.
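
If you’re ever unsure whether a particular component is included in the distribution you’re running, recent collector releases have a components subcommand that prints everything the binary was built with. A quick way to check, assuming you have Docker handy (output format may vary between versions):

# List the receivers, processors, exporters, etc. compiled into the image
docker run --rm otel/opentelemetry-collector:latest components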

For this blog post, we’ll send those metrics to OpenObserve, a lightweight, open source observability platform that you can run entirely in your cluster, but feel free to use your favorite APM/observability solution instead. We’ll also sidestep any questions about linkerd-viz and its bundled Prometheus by having the otel-collector scrape the Linkerd proxies directly and export the metrics to OpenObserve—that way we don’t have to rely on linkerd-viz at all!

Ok, let’s roll up our sleeves and get started.

Installing OpenObserve

Start by installing OpenObserve by running these commands:

helm repo add openobserve https://charts.openobserve.ai
helm repo update

helm install --wait --create-namespace -n openobserve \
  openobserve openobserve/openobserve-standalone
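
Before moving on, a quick sanity check that the release came up (exact pod names depend on the chart version):

kubectl get pods -n openobserve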

After a few seconds, you’ll be able to port forward to the UI:

kubectl port-forward -n openobserve \
  svc/openobserve-openobserve-standalone 5080:5080

Browse to localhost:5080 and use the default credentials to log in:

User: root@example.com
Password: Complexpass#123

And you’re in!

Installing Linkerd

Installing Linkerd is pretty straightforward. Just follow the Getting Started Guide, and you’ll have Linkerd running in ~2–5 minutes; you can skip the linkerd-viz part—we won’t need that.
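
For reference, once the linkerd CLI from the guide is on your PATH, the core steps boil down to roughly this (the exact commands may differ slightly between Linkerd releases, so treat the guide as the source of truth):

linkerd check --pre                          # validate that your cluster is ready
linkerd install --crds | kubectl apply -f -  # install the Linkerd CRDs
linkerd install | kubectl apply -f -         # install the control plane
linkerd check                                # verify the installation is healthy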

Next, inject the Linkerd proxy into your workloads. To inject a specific Deployment, run:

# Inject Linkerd into a specific deployment
kubectl get -n <YOUR_NAMESPACE> deploy/<DEPLOYMENT_NAME> -o yaml \
 | linkerd inject - \
 | kubectl apply -f -

Or inject Linkerd into an entire namespace with:

kubectl get ns <YOUR_NAMESPACE> -o yaml \
 | linkerd inject - \
 | kubectl apply -f -

Then restart all workloads in that namespace so they pick up the proxy:

kubectl rollout restart deployment -n <YOUR_NAMESPACE>
kubectl rollout status deployment -n <YOUR_NAMESPACE>

If you followed the guide, you should now have Linkerd sidecars injected into your workloads. The easiest way to verify this is to run:

kubectl get pods -n <YOUR_NAMESPACE>

You should see 2 containers in each pod (Ready: 2/2).
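
You can also let Linkerd verify the data plane itself, which checks (among other things) that the proxies are healthy and can talk to the control plane:

linkerd check --proxy -n <YOUR_NAMESPACE>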

Configuring OTel

The following YAMLs deploy a basic otel-collector pipeline, configured to scrape any Linkerd control-plane or data-plane pods—let’s break those down.

First we need to create a Namespace in which the otel-collector will run. Then we’ll create a ClusterRole and ClusterRoleBinding that give the “otel-collector” ServiceAccount permission to get, list, and watch pods, services, endpoints, namespaces, and nodes.

This is what the Prometheus receiver’s Kubernetes service discovery uses to find the pods running with the Linkerd sidecars injected.

We will then attach this ServiceAccount to the “otel-collector” deployment.

apiVersion: v1
kind: Namespace
metadata:
  name: observability
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-read
rules:
  - apiGroups: [""]
    resources: ["pods", "endpoints", "services", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-read
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
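
If you want to double-check the RBAC wiring once it’s applied, kubectl can impersonate the ServiceAccount (assuming your own user is allowed to impersonate service accounts):

# Should print "yes"
kubectl auth can-i list pods -n linkerd \
  --as=system:serviceaccount:observability:otel-collector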

Next, we’ll create a ConfigMap that configures otel-collector to scrape the Linkerd-related pods for the metrics we want, using a Prometheus receiver, an OTLP (OpenTelemetry Protocol) exporter that sends the data into OpenObserve, and a pipeline that links the two together. (We don’t define any processors since we don’t need to do any processing.)

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
  labels:
    app: otel-collector
data:
  config.yaml: |
    receivers:
      prometheus/linkerd:
        config:
          global:
            evaluation_interval: 10s
            scrape_interval: 10s
            scrape_timeout: 10s
          scrape_configs:
            - job_name: 'linkerd-controller'
              kubernetes_sd_configs:
              - role: pod
                namespaces:
                  names:
                  - 'linkerd'
                  - 'linkerd-viz'
              relabel_configs:
              - source_labels:
                - __meta_kubernetes_pod_container_port_name
                action: keep
                regex: admin
              - source_labels: [__meta_kubernetes_pod_container_name]
                action: replace
                target_label: component


            - job_name: 'linkerd-multicluster-controller'
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
              - source_labels:
                - __meta_kubernetes_pod_label_component
                - __meta_kubernetes_pod_container_port_name
                action: keep
                regex: (linkerd-service-mirror|controller);admin$
              - source_labels: [__meta_kubernetes_pod_container_name]
                action: replace
                target_label: component


            - job_name: 'linkerd-proxy'
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
              - source_labels:
                - __meta_kubernetes_pod_container_name
                - __meta_kubernetes_pod_container_port_name
                - __meta_kubernetes_pod_label_linkerd_io_control_plane_ns
                action: keep
                regex: ^linkerd-proxy;linkerd-admin;linkerd$
              - source_labels: [__meta_kubernetes_namespace]
                action: replace
                target_label: namespace
              - source_labels: [__meta_kubernetes_pod_name]
                action: replace
                target_label: pod
              # special case k8s' "job" label, to not interfere with prometheus' "job"
              # label
              # __meta_kubernetes_pod_label_linkerd_io_proxy_job=foo =>
              # k8s_job=foo
              - source_labels: [__meta_kubernetes_pod_label_linkerd_io_proxy_job]
                action: replace
                target_label: k8s_job
              # drop __meta_kubernetes_pod_label_linkerd_io_proxy_job
              - action: labeldrop
                regex: __meta_kubernetes_pod_label_linkerd_io_proxy_job
              # __meta_kubernetes_pod_label_linkerd_io_proxy_deployment=foo =>
              # deployment=foo
              - action: labelmap
                regex: __meta_kubernetes_pod_label_linkerd_io_proxy_(.+)
              # drop all labels that we just made copies of in the previous labelmap
              - action: labeldrop
                regex: __meta_kubernetes_pod_label_linkerd_io_proxy_(.+)
              # __meta_kubernetes_pod_label_linkerd_io_foo=bar =>
              # foo=bar
              - action: labelmap
                regex: __meta_kubernetes_pod_label_linkerd_io_(.+)
              # Copy all pod labels to tmp labels
              - action: labelmap
                regex: __meta_kubernetes_pod_label_(.+)
                replacement: __tmp_pod_label_$1
              # Take `linkerd_io_` prefixed labels and copy them without the prefix
              - action: labelmap
                regex: __tmp_pod_label_linkerd_io_(.+)
                replacement:  __tmp_pod_label_$1
              # Drop the `linkerd_io_` originals
              - action: labeldrop
                regex: __tmp_pod_label_linkerd_io_(.+)
              # Copy tmp labels into real labels
              - action: labelmap
                regex: __tmp_pod_label_(.+)


    exporters:
      otlp/openobserve:
        endpoint: openobserve-openobserve-standalone.openobserve.svc.cluster.local:5081
        headers:
          Authorization: "Basic cm9vdEBleGFtcGxlLmNvbTpDb21wbGV4cGFzcyMxMjM="
          organization: default
          stream-name: default
        tls:
          insecure: true
      debug:
        verbosity: detailed


    service:
      pipelines:
        metrics:
          receivers: [prometheus/linkerd]
          exporters: [debug,otlp/openobserve]    
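
As an aside, if you later decide you do want a processor, the batch processor that ships with the core collector is a common first addition; it groups metrics into batches before export. The relevant parts of the config above would change roughly like this:

processors:
  batch: {}

service:
  pipelines:
    metrics:
      receivers: [prometheus/linkerd]
      processors: [batch]
      exporters: [debug, otlp/openobserve]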

Finally, we’ll add the otel-collector Deployment with the attached ServiceAccount and the ConfigMap we’ve defined above.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
  labels:
    app: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector:latest
          args: ["--config=/etc/otelcol/config.yaml"]
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-config
            items:
              - key: config.yaml
                path: config.yaml

Note the exporter:

exporters:
  otlp/openobserve:
    endpoint: openobserve-openobserve-standalone.openobserve.svc.cluster.local:5081
    headers:
      Authorization: "Basic cm9vdEBleGFtcGxlLmNvbTpDb21wbGV4cGFzcyMxMjM="
      organization: default
      stream-name: default
    tls:
      insecure: true
  debug:
    verbosity: detailed

The token:

Authorization: "Basic cm9vdEBleGFtcGxlLmNvbTpDb21wbGV4cGFzcyMxMjM="

It’s just the Base64 encoding of the default username and password we used above, joined as user:password.
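
You can reproduce it yourself (the -n matters, to avoid encoding a trailing newline):

echo -n 'root@example.com:Complexpass#123' | base64
# cm9vdEBleGFtcGxlLmNvbTpDb21wbGV4cGFzcyMxMjM=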

Save all of the manifests above into a single file, otel-collector.yaml, and apply it by running:

kubectl apply -f otel-collector.yaml

That’s it!

The otel-collector should now be up, collecting metrics from Linkerd and sending them to OpenObserve. Head over to http://localhost:5080/web/metrics (with the port-forward from earlier still running) and you should see Linkerd’s metrics appear. If you need to troubleshoot, check the otel-collector’s logs for errors.
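
For example:

# Is the collector pod healthy?
kubectl get pods -n observability

# Any scrape or export errors in the logs?
kubectl logs -n observability deploy/otel-collector --tail=50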

You can view the source code and a general guide to the above in its GitHub repo at https://github.com/Eli-Goldberg/linkerd-otel.

Those metrics contain a ton of valuable information, such as traffic volumes, success rates, and latency. A lot of DevOps/Platform/SRE teams build their dashboards in their favorite APM/observability solution (mostly paid ones), especially at larger companies. This post demonstrates how to get Linkerd’s metrics out and ingest them into whichever solution you prefer.

Bonus: Linkerd’s Auth policy

Linkerd comes with a handy feature called Authorization Policies. With it, you can allow or prevent workloads from talking to each other (think firewall) based on different parameters: which services are involved (meshed or unmeshed, on the same or separate clusters), which port is being used, and so on.

It’s worth noting that right now, all of those metrics are readable by anything running in the cluster, which isn’t great in many cases. In production, you might want to secure access to them and allow it exclusively from the otel-collector.

Stay tuned for my next post about using Linkerd authorization policies to fix that.
