Service Mesh Networking

A service mesh moves cross-cutting service-to-service concerns — encryption, retries, traffic shifting, observability — out of application code and into a layer of small proxies that sit next to every service. The result is uniform networking behavior across hundreds of services without each team having to implement TLS, circuit breaking, or distributed tracing themselves. The cost is one more layer of infrastructure to operate, an extra hop per call, and a steeper learning curve. For some architectures it pays off enormously; for others it is overkill.

The sidecar pattern

In a service mesh, every application pod runs two containers:

  1. The application itself.
  2. A sidecar proxy (typically Envoy) that handles all network traffic for that application.

The application talks only to localhost. When it makes an outbound call to order-service:8080, the call actually goes to the local sidecar, which:

  1. Discovers the actual instances of order-service.
  2. Picks one based on load balancing policy.
  3. Opens an mTLS connection to that instance's sidecar.
  4. Adds tracing headers.
  5. Handles retries on failure.
  6. Emits metrics.

The destination's sidecar terminates the connection and forwards plaintext to its application. The applications on both sides see localhost-to-localhost traffic; the mesh handles everything in between.
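The outbound steps above can be sketched in a few lines. This is a toy model under stated assumptions, not how Envoy or any real proxy is implemented: the registry contents, the `sidecar_call` name, the `send` callback (a stand-in for the mTLS connection), and the random load-balancing policy are all hypothetical.

```python
import random
import uuid
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory registry standing in for real service discovery.
REGISTRY: Dict[str, List[str]] = {
    "order-service": ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"],
}

# Hypothetical counters standing in for emitted metrics.
METRICS = {"requests": 0, "failures": 0}


def sidecar_call(
    service: str,
    send: Callable[[str, Dict[str, str]], str],
    max_attempts: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Route one outbound call through the six steps listed above."""
    rng = rng or random.Random()
    instances = REGISTRY[service]                   # 1. discover instances
    headers = {"x-request-id": str(uuid.uuid4())}   # 4. add a tracing header
    last_error = None
    for _ in range(max_attempts):                   # 5. retry on failure
        target = rng.choice(instances)              # 2. pick one (random policy)
        METRICS["requests"] += 1                    # 6. emit metrics
        try:
            # 3. a real sidecar would open an mTLS connection here;
            #    `send` is a placeholder for that network call.
            return send(target, headers)
        except ConnectionError as exc:
            METRICS["failures"] += 1
            last_error = exc
    raise ConnectionError(
        f"{service}: all {max_attempts} attempts failed"
    ) from last_error
```

The application never sees any of this; it only observes either a response or a final error after the retries are exhausted.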

What the control plane does

The thousands of sidecars across a deployment are configured by a central control plane. The control plane:

  • Tracks every service instance in the cluster (typically via the orchestrator's API, such as Kubernetes endpoints).
  • Distributes routing rules ("90% of traffic to v1, 10% to v2").
  • Issues and rotates mTLS certificates for every workload identity.
  • Pushes policy ("service A may call service B but not service C").
  • Aggregates telemetry from sidecars.
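A routing rule such as "90% of traffic to v1, 10% to v2" ultimately becomes a weighted choice made by each sidecar per request. The sketch below illustrates that idea only; the function names and the weights dictionary are hypothetical and do not mirror any mesh's configuration format.

```python
import random
from collections import Counter
from typing import Dict


def pick_version(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a destination version in proportion to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate(weights: Dict[str, int], requests: int, seed: int = 0) -> Counter:
    """Count how many of `requests` simulated calls land on each version."""
    rng = random.Random(seed)
    return Counter(pick_version(weights, rng) for _ in range(requests))
```

Calling `simulate({"v1": 90, "v2": 10}, 10_000)` should put roughly one request in ten on v2. Because the split is probabilistic per request, small traffic volumes can deviate noticeably from the configured percentages.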

What you get from a mesh

Capability                       What it does
mTLS between services            Every call is encrypted and mutually authenticated; zero-trust between services
Retries and timeouts             Sidecar retries failed calls per policy; applications no longer need to implement them
Circuit breaking                 Sidecar stops sending to a failing destination, preventing cascading failure
Traffic shifting                 Canary rollouts: 5% of traffic to a new version, gradual increase if metrics look good
Fault injection                  Inject errors or latency to test resilience without touching application code
Per-service observability        RED metrics (rate, errors, duration) per service pair without instrumentation
Distributed tracing propagation  Sidecars add and report trace headers on each call; applications usually still have to copy them from inbound to outbound requests
Authorization policy             "Service A can call service B's /read endpoints only" enforced at the proxy
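To make the circuit breaking row concrete, here is a minimal consecutive-failure breaker. It is a generic textbook sketch, not the algorithm of any particular proxy; the class name, thresholds, and injectable clock are assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop sending to a destination after repeated consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request may be sent right now."""
        if self.opened_at is None:
            return True                     # closed: traffic flows normally
        # Open: block until the cool-down has elapsed, then allow a trial call.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None               # close the circuit again

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()   # open (or re-open) the circuit
```

A caller checks `allow()` before each request and reports the outcome afterwards. Failing fast while the circuit is open is what keeps one unhealthy destination from tying up callers further up the chain.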

The latency cost

Every call now traverses two extra processes: the caller's sidecar and the callee's sidecar. Modern proxies are fast (a typical sidecar adds 1-5 ms per hop), but across a chain of N sequential service calls the overhead adds up to roughly N × 2 × the per-sidecar cost.

Call depth                  Added latency at 3 ms per sidecar
1 service call              6 ms (2 sidecars × 3 ms)
5-service chain             30 ms
20-service deep call graph  120 ms

For latency-sensitive workloads (real-time trading, voice apps), this matters. For typical microservice latency budgets, it is a small slice of the total.
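The figures in the table follow from a single multiplication, shown here as a small helper. The function name and parameters are hypothetical, and the model assumes the calls happen sequentially and that each row of the table counts that many calls.

```python
def added_latency_ms(
    calls: int,
    per_sidecar_ms: float = 3.0,
    sidecars_per_call: int = 2,
) -> float:
    """Mesh overhead for a chain of sequential calls, in milliseconds."""
    return calls * sidecars_per_call * per_sidecar_ms
```

With the default 3 ms per sidecar, `added_latency_ms(5)` gives 30.0. Passing `per_sidecar_ms=1.0` or `per_sidecar_ms=5.0` gives the bounds implied by the 1-5 ms range, which is 10 ms to 50 ms for the same five calls.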

The compute cost

Each pod runs an additional process. Typical sidecar resource consumption:

  • 50-200 MB memory per sidecar.
  • 10-100 millicores CPU baseline; more under load.

For 1000 pods, that works out to roughly 50-200 GB of memory and 10-100 CPU cores at baseline, spent purely on the mesh. On large clusters this overhead is a meaningful fraction of total compute.
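The fleet-wide estimate is the per-sidecar range multiplied by the pod count. The helper below is a hypothetical back-of-the-envelope calculator using the ranges quoted above, with 1 GB taken as 1000 MB; real consumption depends on traffic and configuration.

```python
from typing import Dict, Tuple


def fleet_overhead(
    pods: int,
    mem_mb: Tuple[float, float] = (50, 200),
    cpu_millicores: Tuple[float, float] = (10, 100),
) -> Dict[str, Tuple[float, float]]:
    """Return (low, high) baseline mesh overhead for a fleet of pods."""
    return {
        "memory_gb": (pods * mem_mb[0] / 1000, pods * mem_mb[1] / 1000),
        "cpu_cores": (
            pods * cpu_millicores[0] / 1000,
            pods * cpu_millicores[1] / 1000,
        ),
    }
```

For 1000 pods this yields 50 to 200 GB of memory and 10 to 100 cores before any load-dependent CPU is counted.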

The operational cost

A mesh is critical infrastructure. The control plane must be highly available; sidecar upgrades must be coordinated; certificate rotation must work reliably; policy changes must be tested before pushing globally. Teams that adopt a mesh often discover they need a dedicated person or small team to operate it. The mesh becomes its own product with its own SLOs.

Sidecarless and ambient mesh

The latest direction is meshes that don't require a sidecar per pod. Instead, mesh functionality lives in a per-node agent or at the CNI layer. Benefits: lower per-pod overhead, easier rollout. Drawbacks: less isolation, more complex routing within the node.

For new deployments, evaluate both sidecar and ambient/sidecarless options. For existing sidecar deployments, migration is gradual.

Service mesh vs API gateway

Aspect         API gateway                         Service mesh
Scope          North-south (ingress from outside)  East-west (between internal services)
Auth           End-user auth (JWT, OAuth)          Workload auth (mTLS)
Rate limiting  Per user / per API key              Per service pair
Audience       External clients                    Internal services

They are complementary, not alternatives. Most production deployments have both.

When you don't need a mesh

  • Fewer than ~10 services and simple routing needs. Direct service-to-service over HTTPS with a service registry is sufficient.
  • Single-language deployment where standard libraries already provide retries, mTLS, and observability.
  • Performance-critical paths where the per-call latency budget can't absorb sidecar overhead.
  • Small teams where operating the mesh itself would dominate the engineering effort it saves.

Frequently Asked Questions

What is a service mesh?

A dedicated infrastructure layer that handles service-to-service communication in a microservices deployment. Each application instance has a sidecar proxy that intercepts inbound and outbound network calls; the mesh control plane configures all the proxies centrally. The mesh handles mTLS, retries, traffic shifting, observability, and policy without requiring changes to application code.

What is a sidecar proxy?

A small process or container that runs alongside each application instance and handles all network traffic for it. Outbound calls go through the sidecar; inbound traffic terminates at the sidecar before reaching the app. The application code talks only to localhost; the sidecar does the actual cross-network work — TLS, retries, observability.

Do I need a service mesh?

Probably not until you have many services and specific needs that aren't easily met without one. Indicators: dozens of services with complex routing requirements, need for zero-trust mTLS between services, gradual rollouts and traffic shifting, or service-level observability that the application can't provide itself. Below that threshold, a mesh adds operational burden without clear gain.

What is the cost of a service mesh?

Three things: latency (each call goes through two extra proxies, adding a few milliseconds per call, which adds up to tens of milliseconds across deep call chains), compute (each pod runs an additional sidecar process consuming CPU and memory), and operational complexity (the mesh is a critical infrastructure component that itself needs to be monitored, upgraded, and debugged).

What is mTLS in a service mesh?

Mutual TLS — every service-to-service connection is encrypted with TLS, and both sides present certificates and verify each other. In a mesh, the sidecars handle certificate issuance, rotation, and verification transparently. Applications make plaintext localhost calls; the sidecars upgrade the wire to mTLS. The effect is zero-trust between services with no code changes.
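As a rough illustration of what "both sides present certificates and verify each other" means at the TLS layer, the sketch below builds client and server contexts with Python's standard ssl module. The function names and the optional file-path parameters are hypothetical; a mesh sidecar does the equivalent inside the proxy and typically validates a workload identity rather than a hostname.

```python
import ssl
from typing import Optional


def mtls_server_context(
    cert: Optional[str] = None,
    key: Optional[str] = None,
    ca: Optional[str] = None,
) -> ssl.SSLContext:
    """Server side: present our certificate and require one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED      # reject clients without a valid cert
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    if ca:
        ctx.load_verify_locations(cafile=ca)  # CA that signs client certs
    return ctx


def mtls_client_context(
    cert: Optional[str] = None,
    key: Optional[str] = None,
    ca: Optional[str] = None,
) -> ssl.SSLContext:
    """Client side: verify the server and present our own certificate."""
    # PROTOCOL_TLS_CLIENT verifies the server certificate by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    if ca:
        ctx.load_verify_locations(cafile=ca)  # CA that signs server certs
    return ctx
```

The only difference from ordinary one-way TLS is on the server side, where `verify_mode` is set to `ssl.CERT_REQUIRED` so the handshake fails unless the client also proves its identity.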
