Service Mesh Architecture

What is a Service Mesh?

A service mesh is a dedicated infrastructure layer that manages service-to-service communication in a microservices architecture. It handles cross-cutting concerns like service discovery, load balancing, failure recovery, metrics collection, and security without requiring changes to application code.

Core premise: As microservices proliferate, each service reimplementing networking logic (retries, timeouts, circuit breakers, encryption) creates duplication and inconsistency. A service mesh centralizes these concerns in infrastructure.

Service mesh architecture:

Application Container          Sidecar Proxy
┌─────────────────────┐       ┌──────────────────┐
│  Business Logic     │◀─────▶│  Envoy Proxy     │
│  (Your Code)        │       │  - Routing       │
└─────────────────────┘       │  - Retries       │
                              │  - Encryption    │
                              │  - Metrics       │
                              └──────────────────┘
                                      ▲
                                      │
                                      ▼
                              ┌──────────────────┐
                              │  Control Plane   │
                              │  - Configuration │
                              │  - Telemetry     │
                              │  - Service Disc. │
                              └──────────────────┘

Key components:

Data Plane: Network proxies (sidecars) deployed alongside each service instance. Handle all network traffic, enforce policies, collect telemetry.

Control Plane: Manages and configures data plane proxies. Provides service discovery, certificate management, policy distribution, telemetry aggregation.

Core Service Mesh Capabilities

Traffic Management

Control how requests flow between services.

Traffic routing:

  • Request routing: Route based on headers, paths, query parameters
  • Load balancing: Distribute requests across instances (round-robin, least-request, weighted)
  • Traffic splitting: Canary deployments, A/B testing (90% to v1, 10% to v2)
  • Traffic mirroring: Copy production traffic to test environment

Example: Canary deployment

Incoming requests
       │
       ▼
┌─────────────────┐
│  Service Mesh   │
│  Traffic Split  │
└─────────────────┘
       │
       ├─────90%────▶ Service v1 (stable)
       │
       └─────10%────▶ Service v2 (canary)
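A weighted split like the one above can be sketched as a weighted random choice over destinations. This is an illustrative model of the behavior, not any mesh's actual implementation; service names are made up:

```python
import random

def pick_destination(weights, rng=random):
    """Choose a destination according to its traffic weight.

    `weights` maps destination name -> integer weight (e.g. a percentage).
    """
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Route roughly 90% of requests to v1 and 10% to the canary.
split = {"service-v1": 90, "service-v2": 10}
counts = {"service-v1": 0, "service-v2": 0}
for _ in range(10_000):
    counts[pick_destination(split)] += 1
```

Over many requests the observed ratio converges on the configured weights, which is exactly the guarantee a mesh's traffic-splitting rule provides.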

Resilience patterns:

  • Timeouts: Prevent indefinite waiting
  • Retries: Automatic retry with exponential backoff
  • Circuit breakers: Stop calling failing services
  • Rate limiting: Protect services from overload
  • Bulkheading: Isolate failures to prevent cascade
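Of these patterns, the circuit breaker is the least obvious, so here is a minimal sketch of the state machine a sidecar maintains per upstream (closed → open → half-open). Thresholds and the API are illustrative:

```python
import time

class CircuitBreaker:
    """Closed: calls pass through. Open: calls fail fast.
    Half-open: after a cooldown, one trial call decides whether to close again."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # trip (or re-trip) the circuit
```

Failing fast while the circuit is open is what stops a struggling service from being buried under retries from every caller.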

Service Discovery

Dynamically discover and connect to service instances.

How it works:

  1. Service registers with control plane
  2. Control plane maintains service registry
  3. Sidecars query control plane for service endpoints
  4. Control plane pushes configuration updates to sidecars

Benefits over DNS:

  • Faster updates: No DNS TTL delays
  • Health checking: Only route to healthy instances
  • Rich metadata: Service version, zone, custom labels
  • Dynamic configuration: Update routing without DNS changes
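The registry flow above can be sketched as a tiny in-memory control plane that only hands out healthy endpoints and supports metadata filtering. The API here is hypothetical, not any real mesh's:

```python
class ServiceRegistry:
    """Minimal control-plane registry: sidecars query it for healthy endpoints."""

    def __init__(self):
        # service name -> {endpoint -> metadata dict, including health}
        self._services = {}

    def register(self, service, endpoint, **metadata):
        self._services.setdefault(service, {})[endpoint] = {"healthy": True, **metadata}

    def set_health(self, service, endpoint, healthy):
        self._services[service][endpoint]["healthy"] = healthy

    def endpoints(self, service, zone=None):
        """Return healthy endpoints, optionally filtered by zone metadata."""
        out = []
        for ep, meta in self._services.get(service, {}).items():
            if not meta["healthy"]:
                continue
            if zone is not None and meta.get("zone") != zone:
                continue
            out.append(ep)
        return out
```

Because health and metadata live in the registry, an instance that fails its health check disappears from routing immediately, with no DNS TTL to wait out.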

Security

Secure service-to-service communication without application code changes.

Mutual TLS (mTLS):

  • Encrypt all traffic between services
  • Authenticate both client and server
  • Automatic certificate rotation
  • Zero-trust networking (never trust, always verify)

Service A                      Service B
┌────────────┐                ┌────────────┐
│  App       │                │  App       │
└────────────┘                └────────────┘
      │                             │
      ▼                             ▼
┌────────────┐   mTLS Channel  ┌────────────┐
│ Sidecar A  │◀───────────────▶│ Sidecar B  │
│ (TLS cert) │   Encrypted     │ (TLS cert) │
└────────────┘                 └────────────┘

Authorization policies:

  • Define which services can communicate
  • Enforce at the proxy level (before reaching application)
  • Fine-grained rules (by service, method, path)
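A proxy-level authorization check can be sketched as default-deny rule matching on (source, method, path). The rule shape below is illustrative, not any mesh's policy schema:

```python
from fnmatch import fnmatch

def is_allowed(rules, source, method, path):
    """Return True if any rule permits the call. Default deny."""
    for rule in rules:
        if (source in rule["sources"]
                and method in rule["methods"]
                and any(fnmatch(path, p) for p in rule["paths"])):
            return True
    return False

rules = [
    # Only the frontend may read orders; only billing may create charges.
    {"sources": {"frontend"}, "methods": {"GET"}, "paths": ["/orders/*"]},
    {"sources": {"billing"}, "methods": {"POST"}, "paths": ["/charges"]},
]
```

Because the sidecar evaluates these rules before the request reaches the application, an unauthorized call is rejected even if the application itself performs no access checks.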

Observability

Gain visibility into service-to-service communication.

Distributed tracing:

  • Automatic trace generation for requests
  • Correlate requests across service boundaries
  • Identify latency bottlenecks
  • Visualize request flow

Metrics collection:

  • Request rate, error rate, latency (RED metrics)
  • Per-service, per-endpoint granularity
  • No code instrumentation required
  • Export to Prometheus, Grafana, CloudWatch

Access logging:

  • Log all requests with source, destination, status, latency
  • Centralized logging for debugging
  • Audit trail for security compliance

When to Adopt a Service Mesh

Service meshes add complexity. Adopt when benefits justify the cost.

When Service Mesh Makes Sense

Large-scale microservices:

  • Dozens or hundreds of services
  • Service-to-service communication is complex
  • Inconsistent implementation of resilience patterns across services

Polyglot architecture:

  • Multiple languages and frameworks
  • Cannot standardize on one library for networking concerns
  • Need consistent observability across all services

Security requirements:

  • Zero-trust networking mandated
  • Encryption in transit required
  • Fine-grained authorization needed

Operational challenges:

  • Difficulty debugging distributed failures
  • Lack of visibility into service dependencies
  • Inconsistent retry and timeout behavior

When Service Mesh Adds Unnecessary Complexity

Small number of services:

  • Fewer than 10-15 services
  • Service communication patterns are simple
  • Libraries or frameworks handle networking adequately

Homogeneous stack:

  • All services use same language/framework
  • Shared libraries provide consistent networking behavior
  • Observability already standardized

Limited operational resources:

  • Small team cannot maintain additional infrastructure
  • Lack of Kubernetes or containerization expertise
  • Cloud-managed services already provide needed features

Simple deployments:

  • Infrequent deployments
  • No need for canary or blue-green deployments
  • Static service topology

Service Mesh vs API Gateway

Service meshes and API gateways solve different problems.

Aspect                 API Gateway                        Service Mesh
─────────────────────  ─────────────────────────────────  ─────────────────────────────────────
Traffic direction      North-south (external → internal)  East-west (service → service)
Primary use case       Expose APIs to external clients    Manage internal service communication
Location               Edge of network                    Between services
Authentication         API keys, OAuth, JWT validation    mTLS, service identity
Rate limiting          Per-client, per-API-key            Per-service, per-endpoint
Protocol translation   REST → gRPC, HTTP → messaging      Typically same protocol
Deployment             Centralized gateway cluster        Sidecar per service

You often need both:

External Clients
       │
       ▼
┌─────────────────┐
│  API Gateway    │  ◀─── North-south traffic
│  - Auth         │
│  - Rate limit   │
│  - API keys     │
└─────────────────┘
       │
       ▼
Internal Services
       │
   ┌───┴────┬──────────┐
   ▼        ▼          ▼
┌──────┐ ┌──────┐  ┌──────┐
│ Svc A│ │ Svc B│  │ Svc C│
└──────┘ └──────┘  └──────┘
   ▲        ▲          ▲
   │        │          │
   └────────┴──────────┘
      Service Mesh    ◀─── East-west traffic
      - mTLS
      - Retries
      - Observability

Use API Gateway for: External API management, client authentication, request/response transformation.

Use Service Mesh for: Internal service resilience, mTLS, observability, traffic management.

Service Mesh Patterns

Sidecar Pattern

Deploy a proxy container alongside each application container.

How it works:

  • Application sends traffic to localhost
  • Sidecar intercepts traffic (via iptables or transparent proxy)
  • Sidecar handles networking concerns
  • Sidecar forwards to destination service

Benefits:

  • Application code unchanged
  • Polyglot support (any language works)
  • Upgrade networking independently of application

Challenges:

  • Resource overhead (CPU, memory per sidecar)
  • Additional container per pod (in Kubernetes)
  • Debugging complexity (extra layer of indirection)

Ingress and Egress Gateways

Control traffic entering and leaving the mesh.

Ingress Gateway: Entry point for external traffic into the mesh. Can replace traditional ingress controllers or load balancers.

Egress Gateway: Controlled exit point for traffic leaving the mesh to external services.

External Request                External Service
       │                               ▲
       ▼                               │
┌─────────────────┐           ┌────────────────┐
│ Ingress Gateway │           │ Egress Gateway │
│ - TLS term      │           │ - mTLS term    │
│ - AuthN/AuthZ   │           │ - Monitoring   │
└─────────────────┘           └────────────────┘
       │                               ▲
       │                               │
       │      Service Mesh             │
       └───────────┬───────────────────┘
                   │
            ┌──────┴──────┐
            ▼             ▼
         Service A    Service B

Egress gateway benefits:

  • Monitor and control traffic to external services
  • Apply policies to external calls
  • Centralized external API credentials
  • Prevent direct external access from services

Multi-Cluster Service Mesh

Extend service mesh across multiple Kubernetes clusters.

Use cases:

  • Multi-region deployments for low latency
  • Disaster recovery and failover
  • Gradual migration between clusters
  • Separate dev/staging/prod clusters with shared mesh

Challenges:

  • Network connectivity between clusters
  • Certificate and identity federation
  • Increased latency for cross-cluster calls
  • Complexity of debugging multi-cluster issues

Istio

Open-source service mesh originally developed by Google, IBM, and Lyft. Most feature-rich, most complex.

Architecture:

  • Data plane: Envoy proxy sidecars
  • Control plane: istiod (single binary consolidating Pilot, Citadel, Galley)

Strengths:

  • Comprehensive feature set
  • Strong security (mTLS, RBAC, authorization policies)
  • Advanced traffic management (weighted routing, mirroring)
  • Rich observability integration (Prometheus, Grafana, Jaeger, Kiali)
  • Large ecosystem and community

Challenges:

  • Steep learning curve
  • Resource intensive (control plane and sidecars)
  • Complex configuration (CRDs, YAML)
  • Performance overhead (powerful but heavier than alternatives)

Best for: Large organizations, complex multi-cluster deployments, comprehensive feature requirements.

Linkerd

Lightweight, Kubernetes-native service mesh focused on simplicity and performance.

Architecture:

  • Data plane: Linkerd2-proxy (custom Rust-based proxy, faster than Envoy)
  • Control plane: Minimal components (destination, identity, proxy-injector)

Strengths:

  • Simple to install and operate
  • Low resource overhead
  • Fast (optimized Rust proxy)
  • Security by default (automatic mTLS)
  • Excellent observability dashboard

Challenges:

  • Kubernetes-only (no support for VMs or other platforms)
  • Fewer features than Istio
  • Smaller ecosystem and community

Best for: Kubernetes-focused organizations, teams prioritizing simplicity, performance-sensitive workloads.

AWS App Mesh

AWS-managed service mesh for ECS, EKS, and EC2.

Architecture:

  • Data plane: Envoy proxy sidecars
  • Control plane: Managed by AWS

Strengths:

  • Integrated with AWS services (CloudWatch, X-Ray, CloudMap)
  • No control plane to manage
  • Works across ECS, EKS, and EC2 (VM-based services)
  • Pay-as-you-go pricing (no upfront cost)

Challenges:

  • AWS-specific (vendor lock-in)
  • Fewer features than Istio
  • Less mature than open-source alternatives

Best for: AWS-centric organizations, teams using ECS and EKS together, preference for managed services.

Official documentation: AWS App Mesh Documentation

Consul Connect

Service mesh from HashiCorp, part of Consul service discovery platform.

Architecture:

  • Data plane: Envoy or built-in proxy
  • Control plane: Consul servers

Strengths:

  • Multi-platform (Kubernetes, VMs, bare metal)
  • Multi-cloud and hybrid cloud support
  • Integrated with Consul service catalog
  • Strong multi-datacenter federation

Challenges:

  • Requires Consul knowledge
  • Less Kubernetes-native than Istio/Linkerd
  • Smaller community than Istio

Best for: Multi-cloud deployments, hybrid environments with VMs and containers, existing Consul users.

Service Mesh on AWS ECS (App Mesh)

AWS App Mesh is the recommended service mesh for Amazon ECS. It provides service mesh capabilities for containerized applications running on ECS (Fargate or EC2).

App Mesh Architecture on ECS

┌──────────────────────────────────────────────┐
│             AWS App Mesh                     │
│         (Control Plane - Managed)            │
└──────────────────────────────────────────────┘
                    │
        ┌───────────┴────────────┐
        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐
│  ECS Service A  │      │  ECS Service B  │
│  ┌───────────┐  │      │  ┌───────────┐  │
│  │ App       │  │      │  │ App       │  │
│  │ Container │  │      │  │ Container │  │
│  └───────────┘  │      │  └───────────┘  │
│  ┌───────────┐  │      │  ┌───────────┐  │
│  │ Envoy     │  │      │  │ Envoy     │  │
│  │ Sidecar   │  │      │  │ Sidecar   │  │
│  └───────────┘  │      │  └───────────┘  │
└─────────────────┘      └─────────────────┘

App Mesh Core Concepts

Mesh: Logical boundary for the service mesh (typically one per application or environment). Defines the egress filtering policy.

Virtual Service: Abstract name for a service (e.g., order-service.mesh.local). Clients call virtual services, not actual task IPs. Decouples service consumers from providers.

Virtual Node: Represents a logical service (ECS service or task group). Configures health checks, backends it can call, and service discovery mechanism.

Virtual Router: Routes traffic to virtual nodes based on rules (weighted routing, header matching). Enables canary deployments and A/B testing.

Route: Defines routing logic within a virtual router (HTTP routes, gRPC routes, TCP routes). Specifies retry policies, timeouts, and traffic distribution.

Virtual Gateway: Entry point for traffic from outside the mesh to services inside the mesh.

App Mesh Implementation Considerations

ECS Task Definition Setup:

  • Configure proxy configuration to integrate Envoy sidecar
  • Define app container dependencies on Envoy health check
  • Include X-Ray daemon container for distributed tracing
  • Set proper IAM roles for App Mesh API access

Service Discovery Integration:

  • App Mesh integrates with AWS Cloud Map for service discovery
  • Virtual nodes reference Cloud Map services
  • Enables dynamic endpoint discovery as tasks scale

Traffic Management Capabilities:

  • Weighted routing: Distribute traffic across service versions (90/10, 50/50, etc.)
  • Header-based routing: Route to different versions based on request headers
  • Retry policies: Configure automatic retries on failures with exponential backoff
  • Timeout policies: Set per-request and idle timeouts to prevent hanging requests
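The retry behavior described above can be sketched as jittered exponential backoff. This models the configured policy's effect, not App Mesh's actual implementation; parameters are illustrative:

```python
import random
import time

def call_with_retries(call, max_retries=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, retry_on=(ConnectionError, TimeoutError)):
    """Invoke `call`, retrying transient failures with jittered exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

Doubling the delay each attempt (capped at `max_delay`) gives a struggling service room to recover, and the jitter keeps many retrying clients from hammering it in lockstep.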

Security Features:

  • mTLS encryption: Configure TLS certificates from AWS Certificate Manager (ACM)
  • Client policy: Enforce TLS validation for outbound connections
  • IAM integration: Use IAM task roles to control access to App Mesh APIs
  • Egress filtering: Block or allow traffic to external services

Observability Integration:

  • CloudWatch Metrics: Envoy exports metrics automatically
  • AWS X-Ray: Distributed tracing for request paths
  • CloudWatch Logs: Access logs for all service-to-service calls
  • CloudWatch Alarms: Set alerts on error rates and latency

Canary Deployment Strategy:

  1. Deploy new version as separate virtual node
  2. Create route with small weight to canary (10%)
  3. Monitor error rates and latency metrics
  4. Gradually increase canary weight (25%, 50%, 75%, 100%)
  5. Remove old version after successful rollout
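The rollout steps above can be sketched as a promotion loop gated on an error-rate check. The threshold and the two callback hooks are hypothetical, standing in for App Mesh route updates and CloudWatch metrics:

```python
def run_canary_rollout(set_weights, canary_error_rate, max_error_rate=0.01,
                       stages=(10, 25, 50, 75, 100)):
    """Shift traffic to the canary in stages, rolling back if errors spike.

    `set_weights(stable, canary)` applies a traffic split; `canary_error_rate()`
    reports the canary's observed error rate after each stage.
    """
    for weight in stages:
        set_weights(100 - weight, weight)
        if canary_error_rate() > max_error_rate:
            set_weights(100, 0)  # roll back: all traffic to stable
            return False
    return True
```

The key property is that a bad release only ever sees the current stage's share of traffic, and rollback is a single route-weight change rather than a redeploy.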

Service Mesh on AWS EKS

EKS supports multiple service mesh options. Istio and Linkerd are the most popular.

Istio on EKS

Istio provides comprehensive service mesh capabilities with extensive features. It uses Envoy as the data plane proxy and a centralized control plane (istiod).

Installation approach: Use the Istio CLI (istioctl) or Helm charts to install the control plane and configure automatic sidecar injection for namespaces.

Core configuration resources:

  • VirtualService: Define routing rules (traffic splitting, header-based routing, retries, timeouts)
  • DestinationRule: Configure load balancing, connection pools, circuit breaking, TLS settings
  • Gateway: Define ingress/egress points for the mesh
  • PeerAuthentication: Enforce mTLS requirements
  • AuthorizationPolicy: Define which services can communicate

Traffic Management Patterns:

  • Canary deployments: Route percentage of traffic to new version based on weight
  • Header-based routing: Route beta users to new version while others use stable
  • Traffic mirroring: Copy production traffic to test environment for validation
  • Fault injection: Intentionally introduce delays or errors for chaos testing

Security Capabilities:

  • Automatic mTLS: Encrypt all service-to-service communication automatically
  • Service-level authorization: Control which services can call which endpoints
  • External CA integration: Use external certificate authorities for identity management
  • Request authentication: Validate JWT tokens from external identity providers

Observability Tools:

  • Prometheus: Automatic metrics collection from Envoy proxies
  • Grafana: Dashboards for service metrics and mesh health
  • Jaeger: Distributed tracing to visualize request flows
  • Kiali: Service mesh topology visualization and configuration management

Performance Considerations:

  • Higher resource overhead compared to Linkerd (Envoy proxy is heavier)
  • More complex configuration (many CRDs to learn)
  • Extensive features justify overhead for large-scale deployments

Linkerd on EKS

Linkerd focuses on simplicity and performance with a lightweight Rust-based proxy.

Installation approach: Use Linkerd CLI to install control plane components and enable automatic sidecar injection.

Core configuration resources:

  • TrafficSplit: SMI (Service Mesh Interface) resource for weighted traffic distribution
  • ServiceProfile: Define routes with retries, timeouts, and response classifications
  • Server: Define what ports a service exposes and protocol details
  • ServerAuthorization: Control which services can access specific servers

Traffic Management Patterns:

  • Traffic splitting: Simple percentage-based routing between service versions
  • Retry budgets: Limit total retry percentage to prevent retry storms
  • Timeouts: Per-route timeout configuration
  • Load balancing: Exponentially weighted moving average (EWMA) by default
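The EWMA idea can be sketched as keeping a decayed latency score per endpoint and picking the lowest. This is a simplified model; Linkerd's actual balancer also weighs in-flight load, and the decay factor here is illustrative:

```python
class EwmaBalancer:
    """Pick the endpoint with the lowest exponentially weighted
    moving-average latency."""

    def __init__(self, endpoints, alpha=0.3):
        self.alpha = alpha  # weight given to the newest latency sample
        # Scores start at zero, so unmeasured endpoints get tried first.
        self.scores = {ep: 0.0 for ep in endpoints}

    def record(self, endpoint, latency_ms):
        old = self.scores[endpoint]
        self.scores[endpoint] = self.alpha * latency_ms + (1 - self.alpha) * old

    def pick(self):
        return min(self.scores, key=self.scores.get)
```

Unlike round-robin, this steers traffic away from an endpoint as soon as its recent latencies climb, while the decay lets it earn traffic back once it recovers.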

Security Capabilities:

  • Automatic mTLS by default: All meshed traffic encrypted without configuration
  • Policy-based authorization: Define which service accounts can access which services
  • Zero-config security: Simpler than Istio with fewer knobs to turn

Observability Tools:

  • Linkerd Viz: Built-in dashboard for service metrics and topology
  • CLI observability: Real-time traffic monitoring via CLI (tap, stat, top)
  • Prometheus integration: Export metrics for external monitoring
  • Grafana dashboards: Pre-built dashboards for Linkerd metrics

Performance Advantages:

  • Lower resource overhead (Rust proxy is lighter than Envoy)
  • Faster request processing
  • Simpler architecture reduces operational complexity

Istio vs Linkerd on EKS

Aspect              Istio                                      Linkerd
──────────────────  ─────────────────────────────────────────  ─────────────────────────────────
Complexity          Higher (many CRDs, complex config)         Lower (simpler, opinionated)
Performance         Heavier (Envoy proxy)                      Lighter (Rust-based proxy)
Features            Comprehensive                              Focused (core features only)
Traffic management  Advanced (mirroring, fault injection)      Basic (splits, retries, timeouts)
Security            mTLS, authorization policies, external CA  mTLS, policy-based authorization
Observability       Prometheus, Grafana, Jaeger, Kiali         Linkerd Viz, Prometheus
Multi-cluster       Advanced federation                        Basic multi-cluster support
Community           Large, enterprise adoption                 Smaller, growing
Best for            Complex requirements, large scale          Simplicity, Kubernetes-native

Decision framework:

  • Choose Istio if you need advanced traffic management, multi-cluster support, fault injection, or comprehensive features
  • Choose Linkerd if you prioritize simplicity, performance, minimal resource overhead, and core service mesh capabilities

Service Mesh Best Practices

Start Small

Don’t mesh everything at once. Start with a few services, validate the approach, then expand.

Pilot phase:

  1. Choose 2-3 non-critical services
  2. Enable mesh features incrementally (observability → mTLS → traffic management)
  3. Monitor performance impact
  4. Train team on mesh operations
  5. Document runbooks for common tasks

Monitor Resource Overhead

Service mesh adds CPU and memory overhead.

Typical overhead:

  • Sidecar proxy: 50-100MB memory, 0.1-0.5 vCPU per pod
  • Control plane: 500MB-2GB memory, 1-2 vCPUs (varies by mesh)

Mitigation:

  • Set resource requests and limits for sidecars
  • Use smaller instance types or scale out
  • Monitor and right-size based on actual usage

Use Progressive Rollouts

Enable mesh features gradually.

Rollout sequence:

  1. Observability: Metrics and tracing (low risk, high value)
  2. mTLS: Encrypted communication (medium risk, high value)
  3. Traffic management: Retries, timeouts (test thoroughly)
  4. Advanced features: Circuit breaking, rate limiting (as needed)

Implement Health Checks

Configure health checks to remove unhealthy instances from load balancing.

Health check best practices:

  • Separate liveness and readiness probes
  • Test dependencies in readiness, not liveness
  • Set appropriate thresholds (don’t be too aggressive)
  • Monitor health check failures

Establish Rollback Procedures

Know how to quickly disable mesh features or remove mesh entirely.

Rollback strategies:

  • Feature flags to disable mesh routing
  • Remove sidecar injection annotation
  • Redeploy without mesh
  • Have documented runbook for emergency rollback

Key Takeaways

Service mesh solves cross-cutting concerns: Retries, timeouts, encryption, observability without changing application code.

Sidecar pattern is foundational: Proxy container alongside each application container handles networking.

Not always necessary: Small deployments with few services may not justify the complexity.

API gateway and service mesh are complementary: Use API gateway for north-south traffic, service mesh for east-west.

Multiple implementation options: Istio (feature-rich), Linkerd (simple), App Mesh (AWS-managed), Consul Connect (multi-platform).

AWS ECS uses App Mesh: Managed control plane, Envoy sidecars, integrated with CloudWatch and X-Ray.

AWS EKS supports Istio and Linkerd: Choose Istio for advanced features, Linkerd for simplicity and performance.

Start small and iterate: Pilot with a few services, enable features progressively, monitor resource overhead.

Security by default with mTLS: Automatic encryption and authentication between services without application changes.

Observability without instrumentation: Distributed tracing, metrics, and access logs generated automatically by the mesh.
