Service Mesh Architecture
What is a Service Mesh?
A service mesh is a dedicated infrastructure layer that manages service-to-service communication in a microservices architecture. It handles cross-cutting concerns like service discovery, load balancing, failure recovery, metrics collection, and security without requiring changes to application code.
Core premise: As microservices proliferate, each service reimplementing networking logic (retries, timeouts, circuit breakers, encryption) creates duplication and inconsistency. A service mesh centralizes these concerns in infrastructure.
Service mesh architecture:
```
 Application Container           Sidecar Proxy
┌─────────────────────┐       ┌──────────────────┐
│   Business Logic    │◀─────▶│   Envoy Proxy    │
│   (Your Code)       │       │   - Routing      │
└─────────────────────┘       │   - Retries      │
                              │   - Encryption   │
                              │   - Metrics      │
                              └──────────────────┘
                                       ▲
                                       │
                                       ▼
                              ┌──────────────────┐
                              │  Control Plane   │
                              │  - Configuration │
                              │  - Telemetry     │
                              │  - Service Disc. │
                              └──────────────────┘
```
Key components:
Data Plane: Network proxies (sidecars) deployed alongside each service instance. Handle all network traffic, enforce policies, collect telemetry.
Control Plane: Manages and configures data plane proxies. Provides service discovery, certificate management, policy distribution, telemetry aggregation.
Core Service Mesh Capabilities
Traffic Management
Control how requests flow between services.
Traffic routing:
- Request routing: Route based on headers, paths, query parameters
- Load balancing: Distribute requests across instances (round-robin, least-request, weighted)
- Traffic splitting: Canary deployments, A/B testing (90% to v1, 10% to v2)
- Traffic mirroring: Copy production traffic to test environment
Example: Canary deployment
```
      Incoming requests
              │
              ▼
     ┌─────────────────┐
     │  Service Mesh   │
     │  Traffic Split  │
     └─────────────────┘
              │
              ├─────90%────▶  Service v1 (stable)
              │
              └─────10%────▶  Service v2 (canary)
```
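For illustration, here is roughly how that 90/10 split could be expressed with Istio's VirtualService resource (the service name and subsets are placeholders; the subsets would be defined in a companion DestinationRule):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service            # hypothetical service name
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1        # stable version (subset defined in a DestinationRule)
          weight: 90
        - destination:
            host: my-service
            subset: v2        # canary version
          weight: 10
```

Promoting the canary is then a matter of adjusting the weights and re-applying, with no change to the services themselves.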
Resilience patterns:
- Timeouts: Prevent indefinite waiting
- Retries: Automatic retry with exponential backoff
- Circuit breakers: Stop calling failing services
- Rate limiting: Protect services from overload
- Bulkheading: Isolate failures to prevent cascade
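As a concrete sketch, circuit breaking and connection limits (a form of bulkheading) might look like this in an Istio DestinationRule; the thresholds are illustrative starting points, not recommendations:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service            # hypothetical service name
spec:
  host: my-service
  trafficPolicy:
    connectionPool:           # bulkhead: cap concurrent load per destination
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
    outlierDetection:         # circuit breaker: eject instances that keep failing
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```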
Service Discovery
Dynamically discover and connect to service instances.
How it works:
- Service registers with control plane
- Control plane maintains service registry
- Sidecars query control plane for service endpoints
- Control plane pushes configuration updates to sidecars
Benefits over DNS:
- Faster updates: No DNS TTL delays
- Health checking: Only route to healthy instances
- Rich metadata: Service version, zone, custom labels
- Dynamic configuration: Update routing without DNS changes
Security
Secure service-to-service communication without application code changes.
Mutual TLS (mTLS):
- Encrypt all traffic between services
- Authenticate both client and server
- Automatic certificate rotation
- Zero-trust networking (never trust, always verify)
```
  Service A                            Service B
┌────────────┐                       ┌────────────┐
│    App     │                       │    App     │
└────────────┘                       └────────────┘
      │                                    │
      ▼                                    ▼
┌────────────┐     mTLS Channel      ┌────────────┐
│ Sidecar A  │◀─────────────────────▶│ Sidecar B  │
│ (TLS cert) │      Encrypted        │ (TLS cert) │
└────────────┘                       └────────────┘
```
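In Istio, for example, strict mTLS can be enforced for an entire namespace with a single PeerAuthentication resource (the namespace name below is a placeholder); Linkerd enables mTLS automatically for all meshed traffic:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production      # hypothetical namespace
spec:
  mtls:
    mode: STRICT             # reject any plaintext traffic to meshed workloads
```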
Authorization policies:
- Define which services can communicate
- Enforce at the proxy level (before reaching application)
- Fine-grained rules (by service, method, path)
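Sketched with Istio's AuthorizationPolicy (the service names, namespace, and paths are placeholders), a rule allowing only a checkout service to call an orders API might look like:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-allow-checkout   # hypothetical policy name
  namespace: production
spec:
  selector:
    matchLabels:
      app: orders               # applies to the orders workload's sidecars
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/production/sa/checkout"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/api/orders/*"]
```

Because the proxy enforces this before the request reaches the application, a compromised or misconfigured service cannot bypass the rule.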
Observability
Gain visibility into service-to-service communication.
Distributed tracing:
- Automatic trace generation for requests
- Correlate requests across service boundaries
- Identify latency bottlenecks
- Visualize request flow
Metrics collection:
- Request rate, error rate, latency (RED metrics)
- Per-service, per-endpoint granularity
- No code instrumentation required
- Export to Prometheus, Grafana, CloudWatch
Access logging:
- Log all requests with source, destination, status, latency
- Centralized logging for debugging
- Audit trail for security compliance
When to Adopt a Service Mesh
Service meshes add complexity. Adopt when benefits justify the cost.
When Service Mesh Makes Sense
Large-scale microservices:
- Dozens or hundreds of services
- Service-to-service communication is complex
- Inconsistent implementation of resilience patterns across services
Polyglot architecture:
- Multiple languages and frameworks
- Cannot standardize on one library for networking concerns
- Need consistent observability across all services
Security requirements:
- Zero-trust networking mandated
- Encryption in transit required
- Fine-grained authorization needed
Operational challenges:
- Difficulty debugging distributed failures
- Lack of visibility into service dependencies
- Inconsistent retry and timeout behavior
When Service Mesh Adds Unnecessary Complexity
Small number of services:
- Fewer than 10-15 services
- Service communication patterns are simple
- Libraries or frameworks handle networking adequately
Homogeneous stack:
- All services use same language/framework
- Shared libraries provide consistent networking behavior
- Observability already standardized
Limited operational resources:
- Small team cannot maintain additional infrastructure
- Lack of Kubernetes or containerization expertise
- Cloud-managed services already provide needed features
Simple deployments:
- Infrequent deployments
- No need for canary or blue-green deployments
- Static service topology
Service Mesh vs API Gateway
Service meshes and API gateways solve different problems.
| Aspect | API Gateway | Service Mesh |
|---|---|---|
| Traffic direction | North-south (external → internal) | East-west (service → service) |
| Primary use case | Expose APIs to external clients | Manage internal service communication |
| Location | Edge of network | Between services |
| Authentication | API keys, OAuth, JWT validation | mTLS, service identity |
| Rate limiting | Per-client, per-API-key | Per-service, per-endpoint |
| Protocol translation | REST → gRPC, HTTP → messaging | Typically same protocol |
| Deployment | Centralized gateway cluster | Sidecar per service |
You often need both:
```
       External Clients
              │
              ▼
     ┌─────────────────┐
     │   API Gateway   │ ◀─── North-south traffic
     │   - Auth        │
     │   - Rate limit  │
     │   - API keys    │
     └─────────────────┘
              │
              ▼
      Internal Services
              │
      ┌───────┼────────┐
      ▼       ▼        ▼
  ┌──────┐ ┌──────┐ ┌──────┐
  │ Svc A│ │ Svc B│ │ Svc C│
  └──────┘ └──────┘ └──────┘
      ▲       ▲        ▲
      └───────┴────────┘
       Service Mesh ◀─── East-west traffic
       - mTLS
       - Retries
       - Observability
```
Use API Gateway for: External API management, client authentication, request/response transformation.
Use Service Mesh for: Internal service resilience, mTLS, observability, traffic management.
Service Mesh Patterns
Sidecar Pattern
Deploy a proxy container alongside each application container.
How it works:
- Application sends traffic to localhost
- Sidecar intercepts traffic (via iptables or transparent proxy)
- Sidecar handles networking concerns
- Sidecar forwards to destination service
Benefits:
- Application code unchanged
- Polyglot support (any language works)
- Upgrade networking independently of application
Challenges:
- Resource overhead (CPU, memory per sidecar)
- Additional container per pod (in Kubernetes)
- Debugging complexity (extra layer of indirection)
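In Kubernetes, sidecars are usually injected automatically rather than added to each manifest by hand. As a sketch, Istio keys off a namespace label while Linkerd uses an annotation:

```yaml
# Istio: label the namespace; new pods get an Envoy sidecar injected
apiVersion: v1
kind: Namespace
metadata:
  name: production           # hypothetical namespace
  labels:
    istio-injection: enabled
---
# Linkerd: annotate the namespace (or pod template) instead
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    linkerd.io/inject: enabled
```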
Ingress and Egress Gateways
Control traffic entering and leaving the mesh.
Ingress Gateway: Entry point for external traffic into the mesh. Replaces traditional ingress controllers or load balancers.
Egress Gateway: Controlled exit point for traffic leaving the mesh to external services.
```
 External Request                     External Service
        │                                     ▲
        ▼                                     │
┌─────────────────┐                 ┌────────────────┐
│ Ingress Gateway │                 │ Egress Gateway │
│ - TLS term      │                 │ - mTLS term    │
│ - AuthN/AuthZ   │                 │ - Monitoring   │
└─────────────────┘                 └────────────────┘
        │                                     ▲
        │            Service Mesh             │
        └─────────────────┬──────────────────┘
                          │
                   ┌──────┴──────┐
                   ▼             ▼
               Service A     Service B
```
Egress gateway benefits:
- Monitor and control traffic to external services
- Apply policies to external calls
- Centralized external API credentials
- Prevent direct external access from services
Multi-Cluster Service Mesh
Extend service mesh across multiple Kubernetes clusters.
Use cases:
- Multi-region deployments for low latency
- Disaster recovery and failover
- Gradual migration between clusters
- Separate dev/staging/prod clusters with shared mesh
Challenges:
- Network connectivity between clusters
- Certificate and identity federation
- Increased latency for cross-cluster calls
- Complexity of debugging multi-cluster issues
Popular Service Mesh Implementations
Istio
Open-source service mesh originally developed by Google, IBM, and Lyft. Most feature-rich, most complex.
Architecture:
- Data plane: Envoy proxy sidecars
- Control plane: istiod (single binary consolidating Pilot, Citadel, Galley)
Strengths:
- Comprehensive feature set
- Strong security (mTLS, RBAC, authorization policies)
- Advanced traffic management (weighted routing, mirroring)
- Rich observability integration (Prometheus, Grafana, Jaeger, Kiali)
- Large ecosystem and community
Challenges:
- Steep learning curve
- Resource intensive (control plane and sidecars)
- Complex configuration (CRDs, YAML)
- Performance overhead (powerful but heavier than alternatives)
Best for: Large organizations, complex multi-cluster deployments, comprehensive feature requirements.
Linkerd
Lightweight, Kubernetes-native service mesh focused on simplicity and performance.
Architecture:
- Data plane: Linkerd2-proxy (a purpose-built Rust proxy designed to be lighter and faster than Envoy)
- Control plane: Minimal components (destination, identity, proxy-injector)
Strengths:
- Simple to install and operate
- Low resource overhead
- Fast (optimized Rust proxy)
- Security by default (automatic mTLS)
- Excellent observability dashboard
Challenges:
- Kubernetes-only (no support for VMs or other platforms)
- Fewer features than Istio
- Smaller ecosystem and community
Best for: Kubernetes-focused organizations, teams prioritizing simplicity, performance-sensitive workloads.
AWS App Mesh
AWS-managed service mesh for ECS, EKS, and EC2.
Architecture:
- Data plane: Envoy proxy sidecars
- Control plane: Managed by AWS
Strengths:
- Integrated with AWS services (CloudWatch, X-Ray, CloudMap)
- No control plane to manage
- Works across ECS, EKS, and EC2 (VM-based services)
- Pay-as-you-go pricing (no upfront cost)
Challenges:
- AWS-specific (vendor lock-in)
- Fewer features than Istio
- Less mature than open-source alternatives
Best for: AWS-centric organizations, teams using ECS and EKS together, preference for managed services.
Official documentation: AWS App Mesh Documentation (https://docs.aws.amazon.com/app-mesh/)
Consul Connect
Service mesh from HashiCorp, part of Consul service discovery platform.
Architecture:
- Data plane: Envoy or built-in proxy
- Control plane: Consul servers
Strengths:
- Multi-platform (Kubernetes, VMs, bare metal)
- Multi-cloud and hybrid cloud support
- Integrated with Consul service catalog
- Strong multi-datacenter federation
Challenges:
- Requires Consul knowledge
- Less Kubernetes-native than Istio/Linkerd
- Smaller community than Istio
Best for: Multi-cloud deployments, hybrid environments with VMs and containers, existing Consul users.
Service Mesh on AWS ECS (App Mesh)
AWS App Mesh is AWS's native service mesh for Amazon ECS. It provides service mesh capabilities for containerized applications running on ECS (Fargate or EC2).
App Mesh Architecture on ECS
```
┌──────────────────────────────────────────────┐
│                 AWS App Mesh                 │
│           (Control Plane - Managed)          │
└──────────────────────────────────────────────┘
                       │
           ┌───────────┴────────────┐
           ▼                        ▼
  ┌─────────────────┐      ┌─────────────────┐
  │  ECS Service A  │      │  ECS Service B  │
  │  ┌───────────┐  │      │  ┌───────────┐  │
  │  │    App    │  │      │  │    App    │  │
  │  │ Container │  │      │  │ Container │  │
  │  └───────────┘  │      │  └───────────┘  │
  │  ┌───────────┐  │      │  ┌───────────┐  │
  │  │   Envoy   │  │      │  │   Envoy   │  │
  │  │  Sidecar  │  │      │  │  Sidecar  │  │
  │  └───────────┘  │      │  └───────────┘  │
  └─────────────────┘      └─────────────────┘
```
App Mesh Core Concepts
Mesh: Logical boundary for service mesh (typically one per application or environment). Defines egress filtering policy.
Virtual Service: Abstract name for a service (e.g., order-service.mesh.local). Clients call virtual services, not actual task IPs. Decouples service consumers from providers.
Virtual Node: Represents a logical service (ECS service or task group). Configures health checks, backends it can call, and service discovery mechanism.
Virtual Router: Routes traffic to virtual nodes based on rules (weighted routing, header matching). Enables canary deployments and A/B testing.
Route: Defines routing logic within a virtual router (HTTP routes, gRPC routes, TCP routes). Specifies retry policies, timeouts, and traffic distribution.
Virtual Gateway: Entry point for traffic from outside the mesh to services inside the mesh.
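On ECS, these resources are typically created with CloudFormation, the CDK, or the AWS CLI. A minimal CloudFormation sketch of a virtual node (the mesh, service, and namespace names are hypothetical):

```yaml
OrderVirtualNode:
  Type: AWS::AppMesh::VirtualNode
  Properties:
    MeshName: my-mesh
    VirtualNodeName: order-service-v1
    Spec:
      Listeners:
        - PortMapping:
            Port: 8080
            Protocol: http
          HealthCheck:                # only healthy tasks receive traffic
            Protocol: http
            Path: /health
            HealthyThreshold: 2
            UnhealthyThreshold: 3
            IntervalMillis: 5000
            TimeoutMillis: 2000
      ServiceDiscovery:
        AWSCloudMap:                  # endpoints resolved via Cloud Map
          NamespaceName: mesh.local
          ServiceName: order-service
      Backends:                       # services this node is allowed to call
        - VirtualService:
            VirtualServiceName: inventory-service.mesh.local
```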
App Mesh Implementation Considerations
ECS Task Definition Setup:
- Configure proxy configuration to integrate Envoy sidecar
- Define app container dependencies on Envoy health check
- Include X-Ray daemon container for distributed tracing
- Set proper IAM roles for App Mesh API access
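A sketch of the relevant CloudFormation fragment (the ports, names, UID, and Envoy image tag are illustrative; the egress-ignored IPs are the ECS metadata endpoints):

```yaml
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: order-service
    NetworkMode: awsvpc
    ProxyConfiguration:
      Type: APPMESH
      ContainerName: envoy
      ProxyConfigurationProperties:
        - Name: AppPorts
          Value: "8080"              # port(s) the app container listens on
        - Name: ProxyIngressPort
          Value: "15000"             # Envoy listener for inbound traffic
        - Name: ProxyEgressPort
          Value: "15001"             # Envoy listener for outbound traffic
        - Name: IgnoredUID
          Value: "1337"              # matches Envoy's UID so its own traffic bypasses interception
        - Name: EgressIgnoredIPs
          Value: "169.254.170.2,169.254.169.254"
    ContainerDefinitions:
      - Name: app
        Image: my-app:latest         # hypothetical application image
        DependsOn:
          - ContainerName: envoy
            Condition: HEALTHY       # app starts only once Envoy reports healthy
      - Name: envoy
        Image: public.ecr.aws/appmesh/aws-appmesh-envoy:v1.27.0.0-prod  # pin a current tag
        HealthCheck:
          Command: ["CMD-SHELL", "curl -s http://localhost:9901/server_info | grep state | grep -q LIVE"]
          Interval: 5
          Timeout: 2
          Retries: 3
        Environment:
          - Name: APPMESH_RESOURCE_ARN   # which virtual node this Envoy represents
            Value: arn:aws:appmesh:us-east-1:123456789012:mesh/my-mesh/virtualNode/order-service-v1
```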
Service Discovery Integration:
- App Mesh integrates with AWS Cloud Map for service discovery
- Virtual nodes reference Cloud Map services
- Enables dynamic endpoint discovery as tasks scale
Traffic Management Capabilities:
- Weighted routing: Distribute traffic across service versions (90/10, 50/50, etc.)
- Header-based routing: Route to different versions based on request headers
- Retry policies: Configure automatic retries on failures with exponential backoff
- Timeout policies: Set per-request and idle timeouts to prevent hanging requests
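A CloudFormation sketch of a weighted route with a retry policy (all names and values are illustrative); this is the kind of route a canary rollout adjusts over time:

```yaml
OrderRoute:
  Type: AWS::AppMesh::Route
  Properties:
    MeshName: my-mesh
    VirtualRouterName: order-router
    RouteName: order-route
    Spec:
      HttpRoute:
        Match:
          Prefix: /                   # match all requests
        Action:
          WeightedTargets:            # 90/10 canary split across virtual nodes
            - VirtualNode: order-service-v1
              Weight: 90
            - VirtualNode: order-service-v2
              Weight: 10
        RetryPolicy:
          MaxRetries: 3
          PerRetryTimeout:
            Unit: ms
            Value: 2000
          HttpRetryEvents:
            - server-error            # retry on 5xx responses
            - gateway-error
```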
Security Features:
- mTLS encryption: Configure TLS certificates from AWS Certificate Manager (ACM)
- Client policy: Enforce TLS validation for outbound connections
- IAM integration: Use IAM task roles to control access to App Mesh APIs
- Egress filtering: Block or allow traffic to external services
Observability Integration:
- CloudWatch Metrics: Envoy exports metrics automatically
- AWS X-Ray: Distributed tracing for request paths
- CloudWatch Logs: Access logs for all service-to-service calls
- CloudWatch Alarms: Set alerts on error rates and latency
Canary Deployment Strategy:
- Deploy new version as separate virtual node
- Create route with small weight to canary (10%)
- Monitor error rates and latency metrics
- Gradually increase canary weight (25%, 50%, 75%, 100%)
- Remove old version after successful rollout
Service Mesh on AWS EKS
EKS supports multiple service mesh options. Istio and Linkerd are the most popular.
Istio on EKS
Istio provides comprehensive service mesh capabilities with extensive features. It uses Envoy as the data plane proxy and a centralized control plane (istiod).
Installation approach: Use the Istio CLI (istioctl) or Helm charts to install the control plane and configure automatic sidecar injection for namespaces.
Core configuration resources:
- VirtualService: Define routing rules (traffic splitting, header-based routing, retries, timeouts)
- DestinationRule: Configure load balancing, connection pools, circuit breaking, TLS settings
- Gateway: Define ingress/egress points for the mesh
- PeerAuthentication: Enforce mTLS requirements
- AuthorizationPolicy: Define which services can communicate
Traffic Management Patterns:
- Canary deployments: Route percentage of traffic to new version based on weight
- Header-based routing: Route beta users to new version while others use stable
- Traffic mirroring: Copy production traffic to test environment for validation
- Fault injection: Intentionally introduce delays or errors for chaos testing
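As a sketch, traffic mirroring in a VirtualService might look like the following (the names are placeholders; mirrored responses are discarded, so live clients are unaffected):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders                # hypothetical service name
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
            subset: v1
          weight: 100         # all live traffic is still served by v1
      mirror:
        host: orders
        subset: v2            # a copy of each sampled request goes to v2
      mirrorPercentage:
        value: 10.0           # mirror only 10% of requests
```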
Security Capabilities:
- Automatic mTLS: Encrypt all service-to-service communication automatically
- Service-level authorization: Control which services can call which endpoints
- External CA integration: Use external certificate authorities for identity management
- Request authentication: Validate JWT tokens from external identity providers
Observability Tools:
- Prometheus: Automatic metrics collection from Envoy proxies
- Grafana: Dashboards for service metrics and mesh health
- Jaeger: Distributed tracing to visualize request flows
- Kiali: Service mesh topology visualization and configuration management
Performance Considerations:
- Higher resource overhead compared to Linkerd (Envoy proxy is heavier)
- More complex configuration (many CRDs to learn)
- Extensive features justify overhead for large-scale deployments
Linkerd on EKS
Linkerd focuses on simplicity and performance with a lightweight Rust-based proxy.
Installation approach: Use Linkerd CLI to install control plane components and enable automatic sidecar injection.
Core configuration resources:
- TrafficSplit: SMI (Service Mesh Interface) resource for weighted traffic distribution
- ServiceProfile: Define routes with retries, timeouts, and response classifications
- Server: Define what ports a service exposes and protocol details
- ServerAuthorization: Control which services can access specific servers
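Two minimal sketches (the names, weights, routes, and thresholds are illustrative): a TrafficSplit for a 90/10 canary, and a ServiceProfile route with a retry budget:

```yaml
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: orders-split
  namespace: production
spec:
  service: orders              # apex service that clients address
  backends:
    - service: orders-v1
      weight: 90
    - service: orders-v2
      weight: 10
---
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: orders.production.svc.cluster.local
  namespace: production
spec:
  routes:
    - name: GET /api/orders
      condition:
        method: GET
        pathRegex: /api/orders
      isRetryable: true        # safe to retry (idempotent)
      timeout: 300ms
  retryBudget:                 # cap retries at 20% of live traffic
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```

The retry budget is what prevents retry storms: retries are bounded as a fraction of live request volume rather than multiplied per request.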
Traffic Management Patterns:
- Traffic splitting: Simple percentage-based routing between service versions
- Retry budgets: Limit total retry percentage to prevent retry storms
- Timeouts: Per-route timeout configuration
- Load balancing: Exponentially weighted moving average (EWMA) by default
Security Capabilities:
- Automatic mTLS by default: All meshed traffic encrypted without configuration
- Policy-based authorization: Define which service accounts can access which services
- Zero-config security: Simpler than Istio with fewer knobs to turn
Observability Tools:
- Linkerd Viz: Built-in dashboard for service metrics and topology
- CLI observability: Real-time traffic monitoring via CLI (tap, stat, top)
- Prometheus integration: Export metrics for external monitoring
- Grafana dashboards: Pre-built dashboards for Linkerd metrics
Performance Advantages:
- Lower resource overhead (Rust proxy is lighter than Envoy)
- Faster request processing
- Simpler architecture reduces operational complexity
Istio vs Linkerd on EKS
| Aspect | Istio | Linkerd |
|---|---|---|
| Complexity | Higher (many CRDs, complex config) | Lower (simpler, opinionated) |
| Performance | Heavier (Envoy proxy) | Lighter (Rust-based proxy) |
| Features | Comprehensive | Focused (core features only) |
| Traffic management | Advanced (mirroring, fault injection) | Basic (splits, retries, timeouts) |
| Security | mTLS, authorization policies, external CA | mTLS, authorization, policy |
| Observability | Prometheus, Grafana, Jaeger, Kiali | Linkerd Viz, Prometheus |
| Multi-cluster | Advanced federation | Basic multi-cluster support |
| Community | Large, enterprise adoption | Smaller, growing |
| Best for | Complex requirements, large scale | Simplicity, Kubernetes-native |
Decision framework:
- Choose Istio if you need advanced traffic management, multi-cluster support, fault injection, or comprehensive features
- Choose Linkerd if you prioritize simplicity, performance, minimal resource overhead, and core service mesh capabilities
Official documentation: Istio (https://istio.io/latest/docs/) and Linkerd (https://linkerd.io/docs/)
Service Mesh Best Practices
Start Small
Don’t mesh everything at once. Start with a few services, validate the approach, then expand.
Pilot phase:
- Choose 2-3 non-critical services
- Enable mesh features incrementally (observability → mTLS → traffic management)
- Monitor performance impact
- Train team on mesh operations
- Document runbooks for common tasks
Monitor Resource Overhead
Service mesh adds CPU and memory overhead.
Typical overhead:
- Sidecar proxy: 50-100MB memory, 0.1-0.5 vCPU per pod
- Control plane: 500MB-2GB memory, 1-2 vCPUs (varies by mesh)
Mitigation:
- Set resource requests and limits for sidecars
- Use smaller instance types or scale out
- Monitor and right-size based on actual usage
Use Progressive Rollouts
Enable mesh features gradually.
Rollout sequence:
- Observability: Metrics and tracing (low risk, high value)
- mTLS: Encrypted communication (medium risk, high value)
- Traffic management: Retries, timeouts (test thoroughly)
- Advanced features: Circuit breaking, rate limiting (as needed)
Implement Health Checks
Configure health checks to remove unhealthy instances from load balancing.
Health check best practices:
- Separate liveness and readiness probes
- Test dependencies in readiness, not liveness
- Set appropriate thresholds (don’t be too aggressive)
- Monitor health check failures
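A minimal Kubernetes sketch of this separation (the image, paths, port, and thresholds are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: my-app:1.0            # hypothetical image
      livenessProbe:               # "is the process alive?" - no dependency checks
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3        # tolerate transient blips before restarting
      readinessProbe:              # "can it serve traffic?" - may check dependencies
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 2
```

Keeping dependency checks out of the liveness probe prevents a downstream outage from restart-looping every pod that depends on it.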
Establish Rollback Procedures
Know how to quickly disable mesh features or remove mesh entirely.
Rollback strategies:
- Feature flags to disable mesh routing
- Remove sidecar injection annotation
- Redeploy without mesh
- Have documented runbook for emergency rollback
Key Takeaways
Service mesh solves cross-cutting concerns: Retries, timeouts, encryption, observability without changing application code.
Sidecar pattern is foundational: Proxy container alongside each application container handles networking.
Not always necessary: Small deployments with few services may not justify the complexity.
API gateway and service mesh are complementary: Use API gateway for north-south traffic, service mesh for east-west.
Multiple implementation options: Istio (feature-rich), Linkerd (simple), App Mesh (AWS-managed), Consul Connect (multi-platform).
AWS ECS uses App Mesh: Managed control plane, Envoy sidecars, integrated with CloudWatch and X-Ray.
AWS EKS supports Istio and Linkerd: Choose Istio for advanced features, Linkerd for simplicity and performance.
Start small and iterate: Pilot with a few services, enable features progressively, monitor resource overhead.
Security by default with mTLS: Automatic encryption and authentication between services without application changes.
Observability without instrumentation: Distributed tracing, metrics, and access logs generated automatically by the mesh.