Reliability Patterns

Reliability patterns help systems handle failures gracefully, maintain availability, and provide consistent service even when individual components fail.

Circuit Breaker

Pattern described by Michael Nygard in “Release It!” (2007)

Prevents cascading failures by monitoring failures and “opening the circuit” to stop requests to a failing service, similar to an electrical circuit breaker. Provides fast failure and fallback responses instead of waiting for timeouts.

States:

  • Closed: Normal operation, requests pass through (circuit is complete)
  • Open: Service is failing, requests immediately fail without calling service
  • Half-Open: After timeout, allows test requests to check if service recovered

Use When:

  • Calling external services that may be unreliable
  • Want to prevent cascading failures
  • Need to provide fallback behavior during outages
  • Timeouts alone aren’t sufficient (e.g., the service takes a long time to respond even when healthy)

Example: Payment service that stops calling external payment gateway when it’s down and returns cached “payment pending” responses instead of timing out.

Closed: Request → Payment Gateway (success rate > 90%)
↓ (failures exceed threshold, e.g., 50% failures in 10 requests)
Open: Request → Immediate failure with fallback (circuit stays open for 60 seconds)
↓ (timeout expires)
Half-Open: Test request → Payment Gateway
  If success → Closed
  If failure → Open (reset timer)

Common implementations: Resilience4j, Hystrix (deprecated), Polly (.NET)
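
A minimal sketch of the state machine in Python; the class, its thresholds, and the CircuitOpenError exception are illustrative and not taken from any of the libraries above (it counts consecutive failures rather than a failure rate, for brevity):

import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected immediately."""

class CircuitBreaker:
    """Minimal three-state circuit breaker: Closed -> Open -> Half-Open."""

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
        self.failure_count = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open, failing fast")
            self.state = "HALF_OPEN"  # timeout expired, allow a test request

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.state = "CLOSED"  # success closes the circuit
            self.failure_count = 0
            return result

A caller such as the payment service above would catch CircuitOpenError (together with the gateway’s own errors) and return the cached “payment pending” response instead of waiting on a timeout.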


Retry Pattern

Automatically retries failed operations with appropriate delays and limits. Essential for handling transient failures in distributed systems.

Use When:

  • Operations may fail due to temporary issues (network blips, momentary overload)
  • Network calls that might timeout
  • Operations are idempotent (safe to retry without side effects)

Retry Strategies:

  • Fixed Delay: Wait the same amount of time between retries (simple, but can cause a thundering herd)
  • Exponential Backoff: Increase the delay exponentially (1s, 2s, 4s, 8s, …)
    • Recommended for most scenarios
  • Exponential Backoff with Jitter: Add randomness to prevent synchronized retries
    • Best practice: prevents the thundering herd problem
    • Formula: delay = base_delay * (2^attempt) + random(0, jitter)

Considerations:

  • Only retry transient failures (don’t retry 400/401/403 HTTP errors)
  • Implement maximum retry limits (typically 3-5 attempts)
  • Ensure operations are idempotent
  • Consider using circuit breaker alongside retries
  • Thundering herd problem: Many clients retrying simultaneously can overwhelm a recovering service

Example: API client retrying failed HTTP requests with exponential backoff.

Attempt 1: Immediate → Fail (503 Service Unavailable)
Attempt 2: Wait 1s → Fail
Attempt 3: Wait 2s → Fail
Attempt 4: Wait 4s → Success (200 OK)

Common implementations: Resilience4j, Polly, AWS SDK built-in retries
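
A minimal sketch of exponential backoff with jitter in Python, following the formula above; the TransientError type and the default limits are illustrative:

import random
import time

class TransientError(Exception):
    """A failure that is safe to retry (e.g. a 503 response or a timeout)."""

def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, jitter=0.5):
    """Retry an idempotent operation using exponential backoff with jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted, surface the failure to the caller
            # delay = base_delay * (2^attempt) + random(0, jitter)
            delay = base_delay * (2 ** attempt) + random.uniform(0, jitter)
            time.sleep(delay)

With the defaults this waits roughly 1s, 2s, and 4s (plus jitter) between the four attempts, matching the timeline in the example above. Non-transient errors are not caught, so they fail immediately without retries.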


Bulkhead Pattern

Isolates system resources to prevent failures in one area from affecting others, similar to ship compartments.

Use When:

  • Different operations have varying criticality
  • Want to isolate resource usage
  • Need to maintain core functionality during partial failures

Implementation Approaches:

  • Separate connection pools
  • Different thread pools
  • Resource quotas and limits

Example: Web application with separate thread pools for user requests, admin operations, and background tasks to prevent admin operations from blocking user traffic.

User Thread Pool (50 threads) → User Requests
Admin Thread Pool (10 threads) → Admin Operations
Background Thread Pool (20 threads) → Background Tasks

If admin operations saturate their pool, user requests are unaffected.
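
A minimal sketch of thread-pool bulkheads in Python using the pool sizes above; the handler functions are illustrative placeholders:

from concurrent.futures import ThreadPoolExecutor

# One pool per workload class: saturating one pool cannot starve the others.
user_pool = ThreadPoolExecutor(max_workers=50, thread_name_prefix="user")
admin_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="admin")
background_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="background")

def process_user_request(request_id):
    return f"handled user request {request_id}"

def run_admin_report(report_name):
    return f"generated report {report_name}"

# Even if all 10 admin threads are busy generating reports,
# user requests still have 50 dedicated threads available.
user_future = user_pool.submit(process_user_request, 42)
admin_future = admin_pool.submit(run_admin_report, "daily-usage")
print(user_future.result(), admin_future.result())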

Timeout Pattern

Sets maximum wait time for operations to complete, preventing indefinite blocking.

Use When:

  • Making network calls
  • Operations that might hang indefinitely
  • Need predictable response times

Example: API gateway with 30-second timeout for backend services, returning error response if backend doesn’t respond in time.

Request → Backend Service
  If response within 30s → Return response
  If no response after 30s → Return 504 Gateway Timeout
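
A minimal sketch of this behaviour in Python, assuming the backend is reached over HTTP with the requests library; the URL and response handling are illustrative:

import requests

BACKEND_URL = "https://backend.internal/api/orders"  # illustrative backend endpoint

def forward_request(payload):
    try:
        # Give up if the backend has not responded within 30 seconds.
        response = requests.post(BACKEND_URL, json=payload, timeout=30)
        return response.status_code, response.json()
    except requests.Timeout:
        # Map the timeout to a gateway-style error instead of blocking the caller.
        return 504, {"error": "Gateway Timeout"}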

Quick Reference

Pattern Comparison

Pattern            Purpose                       Complexity   When to Use
Circuit Breaker    Prevent cascading failures    Medium       External service calls
Retry              Handle transient failures     Low          Network operations
Bulkhead           Isolate failures              Medium       Different criticality levels
Timeout            Prevent indefinite waits      Low          Any network operation

Combining Patterns

Circuit Breaker + Retry + Timeout:

Retry (3 attempts with backoff)
  ↓
Circuit Breaker (stops retries if service consistently fails)
  ↓
Timeout (ensures each attempt has max wait time)
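
A sketch of how the three layers compose, reusing the CircuitBreaker and retry_with_backoff sketches from the sections above; the URL and error mapping are illustrative:

import requests

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)

def fetch_orders():
    # Timeout: each individual attempt waits at most 5 seconds.
    response = requests.get("https://backend.internal/orders", timeout=5)
    response.raise_for_status()
    return response.json()

def one_attempt():
    try:
        # Circuit breaker guards every attempt made by the retry layer.
        return breaker.call(fetch_orders)
    except requests.Timeout as exc:
        raise TransientError(str(exc))      # timeouts are transient: back off and retry
    except requests.HTTPError as exc:
        if exc.response is not None and exc.response.status_code >= 500:
            raise TransientError(str(exc))  # 5xx is transient
        raise                               # 4xx errors are not retried

def resilient_fetch_orders():
    try:
        return retry_with_backoff(one_attempt, max_attempts=3)
    except (TransientError, CircuitOpenError):
        # Circuit is open or retries are exhausted: serve a degraded response.
        return {"status": "degraded", "orders": []}

CircuitOpenError is deliberately not treated as transient, so once the circuit opens the retry loop stops immediately and the fallback is returned, as the diagram above describes.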

Decision Tree

Preventing cascading failures? → Circuit Breaker
Temporary network glitches? → Retry with backoff
Isolating critical vs non-critical? → Bulkhead
Operations might hang? → Timeout

Best Practices

  • Circuit Breaker: Monitor failure rate, adjust thresholds based on SLA
  • Retry: Use exponential backoff with jitter, limit max retries
  • Bulkhead: Size pools based on resource capacity and criticality
  • Timeout: Set based on P99 latency, not average

