Multi-Region Architecture on AWS for System Architects
Table of Contents
- What Problems Multi-Region Architecture Solves
- Multi-Region Deployment Patterns
- Data Replication Strategies
- Traffic Routing with Route 53
- Multi-Region Application Architectures
- Consistency Models & Trade-offs
- Cost Optimization Strategies
- Monitoring & Observability
- When to Use Multi-Region vs Single-Region
- Common Pitfalls
- Key Takeaways
What Problems Multi-Region Architecture Solves
Single-region deployments introduce risks:
- Regional outages: entire AWS regions have failed (notably the us-east-1 outages of 2017, 2021, and 2023)
- Example: the December 2021 us-east-1 outage took down major services for 7+ hours
- Impact: complete downtime for single-region deployments
- Latency for global users: a user in Sydney accessing us-east-1 sees 200-300ms of baseline network latency
- Example: Web app renders in 50ms locally but 300ms globally
- Impact: Poor user experience, reduced conversion rates
- Compliance & data residency: Regulations require data stored in specific regions
- Example: GDPR requires EU customer data stored in EU
- Impact: Cannot serve EU customers from US-only deployment
- Disaster recovery limitations: Single-region DR still vulnerable to regional failure
- Example: Multi-AZ RDS survives AZ failure, not region failure
- Impact: Cannot meet aggressive RPO/RTO targets
Multi-region architecture solves:
- Regional failure tolerance: Continue operating if entire region fails
- Global performance: Serve users from nearest region (50-300ms latency reduction)
- Compliance: Store and process data in required regions
- Aggressive DR targets: RTO/RPO in seconds/minutes vs hours/days
Multi-Region Deployment Patterns
Pattern 1: Active-Passive (Warm Standby)
Architecture: Primary region serves 100% traffic; secondary region idle (warm) or minimal traffic.
Components:
- Primary: Full production environment (compute, databases, cache)
- Secondary: Scaled-down environment (e.g., 10% capacity) or fully provisioned but idle
Failover mechanism:
- Health check detects primary region failure
- Route 53 failover routing switches traffic to secondary
- Secondary scales up to full capacity (if scaled down)
- Traffic drains from the primary as clients re-resolve DNS and reach the secondary
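The scale-up step can be automated. A minimal boto3 sketch, assuming a hypothetical Auto Scaling group name and that the function is invoked when the primary health-check alarm fires (e.g., SNS → Lambda):
import boto3

# Secondary-region client; runs when the primary region is declared unhealthy
autoscaling = boto3.client('autoscaling', region_name='us-west-2')

def scale_up_secondary(event, context):
    # Raise the warm standby from ~10% capacity to full capacity
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName='web-app-secondary-asg',  # hypothetical name
        MinSize=100,
        MaxSize=150,
        DesiredCapacity=100,
    )
    return {'status': 'secondary scaling to full capacity'}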
Cost profile:
- Primary: 100% of single-region cost
- Secondary (scaled down): ~20% of primary cost (minimal compute, full data replication)
- Total: ~120% of single-region cost
Example configuration:
{
"Primary (us-east-1)": {
"Compute": "100 EC2 instances (m5.large)",
"Database": "RDS Multi-AZ (db.r5.2xlarge)",
"Cache": "ElastiCache cluster (3 nodes)",
"Cost": "$15,000/month"
},
"Secondary (us-west-2)": {
"Compute": "10 EC2 instances (m5.large) + ASG",
"Database": "RDS Read Replica",
"Cache": "ElastiCache cluster (1 node)",
"Cost": "$3,000/month"
},
"Data Transfer": "$500/month (replication)",
"Total": "$18,500/month"
}
RTO/RPO:
- RPO: ~5-15 minutes (last replicated transaction)
- RTO: 5-30 minutes (DNS propagation + scale-up)
When to use:
- Cost-sensitive deployments
- Can tolerate 5-30 minute RTO
- Infrequent failover (testing quarterly)
Pattern 2: Active-Active (Multi-Region)
Architecture: Multiple regions serve production traffic simultaneously.
Components:
- All regions: Full production environment
- Traffic distribution: Route 53 geo-proximity or latency-based routing
- Data synchronization: Bi-directional replication (DynamoDB Global Tables, Aurora Global Database)
Advantages:
- Zero RTO: Failure in one region = traffic automatically shifts to healthy regions
- Performance: Users served from nearest region (latency optimized)
- Regional capacity: Each region handles partial load (easier to scale)
Challenges:
- Data consistency: Multi-region writes require conflict resolution
- Cost: 2× compute costs (full capacity in each region)
- Complexity: Orchestrate deployments across regions, manage split-brain scenarios
Cost profile:
- Region 1: $15,000/month
- Region 2: $15,000/month
- Data replication: $2,000/month (cross-region transfer)
- Total: $32,000/month (2.1× single-region)
Example configuration:
{
"us-east-1": {
"Compute": "100 EC2 instances",
"Database": "Aurora Global Database (writer)",
"Cache": "ElastiCache cluster",
"Traffic": "50% (geo-proximity routing)"
},
"eu-west-1": {
"Compute": "100 EC2 instances",
"Database": "Aurora Global Database (writer)",
"Cache": "ElastiCache cluster",
"Traffic": "50% (geo-proximity routing)"
}
}
RTO/RPO:
- RPO: Near-zero (continuous replication, <1s lag)
- RTO: Near-zero (automatic failover via Route 53 health checks)
When to use:
- Mission-critical applications (zero downtime requirement)
- Global user base (latency optimization)
- Can justify 2× cost for availability/performance
Pattern 3: Multi-Region Read Replicas
Architecture: Primary region handles writes; multiple regions handle reads.
Components:
- Primary (us-east-1): Write traffic, authoritative data source
- Secondaries (eu-west-1, ap-southeast-1): Read-only replicas
Use cases:
- Read-heavy workloads (90% reads, 10% writes)
- Geo-distributed read performance required
- Write latency acceptable for global users
Cost profile:
- Primary: $15,000/month
- Read replicas (2 regions): $10,000/month ($5K each)
- Data replication: $1,000/month
- Total: $26,000/month (1.7× single-region)
RTO/RPO:
- RPO: 5-60 seconds (replication lag)
- RTO: 5-15 minutes (promote read replica to writer)
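The promotion behind this RTO can be scripted. A sketch with boto3, assuming RDS (non-Aurora) cross-region read replicas and a hypothetical instance identifier; Aurora Global Database uses a different mechanism, covered under Data Replication Strategies below:
import boto3

# Promote the surviving region's read replica to a standalone, writable instance
rds = boto3.client('rds', region_name='eu-west-1')

rds.promote_read_replica(
    DBInstanceIdentifier='app-db-replica-eu-west-1',  # hypothetical identifier
    BackupRetentionPeriod=7,  # re-enable automated backups on the new primary
)

# Wait until the promoted instance is available before repointing the application
waiter = rds.get_waiter('db_instance_available')
waiter.wait(DBInstanceIdentifier='app-db-replica-eu-west-1')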
When to use:
- Read-heavy applications (content delivery, e-commerce product catalog)
- Write centralization acceptable (lower complexity than active-active)
- Cost-conscious (cheaper than active-active)
Data Replication Strategies
Aurora Global Database
Architecture: Primary region writer, up to 5 secondary regions with read replicas.
Replication characteristics:
- Lag: Typically <1 second
- Method: Physical replication (storage layer)
- Consistency: Eventual (read replicas lag behind writer)
- Failover: Promotes secondary to primary in <1 minute
Configuration:
# Create global database
aws rds create-global-cluster \
--global-cluster-identifier my-global-db \
--engine aurora-mysql \
--engine-version 8.0.mysql_aurora.3.04.0
# Add primary cluster
aws rds create-db-cluster \
--db-cluster-identifier primary-cluster \
--engine aurora-mysql \
--global-cluster-identifier my-global-db \
--master-username admin \
--master-user-password <password> \
--region us-east-1
# Add secondary cluster
aws rds create-db-cluster \
--db-cluster-identifier secondary-cluster \
--engine aurora-mysql \
--global-cluster-identifier my-global-db \
--region eu-west-1
Failover process:
# Detach secondary from global database
aws rds remove-from-global-cluster \
--db-cluster-identifier secondary-cluster \
--global-cluster-identifier my-global-db \
--region eu-west-1
# Promote secondary to standalone (now writable)
# Secondary is now independent writer
# Application connection strings updated to point to new writer
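The same detach-and-promote flow, sketched with boto3 (identifiers follow the CLI example above; the API documents the detached cluster by its ARN, so the account ID here is a placeholder):
import boto3

# Run against the secondary region that should take over writes
rds = boto3.client('rds', region_name='eu-west-1')

# Detaching the secondary promotes it to a standalone, writable cluster
rds.remove_from_global_cluster(
    GlobalClusterIdentifier='my-global-db',
    DbClusterIdentifier='arn:aws:rds:eu-west-1:123456789012:cluster:secondary-cluster',
)

# Wait for the cluster, then repoint application connection strings to its writer endpoint
waiter = rds.get_waiter('db_cluster_available')
waiter.wait(DBClusterIdentifier='secondary-cluster')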
Cost:
- Primary cluster: Standard Aurora pricing
- Secondary clusters: Standard Aurora pricing (billed per region)
- Data transfer: $0.02/GB cross-region replication
When to use:
- Relational data requiring low-latency reads globally
- Can tolerate <1s replication lag
- Need fast failover (<1 minute RTO)
DynamoDB Global Tables
Architecture: Multi-region, multi-master (all regions writable).
Replication characteristics:
- Lag: Typically <1 second
- Method: Asynchronous replication (stream-based)
- Consistency: Eventual (last-writer-wins conflict resolution)
- Failover: No failover needed (all regions always writable)
Configuration:
# Create table in primary region
aws dynamodb create-table \
--table-name users \
--attribute-definitions \
AttributeName=user_id,AttributeType=S \
--key-schema \
AttributeName=user_id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
--region us-east-1
# Create global table
aws dynamodb update-table \
--table-name users \
--replica-updates '[
{"Create": {"RegionName": "eu-west-1"}},
{"Create": {"RegionName": "ap-southeast-1"}}
]' \
--region us-east-1
Conflict resolution:
DynamoDB uses last-writer-wins based on timestamp:
# Scenario: concurrent writes to the same item from us-east-1 and eu-west-1
import boto3

table_us = boto3.resource('dynamodb', region_name='us-east-1').Table('users')
table_eu = boto3.resource('dynamodb', region_name='eu-west-1').Table('users')

# us-east-1 (10:00:00.500)
table_us.put_item(
    Item={'user_id': 'user123', 'name': 'Alice', 'email': 'alice@example.com'}
)

# eu-west-1 (10:00:00.700), 200ms later
table_eu.put_item(
    Item={'user_id': 'user123', 'name': 'Alice', 'email': 'alice@newdomain.com'}
)

# Result once replication converges: email = 'alice@newdomain.com' in both regions
# (the most recent write wins)
Cost:
- Replicated write capacity (rWCU/rWRU): each write is billed in every replica region, so write cost scales with the number of regions (2× for two regions, 3× for three)
- Storage: Charged per region (duplicate data in each region)
- Data transfer: $0.02/GB cross-region replication
Example (provisioned write capacity):
- 100 WCU, replicated across 3 regions
- Cost: 100 WCU × 3 regions × $0.00065/WCU-hour = $0.195/hour ≈ $141/month
When to use:
- NoSQL data model fits requirements
- Need low-latency writes globally (not just reads)
- Can tolerate eventual consistency and last-writer-wins
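If silent lost updates from last-writer-wins are a concern, a common mitigation is optimistic locking with a version attribute, sketched below under assumed table and attribute names. One caveat: the condition is evaluated only in the region that receives the write, so this guards against concurrent writers in the same region; truly simultaneous writes from different regions still resolve by last-writer-wins, which is why many designs route all writes for a given item to a single home region.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb', region_name='us-east-1').Table('users')

def update_email(user_id, new_email, expected_version):
    # Optimistic locking: only write if nobody bumped the version since we read it
    try:
        table.update_item(
            Key={'user_id': user_id},
            UpdateExpression='SET email = :e, #v = :new_v',
            ConditionExpression='#v = :old_v',
            ExpressionAttributeNames={'#v': 'version'},
            ExpressionAttributeValues={
                ':e': new_email,
                ':new_v': expected_version + 1,
                ':old_v': expected_version,
            },
        )
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            # Another writer won: re-read the item, merge, and retry
            raise
        raise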
S3 Cross-Region Replication (CRR)
Architecture: Replicate objects from source bucket to destination bucket(s) in different region(s).
Replication characteristics:
- Lag: Most objects replicate within minutes; 99.99% within 15 minutes when Replication Time Control (RTC) is enabled
- Method: Asynchronous object-level replication
- Consistency: Eventual (objects replicate after write)
- Filtering: Replicate by prefix, tags, or all objects
Configuration:
{
"Role": "arn:aws:iam::123456789012:role/s3-replication-role",
"Rules": [
{
"Id": "ReplicateAll",
"Status": "Enabled",
"Priority": 1,
"Filter": {},
"Destination": {
"Bucket": "arn:aws:s3:::my-bucket-eu-west-1",
"ReplicationTime": {
"Status": "Enabled",
"Time": {
"Minutes": 15
}
},
"Metrics": {
"Status": "Enabled",
"EventThreshold": {
"Minutes": 15
}
}
},
"DeleteMarkerReplication": {
"Status": "Enabled"
}
}
]
}
aws s3api put-bucket-replication \
--bucket my-bucket-us-east-1 \
--replication-configuration file://replication-config.json
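CRR also requires versioning on both the source and destination buckets. A boto3 sketch of the same setup (bucket names and the config file follow the example above):
import json
import boto3

# Versioning is a prerequisite for replication on both buckets
buckets = {'my-bucket-us-east-1': 'us-east-1', 'my-bucket-eu-west-1': 'eu-west-1'}
for bucket, region in buckets.items():
    boto3.client('s3', region_name=region).put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={'Status': 'Enabled'},
    )

# Apply the replication configuration shown above to the source bucket
with open('replication-config.json') as f:
    config = json.load(f)

boto3.client('s3', region_name='us-east-1').put_bucket_replication(
    Bucket='my-bucket-us-east-1',
    ReplicationConfiguration=config,
)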
S3 Replication Time Control (RTC):
- SLA: 99.99% of objects replicated within 15 minutes
- Cost: Additional $0.015/GB (on top of standard replication cost)
Cost:
- Replication (standard): $0.02/GB cross-region transfer
- RTC: Additional $0.015/GB
- Storage: Duplicate storage costs in destination region
- Total: $0.02/GB (standard) or $0.035/GB (RTC) + storage
When to use:
- Static assets (images, videos, documents)
- Backup and compliance (data durability across regions)
- Content distribution (serve assets from nearest region)
Traffic Routing with Route 53
Geolocation Routing
Use case: Route traffic based on user’s geographic location (country/continent).
Example: EU users → eu-west-1, US users → us-east-1, Asia users → ap-southeast-1
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "US",
"GeoLocation": {
"ContinentCode": "NA"
},
"AliasTarget": {
"HostedZoneId": "Z123456",
"DNSName": "us-alb.us-east-1.amazonaws.com"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "EU",
"GeoLocation": {
"ContinentCode": "EU"
},
"AliasTarget": {
"HostedZoneId": "Z789012",
"DNSName": "eu-alb.eu-west-1.amazonaws.com"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Default",
"GeoLocation": {
"CountryCode": "*"
},
"AliasTarget": {
"HostedZoneId": "Z123456",
"DNSName": "us-alb.us-east-1.amazonaws.com"
}
}
]
}
Considerations:
- Deterministic: User in Germany always routes to EU
- Health checks: If EU unhealthy, does NOT automatically failover to US (need failover routing)
- Use case: Compliance (EU data must stay in EU), content localization
Latency-Based Routing
Use case: Route traffic to region with lowest latency for user.
Example: User in London may route to us-east-1 if lower latency than eu-west-1 (rare, but possible during network issues).
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "US-EAST-1",
"Region": "us-east-1",
"AliasTarget": {
"HostedZoneId": "Z123456",
"DNSName": "us-alb.us-east-1.amazonaws.com"
},
"HealthCheckId": "health-check-us"
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "EU-WEST-1",
"Region": "eu-west-1",
"AliasTarget": {
"HostedZoneId": "Z789012",
"DNSName": "eu-alb.eu-west-1.amazonaws.com"
},
"HealthCheckId": "health-check-eu"
}
]
}
Behavior:
- Route 53 measures latency from user to each region
- Returns lowest-latency endpoint
- If endpoint unhealthy (health check fail), returns next-lowest-latency healthy endpoint
When to use:
- Performance optimization (no compliance restrictions)
- Automatic failover to healthy regions
- User base distributed globally
Geoproximity Routing
Use case: Route traffic based on geographic location + bias (shift traffic between regions).
Example: Shift a portion of US East Coast traffic from us-east-1 to eu-west-1 by biasing us-east-1 down and eu-west-1 up (load balancing).
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "US-EAST-1",
"GeoProximityLocation": {
"AWSRegion": "us-east-1",
"Bias": -20
},
"AliasTarget": {
"DNSName": "us-alb.us-east-1.amazonaws.com"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "EU-WEST-1",
"GeoProximityLocation": {
"AWSRegion": "eu-west-1",
"Bias": 20
},
"AliasTarget": {
"DNSName": "eu-alb.eu-west-1.amazonaws.com"
}
}
]
}
Bias values:
- Negative bias (-99 to -1): Reduce traffic to this region
- Zero bias (0): No adjustment (geographic proximity only)
- Positive bias (1 to 99): Increase traffic to this region
When to use:
- Load balancing between regions (shift traffic gradually)
- Testing new region with small percentage of traffic
- Cost optimization (shift traffic to cheaper region)
Failover Routing
Use case: Primary-secondary failover (active-passive pattern).
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"HealthCheckId": "health-check-primary",
"AliasTarget": {
"DNSName": "primary-alb.us-east-1.amazonaws.com"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"DNSName": "secondary-alb.us-west-2.amazonaws.com"
}
}
]
}
Behavior:
- Route 53 returns PRIMARY if health check passes
- Route 53 returns SECONDARY if PRIMARY health check fails
- When PRIMARY recovers, traffic shifts back
Health check configuration:
aws route53 create-health-check \
--caller-reference $(date +%s) \
--health-check-config \
Type=HTTPS,\
ResourcePath=/health,\
FullyQualifiedDomainName=primary-alb.us-east-1.amazonaws.com,\
Port=443,\
RequestInterval=30,\
FailureThreshold=3
Thresholds:
- Request interval: 10 or 30 seconds
- Failure threshold: Number of consecutive failures before marking unhealthy (1-10)
- Detection time: 30s interval × 3 failures = 90 seconds to detect failure
Multi-Region Application Architectures
Serverless Multi-Region (Lambda + API Gateway + DynamoDB)
Architecture:
Users (Global)
↓
Route 53 (Latency-based routing)
↓
┌─────────────────────────┬─────────────────────────┐
│ us-east-1 │ eu-west-1 │
│ API Gateway │ API Gateway │
│ ↓ │ ↓ │
│ Lambda │ Lambda │
│ ↓ │ ↓ │
│ DynamoDB ← Replication → DynamoDB │
│ Global Table │ Global Table │
└─────────────────────────┴─────────────────────────┘
Deployment with SAM:
# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Parameters:
Region:
Type: String
Default: us-east-1
Resources:
ApiGateway:
Type: AWS::Serverless::Api
Properties:
StageName: prod
EndpointConfiguration:
Type: REGIONAL
GetUserFunction:
Type: AWS::Serverless::Function
Properties:
CodeUri: functions/get-user/
Handler: app.lambda_handler
Runtime: python3.11
Environment:
Variables:
TABLE_NAME: !Ref UsersTable
REGION: !Ref Region
Events:
GetUser:
Type: Api
Properties:
RestApiId: !Ref ApiGateway
Path: /users/{id}
Method: GET
UsersTable:
Type: AWS::DynamoDB::GlobalTable
Properties:
TableName: users-global
BillingMode: PAY_PER_REQUEST
StreamSpecification:
StreamViewType: NEW_AND_OLD_IMAGES
AttributeDefinitions:
- AttributeName: user_id
AttributeType: S
KeySchema:
- AttributeName: user_id
KeyType: HASH
Replicas:
- Region: us-east-1
- Region: eu-west-1
Deployment script:
#!/bin/bash
# deploy-multi-region.sh
REGIONS=("us-east-1" "eu-west-1")
# Note: the AWS::DynamoDB::GlobalTable resource above manages replicas in both
# regions itself, so create it from one region only (its own stack, or behind a
# template condition); otherwise the second regional deploy will conflict with
# the already-existing global table.
for REGION in "${REGIONS[@]}"; do
echo "Deploying to $REGION..."
sam build
sam deploy \
--stack-name my-app-$REGION \
--region $REGION \
--parameter-overrides Region=$REGION \
--capabilities CAPABILITY_IAM \
--no-confirm-changeset
done
Benefits:
- Zero infrastructure management: Serverless scales automatically
- Cost-efficient: Pay per request (no idle capacity)
- Fast failover: Route 53 health checks detect API Gateway failure in <60s
Limitations:
- Cold starts: Lambda cold start adds 100-500ms latency
- DynamoDB eventual consistency: Replication lag <1s
- State management: Lambda stateless (use DynamoDB or ElastiCache for state)
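Cold starts can be mitigated with provisioned concurrency on latency-sensitive functions in each region. A minimal sketch; the function name, alias, and concurrency figure are illustrative, and provisioned environments are billed even while idle:
import boto3

# Keep pre-initialized execution environments warm in this region
lambda_client = boto3.client('lambda', region_name='us-east-1')

lambda_client.put_provisioned_concurrency_config(
    FunctionName='my-app-GetUserFunction',  # hypothetical function name
    Qualifier='prod',                       # a published alias or version
    ProvisionedConcurrentExecutions=20,
)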
Container Multi-Region (ECS + Aurora Global Database)
Architecture:
CloudFront (Global CDN)
↓
Route 53 (Geolocation routing)
↓
┌──────────────────────────────┬──────────────────────────────┐
│ us-east-1 │ eu-west-1 │
│ ALB │ ALB │
│ ↓ │ ↓ │
│ ECS Fargate (10 tasks) │ ECS Fargate (10 tasks) │
│ ↓ │ ↓ │
│ Aurora Global DB (Writer) ← Replication → Aurora (Reader) │
│ ElastiCache (Redis) │ ElastiCache (Redis) │
└──────────────────────────────┴──────────────────────────────┘
ECS service configuration:
{
"serviceName": "web-app",
"taskDefinition": "web-app:1",
"desiredCount": 10,
"launchType": "FARGATE",
"loadBalancers": [
{
"targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-app/abc123",
"containerName": "app",
"containerPort": 8080
}
],
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-12345", "subnet-67890"],
"securityGroups": ["sg-app"],
"assignPublicIp": "DISABLED"
}
},
"healthCheckGracePeriodSeconds": 60,
"deploymentConfiguration": {
"maximumPercent": 200,
"minimumHealthyPercent": 100,
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
}
}
}
Application connection logic (detect region, connect to local Aurora reader):
import os
import psycopg2
REGION = os.environ['AWS_REGION']
# Primary region uses writer, secondary regions use local reader
if REGION == 'us-east-1':
DB_ENDPOINT = 'my-global-db-cluster.cluster-xyz.us-east-1.rds.amazonaws.com' # Writer
else:
DB_ENDPOINT = f'my-global-db-cluster.cluster-xyz.{REGION}.rds.amazonaws.com' # Local reader
# For writes in secondary regions, route to primary writer
WRITER_ENDPOINT = 'my-global-db-cluster.cluster-xyz.us-east-1.rds.amazonaws.com'
# Read from local reader (low latency)
read_conn = psycopg2.connect(
host=DB_ENDPOINT,
database='mydb',
user='app_user',
password=os.environ['DB_PASSWORD']
)
# Write to primary writer (may have higher latency from secondary regions)
write_conn = psycopg2.connect(
host=WRITER_ENDPOINT,
database='mydb',
user='app_user',
password=os.environ['DB_PASSWORD']
)
Benefits:
- Consistent performance: ECS tasks always warm (no cold starts)
- Strong consistency: Aurora supports synchronous replication within region
- Session affinity: ALB sticky sessions support stateful apps
Limitations:
- Higher cost: ECS Fargate + Aurora more expensive than serverless
- Write latency: Secondary regions write to primary (added latency)
Consistency Models & Trade-offs
Strong Consistency (Aurora Global Database)
Guarantee: Reads and writes in the primary region are strongly consistent; reads in secondary regions may lag slightly behind the primary (eventual consistency).
Trade-offs:
- ✅ Primary writes: Strongly consistent (synchronous multi-AZ replication)
- ✅ Failover: Promotes secondary to primary with <1min RTO, <1s RPO
- ❌ Secondary writes: Not supported (read-only)
- ❌ Replication lag: Secondary reads may lag <1s behind primary
Use case: Financial transactions, inventory management (write centralization acceptable)
Eventual Consistency (DynamoDB Global Tables)
Guarantee: All regions eventually consistent (typically <1s), last-writer-wins conflict resolution.
Trade-offs:
- ✅ Multi-region writes: All regions writable (low-latency writes globally)
- ✅ Zero RTO: No failover needed (all regions active)
- ❌ Conflicts: Concurrent writes to same item = last-writer-wins (data loss possible)
- ❌ Consistency: Reads may return stale data (<1s lag)
Use case: User profiles, session state, content metadata (conflicts rare/acceptable)
Conflict example:
Scenario: User updates profile from phone (us-east-1) and laptop (eu-west-1) simultaneously
Phone (us-east-1, 10:00:00.100):
PUT /users/user123 { "email": "newemail@example.com" }
Laptop (eu-west-1, 10:00:00.150):
PUT /users/user123 { "phone": "+1-555-1234" }
# After replication:
# Conflict resolution is last-writer-wins at the item level, so the later write
# (laptop, 10:00:00.150) becomes the item's state in both regions: { "phone": "+1-555-1234" }.
# The phone's email change is lost rather than merged. To keep both updates, route all
# writes for a given item through a single "home" region, or model the fields as separate items.
Hybrid Model (Aurora Global + DynamoDB Global)
Strategy: Use Aurora for strongly consistent data (orders, payments), DynamoDB for eventually consistent data (user sessions, preferences).
Example architecture:
Orders Service (Aurora Global)
- Primary (us-east-1): Write orders
- Secondary (eu-west-1): Read orders (local performance)
Session Service (DynamoDB Global)
- All regions writable (low-latency session updates)
Cost Optimization Strategies
Multi-Region Cost Breakdown
Single-region baseline: $15,000/month
Active-passive (warm standby):
- Primary region: $15,000/month
- Secondary region: $3,000/month (10% capacity)
- Data replication: $500/month
- Total: $18,500/month (23% increase)
Active-active (multi-region):
- Region 1: $15,000/month
- Region 2: $15,000/month
- Data replication: $2,000/month
- Total: $32,000/month (113% increase)
Optimization Techniques
1. CloudFront for static assets:
Instead of replicating static assets (images, CSS, JS) to S3 in each region, use CloudFront with single-region origin.
Before:
- S3 in us-east-1: 1TB storage ($23/month)
- S3 in eu-west-1: 1TB storage ($23/month)
- CRR: 1TB replication ($20/month)
- Total: $66/month
After:
- S3 in us-east-1: 1TB storage ($23/month)
- CloudFront: 1TB data transfer ($85/month)
- Total: $108/month (higher raw cost in this example, but CloudFront egress largely replaces the S3 data-transfer-out you would pay anyway, adds edge caching, and removes the duplicate bucket and replication pipeline)
In practice, CloudFront with regional edge caching replaces S3 replication entirely for static assets.
2. Read replicas vs full multi-region:
For read-heavy workloads, use read replicas instead of active-active.
Full active-active:
- 2 regions × 100 EC2 instances = $30,000/month
Read replicas:
- Primary (us-east-1): 100 EC2 instances ($15,000/month)
- Read replica regions: 50 EC2 instances each ($7,500/month × 2 = $15,000/month)
- Total: $30,000/month
Savings: No compute savings, but simpler architecture (one writable region).
3. Scheduled scaling for secondary region:
If active-passive with predictable failover testing schedule, scale down secondary except during tests.
Normal operation:
- Secondary: 10 instances ($1,500/month)
Failover test (1 day/month):
- Secondary: scaled to 100 instances for the day (~$500 at ~$5/instance-day, ~$450 of it incremental)
Total secondary cost: ~$1,950/month vs $15,000/month for an always-full secondary (~87% savings)
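The scale-up and scale-down around a planned drill can be pre-scheduled. A sketch using Auto Scaling scheduled actions; the group name and dates are illustrative:
import boto3
from datetime import datetime, timezone

autoscaling = boto3.client('autoscaling', region_name='us-west-2')

# Bring the warm standby to full capacity shortly before the planned drill...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='web-app-secondary-asg',
    ScheduledActionName='failover-drill-scale-up',
    StartTime=datetime(2025, 7, 1, 8, 0, tzinfo=timezone.utc),
    MinSize=100, MaxSize=120, DesiredCapacity=100,
)

# ...and back down to warm-standby levels afterwards
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='web-app-secondary-asg',
    ScheduledActionName='failover-drill-scale-down',
    StartTime=datetime(2025, 7, 2, 8, 0, tzinfo=timezone.utc),
    MinSize=10, MaxSize=20, DesiredCapacity=10,
)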
4. Fargate Spot for secondary regions:
Use Fargate Spot (70% discount) in secondary regions.
Primary (us-east-1):
- Fargate: 100 vCPUs × $0.04/hour = $2,880/month
Secondary (eu-west-1):
- Fargate Spot: 100 vCPUs × $0.012/hour = $864/month
Savings: $2,016/month (70%) on secondary compute
Risk: Fargate Spot interruptions (2-minute notice) - acceptable for secondary region.
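A sketch of placing the secondary-region service on Fargate Spot through a capacity provider strategy (cluster, service, and subnet values are illustrative, and the cluster must already have the FARGATE and FARGATE_SPOT capacity providers associated); a small FARGATE base keeps an on-demand floor under Spot interruptions:
import boto3

ecs = boto3.client('ecs', region_name='eu-west-1')

# Secondary-region service: mostly Fargate Spot, with a small on-demand floor
ecs.create_service(
    cluster='web-app-secondary',
    serviceName='web-app',
    taskDefinition='web-app:1',
    desiredCount=10,
    capacityProviderStrategy=[
        {'capacityProvider': 'FARGATE', 'base': 2, 'weight': 1},
        {'capacityProvider': 'FARGATE_SPOT', 'weight': 4},
    ],
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-12345', 'subnet-67890'],
            'securityGroups': ['sg-app'],
            'assignPublicIp': 'DISABLED',
        }
    },
)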
Data Transfer Cost Optimization
Cross-region data transfer: $0.02/GB
Example application:
- Database replication: 100GB/day × 30 days = 3TB/month × $0.02 = $60/month
- Application inter-region calls: 50GB/day × 30 days = 1.5TB/month × $0.02 = $30/month
- S3 CRR: 200GB/day × 30 days = 6TB/month × $0.02 = $120/month
- Total: $210/month
Optimization:
- Compress data: Gzip/Brotli reduce transfer by 60-80% ($210 → $42-84/month)
- Batch replication: Replicate every 5 minutes vs real-time (reduce overhead)
- Filter replication: Only replicate critical data (not logs, temp files)
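A rough sketch of compressing a payload before cross-region transfer; actual savings depend entirely on how compressible the data is:
import gzip
import json

def compress_payload(records):
    # Gzip a JSON payload before shipping it to the peer region
    raw = json.dumps(records).encode('utf-8')
    compressed = gzip.compress(raw, compresslevel=6)
    print(f'{len(raw)} -> {len(compressed)} bytes '
          f'({1 - len(compressed) / len(raw):.0%} smaller)')
    return compressed

def decompress_payload(blob):
    return json.loads(gzip.decompress(blob))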
Monitoring & Observability
Multi-Region CloudWatch Dashboard
Metrics to track:
Metrics:
- Region: us-east-1
- EC2: CPUUtilization, NetworkIn, NetworkOut
- ALB: TargetResponseTime, HealthyHostCount, UnHealthyHostCount
- RDS: CPUUtilization, DatabaseConnections, ReplicaLag
- Route53: HealthCheckStatus
- Region: eu-west-1
- (Same metrics as us-east-1)
- Cross-Region:
- Route53: DNSQueries (by region)
- Aurora: AuroraGlobalDBReplicationLag
- DynamoDB: ReplicationLatency
Cross-region dashboard:
{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["AWS/RDS", "AuroraGlobalDBReplicationLag", {"region": "us-east-1", "stat": "Maximum"}],
["...", {"region": "eu-west-1"}]
],
"period": 60,
"stat": "Maximum",
"title": "Aurora Global DB Replication Lag"
}
},
{
"type": "metric",
"properties": {
"metrics": [
["AWS/Route53", "HealthCheckStatus", {"region": "us-east-1"}],
["...", {"region": "eu-west-1"}]
],
"period": 60,
"stat": "Average",
"title": "Route 53 Health Checks"
}
}
]
}
Alarms for Multi-Region Health
Primary region unhealthy:
aws cloudwatch put-metric-alarm \
--alarm-name primary-region-unhealthy \
--alarm-description "Primary region health check failed" \
--metric-name HealthCheckStatus \
--namespace AWS/Route53 \
--statistic Minimum \
--period 60 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--datapoints-to-alarm 2 \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
Replication lag high:
aws cloudwatch put-metric-alarm \
--alarm-name aurora-replication-lag-high \
--alarm-description "Aurora Global DB replication lag > 5 seconds" \
--metric-name AuroraGlobalDBReplicationLag \
--namespace AWS/RDS \
--statistic Maximum \
--period 60 \
--threshold 5000 \
--comparison-operator GreaterThanThreshold \
--datapoints-to-alarm 3 \
--evaluation-periods 3 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
Distributed Tracing with X-Ray
Multi-region tracing:
import os

from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # instrument supported libraries (boto3, database drivers, etc.)

@xray_recorder.capture('get_user')
def get_user(user_id):
    # Read from the local region's Aurora reader; DB_ENDPOINT and db come from
    # the connection setup shown earlier
    local_region = os.environ['AWS_REGION']
    xray_recorder.put_annotation('region', local_region)
    xray_recorder.put_metadata('db_endpoint', DB_ENDPOINT)
    user = db.query('SELECT * FROM users WHERE id = %s', [user_id])
    return user
X-Ray service map shows:
- Request originated in us-east-1
- Read from us-east-1 Aurora reader (5ms)
- vs Request originated in eu-west-1
- Read from eu-west-1 Aurora reader (5ms) vs cross-region to us-east-1 writer (50ms)
When to Use Multi-Region vs Single-Region
Use Multi-Region When:
1. High availability requirements (99.99%+):
- Single-region multi-AZ: 99.95% SLA
- Multi-region: 99.99%+ achievable (regional failures don’t cause outage)
2. Global user base with latency requirements:
- User in Sydney: 250ms to us-east-1, 50ms to ap-southeast-2
- Multi-region reduces latency 80% (250ms → 50ms)
3. Compliance and data residency:
- GDPR: EU data must stay in EU
- Data sovereignty laws: China, Russia require in-country data storage
4. Disaster recovery with aggressive RTO/RPO:
- Multi-region active-active: RTO/RPO near-zero
- Single-region backup/restore: RTO 4-8 hours, RPO 1-24 hours
Use Single-Region When:
1. Cost-sensitive with acceptable availability:
- Multi-AZ: 99.95% SLA sufficient for most workloads
- Multi-region: 2-3× cost for 0.04% availability improvement
2. Regulatory restrictions:
- Data MUST NOT leave specific region (compliance)
- Multi-region replication violates policy
3. Simple architecture preferred:
- Multi-region adds complexity: data consistency, deployment orchestration, monitoring
- Small team may not have expertise/capacity
4. Low user concurrency:
- <10K users, <1000 RPS
- Single-region handles easily, multi-region overkill
Decision matrix:
| Requirement | Single-Region | Multi-Region |
|---|---|---|
| Availability SLA | 99.9-99.95% | 99.99%+ |
| Latency (global users) | 100-300ms | 10-50ms |
| RTO/RPO | 1-4 hours / 5-15 min | <1 min / <1s |
| Cost | Baseline | 2-3× |
| Complexity | Low | High |
| Compliance (data residency) | Limited | Flexible |
Common Pitfalls
1. Not Testing Failover Regularly
Problem: Multi-region is deployed but failover is never tested. During a real outage, you discover the Route 53 health check is misconfigured and the secondary region can't handle the load.
Solution: Quarterly failover drills:
- Simulate primary region failure (block health check endpoint)
- Verify Route 53 fails over to secondary
- Verify secondary scales to handle load
- Measure actual RTO/RPO vs targets
- Restore primary, fail back
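One way to script the first step without touching production infrastructure is to temporarily invert the primary health check so Route 53 treats a healthy endpoint as failed; the health check ID below is hypothetical:
import boto3

route53 = boto3.client('route53', region_name='us-east-1')  # Route 53 is a global service

PRIMARY_HEALTH_CHECK_ID = 'abcd1234-ab12-cd34-ef56-abcdef123456'  # hypothetical

def start_drill():
    # Inverting the check makes Route 53 report it as failed, triggering failover routing
    route53.update_health_check(HealthCheckId=PRIMARY_HEALTH_CHECK_ID, Inverted=True)

def end_drill():
    route53.update_health_check(HealthCheckId=PRIMARY_HEALTH_CHECK_ID, Inverted=False)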
2. Ignoring Data Replication Lag
Problem: Aurora Global DB lag is typically <1s, but under heavy write load it can grow to 10-30s, so the secondary region serves increasingly stale data.
Solution: Monitor replication lag, alert if exceeds threshold:
aws cloudwatch put-metric-alarm \
--alarm-name replication-lag-high \
--metric-name AuroraGlobalDBReplicationLag \
--threshold 5000 \
--comparison-operator GreaterThanThreshold
3. Forgetting DNS TTL
Problem: Route 53 health check detects failure, switches to secondary, but clients cached DNS response (60s TTL). Failover takes 60+ seconds instead of 10s.
Solution: Lower TTL for critical records:
{
"TTL": 10,
"ResourceRecords": [{"Value": "primary-alb.us-east-1.amazonaws.com"}]
}
Trade-off: Lower TTL = more Route 53 queries = higher cost ($0.40 per million queries)
4. Not Accounting for Cross-Region Latency
Problem: Applications in the secondary region must send writes to the primary Aurora writer (a request handled in eu-west-1 round-trips to us-east-1 for the write). Write latency is 100ms+ vs ~5ms locally.
Solution: Accept write latency in secondary regions, or use DynamoDB Global Tables (multi-master).
5. Underprovisioning Secondary Region
Problem: Primary region has 100 EC2 instances, secondary has 10 (planning to scale up during failover). Failover happens, ASG takes 10 minutes to scale up, users experience errors.
Solution: Keep secondary at 50-100% capacity (accept higher cost), or pre-warm ASG with scheduled scaling before planned failover tests.
6. Not Isolating Failure Domains
Problem: Both regions share single AWS account. Account-level issue (IAM, billing problem) affects both regions.
Solution: Deploy regions in separate AWS accounts (multi-account strategy):
- us-east-1 in Account A
- eu-west-1 in Account B
- Isolates account-level failures
7. Hardcoding Region in Application
Problem:
DB_ENDPOINT = 'my-db.us-east-1.rds.amazonaws.com' # ❌ Hardcoded
Application deployed to eu-west-1 still connects to us-east-1 (cross-region latency).
Solution: Use environment variables, detect region dynamically:
REGION = os.environ['AWS_REGION']
DB_ENDPOINT = f'my-db.{REGION}.rds.amazonaws.com'
Key Takeaways
Multi-Region Patterns:
- Active-Passive (Warm Standby): Primary 100% traffic, secondary scaled down (10-50% capacity), failover via Route 53 - RTO 5-30min, RPO 5-15min, Cost 1.2-1.5×
- Active-Active: Both regions serve traffic, full capacity in each - RTO/RPO near-zero, Cost 2-2.5×
- Read Replicas: Primary handles writes, secondaries handle reads - RTO 5-15min, RPO <1min, Cost 1.5-2×
Data Replication:
- Aurora Global Database: Physical replication, <1s lag, fast failover (<1min), read-only secondaries, $0.02/GB cross-region transfer
- DynamoDB Global Tables: Multi-master, <1s lag, last-writer-wins conflict resolution, all regions writable, write cost multiplied per replica region
- S3 CRR: Object replication, <15min lag (99.99% SLA with RTC), $0.02-0.035/GB
Traffic Routing:
- Geolocation: Route by user location (compliance, content localization), deterministic
- Latency-based: Route to lowest-latency region, performance optimization, automatic failover
- Geoproximity: Geographic + bias (shift traffic gradually for load balancing/testing)
- Failover: Primary-secondary with health checks, 60-90s detection time
Consistency Models:
- Strong consistency: Aurora Global DB (primary), single writable region, read replicas eventually consistent
- Eventual consistency: DynamoDB Global Tables, multi-master, <1s lag, last-writer-wins
- Hybrid: Aurora for transactions (strong), DynamoDB for sessions (eventual)
Cost Optimization:
- CloudFront for static assets: Eliminate S3 replication costs, global edge caching
- Read replicas vs active-active: Lower complexity, same compute cost, fewer writable regions
- Fargate Spot for secondary: 70% discount on secondary region compute (accept interruption risk)
- Compress data: 60-80% reduction in cross-region transfer costs ($0.02/GB)
Monitoring:
- Replication lag: Alert if Aurora replication >5s, DynamoDB >2s
- Health checks: Route 53 health checks every 10-30s, 1-10 failures before marking unhealthy
- Distributed tracing: X-Ray shows cross-region latency, identify bottlenecks
When to Use:
- Multi-region: 99.99%+ SLA, global latency optimization, data residency compliance, aggressive DR (RTO/RPO <1min)
- Single-region: 99.9-99.95% SLA acceptable, cost-sensitive, low user concurrency, simple architecture
Common Pitfalls:
- Not testing failover quarterly (discover issues during real outage)
- Ignoring replication lag during peak load (stale data served)
- High DNS TTL prevents fast failover (60s cache vs 10s optimal)
- Underprovisioning secondary region (ASG scale-up takes 10+ minutes during failover)
- Hardcoding region in application code (cross-region latency for all calls)