Disaster Recovery on AWS for System Architects
Table of Contents
- What Problems Disaster Recovery Solves
- RTO and RPO Fundamentals
- DR Strategy Selection Framework
- Backup and Restore
- Pilot Light
- Warm Standby
- Multi-Site Active-Active
- Data Backup Strategies
- Testing and Validation
- Cost Optimization
- When to Use Which DR Strategy
- Common Pitfalls
- Key Takeaways
What Problems Disaster Recovery Solves
Disasters that cause data loss or downtime:
- Regional failures: AWS region outages (rare but happen)
- Example: US-EAST-1 outage December 2021 (7+ hours)
- Impact: 100% downtime if single-region, no DR
- Data corruption: Application bugs, ransomware, accidental deletions
- Example: DELETE FROM users (oops, forgot the WHERE clause)
- Impact: Data loss, cannot restore without backups
- Natural disasters: Hurricanes, earthquakes, floods affecting data centers
- Example: Hurricane affecting entire AWS region
- Impact: Extended outage if data center physically damaged
- Human error: Misconfigurations, accidental resource deletions
- Example: Terraform destroy in production instead of staging
- Impact: Infrastructure destroyed, requires rebuild from backups
- Security incidents: Ransomware, data breach, malicious deletion
- Example: Attacker gains admin access, deletes all RDS databases
- Impact: Data loss, compliance violations, business continuity risk
DR solves:
- Business continuity: Continue operating during/after disaster
- Data protection: Restore data to point before corruption/loss
- Compliance: Meet regulatory requirements for data retention and recovery
- Risk mitigation: Reduce financial impact of downtime
RTO and RPO Fundamentals
Definitions
Recovery Time Objective (RTO): Maximum acceptable downtime
Disaster occurs → Recovery begins → Systems operational
|<-------------- RTO ------------->|
Example: “Our RTO is 4 hours” means systems must be operational within 4 hours of disaster.
Recovery Point Objective (RPO): Maximum acceptable data loss
Last successful backup → Disaster occurs
|<------- RPO ------->|
Data created in this window is lost
Example: “Our RPO is 1 hour” means we can tolerate losing up to 1 hour of data.
Cost-Complexity Spectrum
| RTO | RPO | DR Strategy | Complexity | Cost (relative to backup/restore baseline) |
|---|---|---|---|---|
| 24h | 24h | Backup & Restore | Low | 1× (baseline) |
| 4-12h | 1-4h | Pilot Light | Medium | 2-3× |
| 1-4h | 5-15min | Warm Standby | Medium-High | 3-5× |
| <1h | <1min | Multi-Site Active-Active | High | 6-10× |
Key insight: Aggressive RTO/RPO targets exponentially increase cost and complexity.
Business Impact Analysis
Determine appropriate RTO/RPO:
- Calculate downtime cost:
- Revenue per hour: $50,000/hour
- Lost customers per hour of downtime: 100
- Customer lifetime value: $1,000
- Reputation damage: Difficult to quantify but significant
Cost of 1-hour outage: $50,000 (revenue) + $100,000 (lost customers) = $150,000
- Evaluate data loss impact:
- Financial transactions: Cannot lose any (RPO near-zero)
- User-generated content: Can tolerate 15 minutes (RPO 15min)
- Logs and analytics: Can tolerate 24 hours (RPO 24h)
- Balance cost vs risk:
- Multi-site active-active: $30,000/month to prevent $150,000 loss
- ROI: $30K/month = $360K/year to protect against $150K per outage
- If outages happen 1/year: Not worth it ($360K > $150K)
- If outages happen 3+/year: Worth it ($360K < $450K)
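The break-even arithmetic is easy to script. Below is a minimal Python sketch using the illustrative figures from this section (revenue per hour, customers lost, customer lifetime value, DR spend); every number is an assumption to be replaced with your own business inputs.
def outage_cost(hours, revenue_per_hour=50_000, customers_lost_per_hour=100,
                customer_lifetime_value=1_000):
    """Estimated cost of a single outage of the given duration (illustrative inputs)."""
    return hours * (revenue_per_hour + customers_lost_per_hour * customer_lifetime_value)

def dr_breakeven(dr_monthly_cost, outage_cost_per_incident, outages_per_year):
    """True if annual DR spend is below the expected annual outage losses."""
    annual_dr_cost = dr_monthly_cost * 12
    expected_annual_loss = outage_cost_per_incident * outages_per_year
    return annual_dr_cost < expected_annual_loss, annual_dr_cost, expected_annual_loss

# Example: multi-site active-active at $30,000/month vs a $150,000 one-hour outage
worth_it, spend, loss = dr_breakeven(30_000, outage_cost(1), outages_per_year=1)
print(worth_it, spend, loss)  # False, 360000, 150000 -> not worth it at 1 outage/year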
DR Strategy Selection Framework
Decision Matrix
Step 1: Determine RTO/RPO requirements (from business impact analysis)
Step 2: Select DR strategy:
┌─────────────────────────────────────────────────────────────┐
│ DR Strategy Selection │
├─────────────────┬───────────┬──────────┬─────────────────────┤
│ Business Need │ RTO │ RPO │ Recommended Strategy│
├─────────────────┼───────────┼──────────┼─────────────────────┤
│ Dev/Test │ Days │ 24h │ Backup & Restore │
│ Internal Tools │ 12-24h │ 4-24h │ Backup & Restore │
│ Business Apps │ 4-12h │ 1-4h │ Pilot Light │
│ Customer-Facing │ 1-4h │ 5-15min │ Warm Standby │
│ Mission-Critical│ <1h │ <1min │ Multi-Site Active │
└─────────────────┴───────────┴──────────┴─────────────────────┘
Step 3: Validate cost-benefit:
- Estimated DR cost per month: $X
- Estimated outage cost: $Y per incident
- Expected outage frequency: Z per year
- Breakeven: $X × 12 months = $Y × Z incidents
If DR cost < outage cost × frequency: Invest in DR
If DR cost > outage cost × frequency: Accept risk or choose a cheaper DR strategy
Backup and Restore
Pattern: Regular backups stored offsite, restore when needed.
RTO: 12-24 hours (time to provision infrastructure + restore data)
RPO: 1-24 hours (time since last backup)
Cost: Baseline (1×) - only pay for storage
Architecture
Production (us-east-1)
├── EC2 instances → AMIs (daily snapshots)
├── RDS database → Automated backups (daily) + Manual snapshots (weekly)
├── S3 buckets → Versioning enabled, lifecycle policies
└── EBS volumes → Daily snapshots
Backups stored in:
├── Same region (different AZ) - Fast restore, same-region risk
└── Cross-region (us-west-2) - Slow restore, regional failure protection
Implementation
RDS automated backups:
# Enable automated backups (retention 7-35 days)
aws rds modify-db-instance \
--db-instance-identifier mydb \
--backup-retention-period 7 \
--preferred-backup-window "03:00-04:00" \
--region us-east-1
# Copy automated backup to another region
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snapshot \
--target-db-snapshot-identifier mydb-snapshot-dr \
--source-region us-east-1 \
--region us-west-2
EBS snapshots (automated with Data Lifecycle Manager):
{
"PolicyDetails": {
"ResourceTypes": ["VOLUME"],
"TargetTags": [
{"Key": "Backup", "Value": "true"}
],
"Schedules": [
{
"Name": "Daily Snapshots",
"CreateRule": {
"Interval": 24,
"IntervalUnit": "HOURS",
"Times": ["03:00"]
},
"RetainRule": {
"Count": 7
},
"CopyTags": true,
"CrossRegionCopyRules": [
{
"TargetRegion": "us-west-2",
"Encrypted": true,
"RetainRule": {
"Interval": 7,
"IntervalUnit": "DAYS"
}
}
]
}
]
}
}
S3 cross-region replication (for backup):
{
"Role": "arn:aws:iam::123456789012:role/s3-replication-role",
"Rules": [
{
"Status": "Enabled",
"Priority": 1,
"Filter": {},
"Destination": {
"Bucket": "arn:aws:s3:::myapp-backup-us-west-2",
"ReplicationTime": {
"Status": "Enabled",
"Time": {"Minutes": 15}
}
},
"DeleteMarkerReplication": {"Status": "Enabled"}
}
]
}
Infrastructure as Code (backup via Git):
# Store Terraform/CloudFormation in Git
git commit -am "Production infrastructure state"
git push origin main
# In DR region, recreate from IaC
git clone https://github.com/myorg/infrastructure
cd infrastructure
terraform init
terraform plan -var-file=dr-region.tfvars
terraform apply
Recovery Process
Scenario: US-EAST-1 region failure, recover to US-WEST-2
- Provision infrastructure (4-8 hours):
# Launch EC2 instances from AMI snapshots copied to us-west-2
aws ec2 run-instances \
  --image-id ami-backup-12345 \
  --instance-type m5.large \
  --count 10 \
  --region us-west-2
- Restore database (2-6 hours):
# Restore RDS from snapshot in us-west-2
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-restored \
  --db-snapshot-identifier mydb-snapshot-dr \
  --region us-west-2
- Update DNS (5-60 minutes):
# Update Route 53 to point to us-west-2 resources
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "AliasTarget": {
          "DNSName": "us-west-2-alb.amazonaws.com",
          "HostedZoneId": "Z789012",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
- Verify and test (1-2 hours):
- Smoke tests: Critical user journeys functional
- Data integrity: Recent backup data present
- Performance: Systems handling expected load
Total RTO: 8-17 hours (infrastructure + database + DNS + testing)
Cost
Storage costs:
- RDS backups: Free up to the provisioned database storage size (within the retention period)
- EBS snapshots: $0.05/GB-month (incremental)
- S3 CRR: $0.023/GB-month (Standard) + $0.02/GB replication
Example:
- RDS: 500GB database, 7-day retention = $0 (included)
- EBS: 1TB snapshots (incremental 20% change/day) = 200GB × $0.05 = $10/month
- S3: 10TB data replicated to DR region = 10TB × $0.023 = $230/month + 10TB × $0.02 = $200 replication
- Total: $440/month
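For reference, the same storage estimate expressed as a small Python helper; the per-GB rates are the illustrative prices quoted above, not authoritative pricing.
def backup_storage_cost(ebs_changed_gb, s3_replicated_gb,
                        ebs_snapshot_rate=0.05,    # $/GB-month EBS snapshots, illustrative
                        s3_storage_rate=0.023,     # $/GB-month S3 Standard, illustrative
                        s3_transfer_rate=0.02):    # $/GB cross-region replication, illustrative
    """Rough monthly backup storage cost for the example above."""
    ebs = ebs_changed_gb * ebs_snapshot_rate
    s3 = s3_replicated_gb * (s3_storage_rate + s3_transfer_rate)
    return ebs + s3

# 200 GB of changed EBS blocks + 10 TB replicated to the DR region
print(backup_storage_cost(200, 10_000))  # 440.0 ($/month), matching the example above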
When to use:
- Non-critical applications (tolerate 12-24h RTO)
- Infrequent disasters (cost of outage < cost of more expensive DR)
- Development/test environments
Pilot Light
Pattern: Minimal infrastructure running in DR region, scaled up during disaster.
RTO: 4-12 hours (time to scale up infrastructure + restore data)
RPO: 5-60 minutes (near-real-time data replication)
Cost: 2-3× backup/restore (DR region has minimal compute + data replication)
Architecture
Primary (us-east-1)
├── EC2 Auto Scaling Group (20 instances)
├── RDS Multi-AZ (db.r5.2xlarge)
├── ElastiCache (3-node cluster)
└── Application Load Balancer
DR Region (us-west-2) - "Pilot Light"
├── EC2 Auto Scaling Group (2 instances minimum, scaled down)
├── RDS Read Replica (db.r5.2xlarge) ← Continuous replication
├── ElastiCache (1-node cluster)
└── Application Load Balancer (unhealthy targets, not receiving traffic)
Route 53 Failover Routing
├── Primary: us-east-1 ALB (health check passing)
└── Secondary: us-west-2 ALB (only used if primary fails)
Key concept: Core infrastructure (database) continuously replicating, compute minimal but ready to scale.
Implementation
RDS cross-region read replica:
# Create read replica in DR region
aws rds create-db-instance-read-replica \
--db-instance-identifier mydb-replica-us-west-2 \
--source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:mydb \
--db-instance-class db.r5.2xlarge \
--region us-west-2
Auto Scaling Group (DR region, scaled down):
{
"AutoScalingGroupName": "app-asg-dr",
"MinSize": 2,
"MaxSize": 20,
"DesiredCapacity": 2,
"LaunchTemplate": {
"LaunchTemplateId": "lt-12345",
"Version": "$Latest"
},
"VPCZoneIdentifier": "subnet-dr-1,subnet-dr-2",
"TargetGroupARNs": [
"arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/app-tg-dr/abc123"
]
}
Route 53 failover routing:
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"HealthCheckId": "health-check-us-east-1",
"AliasTarget": {
"DNSName": "app-alb-us-east-1.amazonaws.com",
"HostedZoneId": "Z123456"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"DNSName": "app-alb-us-west-2.amazonaws.com",
"HostedZoneId": "Z789012"
}
}
]
}
Recovery Process
Scenario: US-EAST-1 region failure
- Promote read replica to standalone (5-10 minutes):
aws rds promote-read-replica \
  --db-instance-identifier mydb-replica-us-west-2 \
  --region us-west-2
- Scale up Auto Scaling Group (10-20 minutes):
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name app-asg-dr \
  --desired-capacity 20 \
  --region us-west-2
- Route 53 automatic failover (1-3 minutes):
- Health check detects primary failure
- Traffic automatically routed to secondary (us-west-2)
- Verify and test (30-60 minutes):
- Database promotion complete (writable)
- ASG scaled to full capacity
- Application functional
Total RTO: 45 minutes to 1.5 hours in this well-rehearsed example; plan against the 4-12 hour pilot-light range to allow for data validation and manual steps.
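The promotion and scale-up steps above lend themselves to scripting. A minimal boto3 sketch, assuming the resource names used in this example (mydb-replica-us-west-2, app-asg-dr) and a 20-instance production capacity:
import boto3

DR_REGION = "us-west-2"

def fail_over_pilot_light():
    rds = boto3.client("rds", region_name=DR_REGION)
    autoscaling = boto3.client("autoscaling", region_name=DR_REGION)

    # Step 1: promote the cross-region read replica to a standalone, writable instance
    rds.promote_read_replica(DBInstanceIdentifier="mydb-replica-us-west-2")

    # Step 2: scale the pilot-light Auto Scaling group to full production capacity
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="app-asg-dr",
        DesiredCapacity=20,
        HonorCooldown=False,
    )

    # Step 3 (Route 53 failover) happens automatically via the health check;
    # wait until the promoted instance is available before running smoke tests.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier="mydb-replica-us-west-2")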
Cost
Primary region: $15,000/month
DR region:
- EC2: 2 instances (vs 20) = $300/month
- RDS read replica: $1,500/month (same size as primary)
- ElastiCache: $100/month (1 node vs 3)
- Data replication: $500/month
- Total DR: $2,400/month
Total cost: $17,400/month (16% increase over single-region)
When to use:
- Business-critical applications (4-12h RTO acceptable)
- Budget-conscious (2× vs 6× for active-active)
- Infrequent failovers (quarterly testing acceptable)
Warm Standby
Pattern: Scaled-down but fully functional environment in DR region, scaled up during disaster.
RTO: 1-4 hours (time to scale up to full capacity)
RPO: 5-15 minutes (near-real-time replication)
Cost: 3-5× backup/restore (DR region running at 20-50% capacity)
Architecture
Primary (us-east-1) - 100% capacity
├── EC2 Auto Scaling Group (100 instances)
├── RDS Multi-AZ (db.r5.4xlarge)
├── ElastiCache (6-node cluster)
└── ALB (100% traffic)
DR Region (us-west-2) - 30% capacity
├── EC2 Auto Scaling Group (30 instances, scaled to 100 during failover)
├── Aurora Global Database (read replica, promoted to writer during failover)
├── ElastiCache (2-node cluster, scaled to 6 during failover)
└── ALB (receiving 0% traffic, but healthy and ready)
Route 53 Latency-Based Routing (or Failover)
├── us-east-1: 100% traffic (lower latency for US users)
└── us-west-2: 0% traffic (automatic failover if us-east-1 unhealthy)
Key differences from Pilot Light:
- Warm standby: DR region running at 30-50% capacity (not minimal)
- Pilot light: DR region minimal capacity (10% or less)
Implementation
Aurora Global Database (better suited to warm standby than RDS cross-region read replicas: storage-level replication with sub-second lag and fast regional promotion):
# Create global database
aws rds create-global-cluster \
--global-cluster-identifier myapp-global \
--engine aurora-postgresql \
--engine-version 14.6
# Primary cluster (us-east-1)
aws rds create-db-cluster \
--db-cluster-identifier myapp-primary \
--engine aurora-postgresql \
--global-cluster-identifier myapp-global \
--region us-east-1
# Secondary cluster (us-west-2)
aws rds create-db-cluster \
--db-cluster-identifier myapp-secondary \
--engine aurora-postgresql \
--global-cluster-identifier myapp-global \
--region us-west-2
Auto Scaling with scheduled scaling (DR region):
{
"ScheduledActions": [
{
"ScheduledActionName": "scale-up-for-failover-test",
"Schedule": "0 9 * * 1",
"MinSize": 100,
"MaxSize": 150,
"DesiredCapacity": 100
},
{
"ScheduledActionName": "scale-down-after-test",
"Schedule": "0 18 * * 1",
"MinSize": 30,
"MaxSize": 100,
"DesiredCapacity": 30
}
]
}
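The JSON above is a logical view of the schedule; with boto3 the equivalent scheduled actions could be created roughly as follows (group name and cron expressions are the assumptions from the example):
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# Scale the DR Auto Scaling group up for the Monday failover-test window...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="app-asg-dr",
    ScheduledActionName="scale-up-for-failover-test",
    Recurrence="0 9 * * 1",     # 09:00 UTC every Monday
    MinSize=100,
    MaxSize=150,
    DesiredCapacity=100,
)

# ...and back down to warm-standby capacity afterwards
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="app-asg-dr",
    ScheduledActionName="scale-down-after-test",
    Recurrence="0 18 * * 1",    # 18:00 UTC every Monday
    MinSize=30,
    MaxSize=100,
    DesiredCapacity=30,
)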
CloudWatch alarm to auto-scale during failover:
aws cloudwatch put-metric-alarm \
--alarm-name primary-region-down-scale-dr \
--alarm-description "Scale DR region when primary fails" \
--metric-name HealthCheckStatus \
--namespace AWS/Route53 \
--statistic Minimum \
--period 60 \
--threshold 0 \
--comparison-operator LessThanThreshold \
--datapoints-to-alarm 3 \
--evaluation-periods 3 \
--alarm-actions arn:aws:sns:us-west-2:123456789012:scale-dr-region
SNS topic triggers Lambda to scale DR:
import boto3
def lambda_handler(event, context):
autoscaling = boto3.client('autoscaling', region_name='us-west-2')
rds = boto3.client('rds', region_name='us-west-2')
# Scale EC2 Auto Scaling Group to full capacity
autoscaling.set_desired_capacity(
AutoScalingGroupName='app-asg-dr',
DesiredCapacity=100
)
# Promote Aurora secondary to primary (writable)
rds.failover_global_cluster(
GlobalClusterIdentifier='myapp-global',
TargetDbClusterIdentifier='myapp-secondary'
)
return {'statusCode': 200, 'body': 'DR failover initiated'}
Recovery Process
Automated failover (triggered by Route 53 health check failure):
- Route 53 detects primary failure (1-3 minutes)
- SNS triggers Lambda (30 seconds)
- Lambda promotes Aurora secondary (30-60 seconds)
- Lambda scales EC2 ASG (5-10 minutes to reach full capacity)
- Route 53 shifts traffic to us-west-2 (1-3 minutes DNS propagation)
Total RTO: 10-18 minutes (mostly ASG scale-up time)
Manual verification:
- Confirm Aurora promotion successful (now writable)
- Verify ASG reached desired capacity
- Run smoke tests on critical user journeys
Cost
Primary region: $15,000/month
DR region (30% capacity):
- EC2: 30 instances (vs 100) = $4,500/month
- Aurora Global DB: $1,800/month (same size as primary, continuous replication)
- ElastiCache: $200/month (2 nodes vs 6)
- Data replication: $800/month
- Total DR: $7,300/month
Total cost: $22,300/month (49% increase over single-region)
When to use:
- Customer-facing applications (1-4h RTO target)
- Can justify 50% cost increase for faster recovery
- Frequent failover testing (monthly) acceptable
Multi-Site Active-Active
Pattern: Full production capacity in multiple regions, all serving traffic simultaneously.
RTO: <1 hour (typically seconds to minutes, automatic failover)
RPO: <1 minute (continuous replication with minimal lag)
Cost: 6-10× backup/restore (full capacity in 2+ regions)
Architecture
Users (Global)
↓
Route 53 Geolocation / Latency-Based Routing
↓
┌──────────────────────────────┬──────────────────────────────┐
│ us-east-1 (50% traffic) │ eu-west-1 (50% traffic) │
│ ├── EC2 ASG (100 instances)│ ├── EC2 ASG (100 instances)│
│ ├── Aurora Global (writer) │ ├── Aurora Global (writer) │
│ ├── DynamoDB Global Table │ ├── DynamoDB Global Table │
│ ├── ElastiCache (6 nodes) │ ├── ElastiCache (6 nodes) │
│ └── ALB (healthy) │ └── ALB (healthy) │
└──────────────────────────────┴──────────────────────────────┘
↑ ↑
└───── Cross-Region Replication ───┘
(Aurora, DynamoDB, S3, ElastiCache Global Datastore)
Key characteristics:
- Both regions serve production traffic (not standby)
- Database writes accepted in both regions: DynamoDB Global Tables are multi-writer (last-writer-wins conflict resolution); Aurora Global Database keeps a single writer region, with write forwarding routing secondary-region writes to it
- Regional failure = traffic shifts to healthy region (automatic, seconds)
Implementation
DynamoDB Global Tables (multi-master):
aws dynamodb create-table \
--table-name users \
--attribute-definitions AttributeName=user_id,AttributeType=S \
--key-schema AttributeName=user_id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
--region us-east-1
aws dynamodb update-table \
--table-name users \
--replica-updates '[
{"Create": {"RegionName": "eu-west-1"}},
{"Create": {"RegionName": "ap-southeast-1"}}
]' \
--region us-east-1
Aurora Global Database write forwarding (lets secondary regions accept writes by forwarding them to the primary writer):
# Enable write forwarding on the secondary (read-only) cluster
aws rds modify-db-cluster \
  --db-cluster-identifier myapp-secondary \
  --enable-global-write-forwarding \
  --region eu-west-1
ElastiCache Global Datastore (Redis):
aws elasticache create-global-replication-group \
--global-replication-group-id-suffix myapp \
--primary-replication-group-id myapp-us-east-1 \
--region us-east-1
aws elasticache create-replication-group \
--replication-group-id myapp-eu-west-1 \
--global-replication-group-id myapp-global \
--replication-group-description "EU West 1 cache" \
--region eu-west-1
Route 53 health checks with automatic failover:
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "US-EAST-1",
"GeoLocation": {"ContinentCode": "NA"},
"HealthCheckId": "health-check-us-east-1",
"AliasTarget": {
"DNSName": "app-alb-us-east-1.amazonaws.com"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "EU-WEST-1",
"GeoLocation": {"ContinentCode": "EU"},
"HealthCheckId": "health-check-eu-west-1",
"AliasTarget": {
"DNSName": "app-alb-eu-west-1.amazonaws.com"
}
}
]
}
Recovery Process
Automatic failover (no manual intervention):
- Route 53 health check detects failure (30-90 seconds)
- Traffic automatically shifts to healthy region (DNS propagation 10-60 seconds)
- Users continue without interruption (may see brief latency increase)
Example: us-east-1 fails
- North America users (previously us-east-1): Automatically route to eu-west-1
- Latency increases from 50ms → 120ms, but service remains available
- Europe users (previously eu-west-1): No change
Manual verification:
- Confirm traffic shifted successfully (CloudWatch metrics)
- Verify database replication lag (should remain <1s)
- Monitor error rates (should remain low)
Total RTO: 1-3 minutes (health check + DNS propagation)
Cost
Region 1 (us-east-1): $15,000/month Region 2 (eu-west-1): $15,000/month Data replication:
- Aurora Global DB: $800/month
- DynamoDB Global Tables: $1,200/month (replicated writes)
- ElastiCache Global Datastore: $400/month
- S3 CRR: $500/month
- Total replication: $2,900/month
Total cost: $32,900/month (119% increase over single-region, 2.2× total)
When to use:
- Mission-critical applications (cannot tolerate downtime)
- Global user base (latency optimization + HA)
- Regulatory compliance (data residency in multiple regions)
- Can justify 2× cost for near-zero RTO/RPO
Data Backup Strategies
Database Backups
RDS automated backups:
- Retention: 7-35 days (configurable)
- Frequency: Daily full backup + transaction logs every 5 minutes
- RPO: 5 minutes (point-in-time recovery)
- Cost: Free (included in RDS pricing)
Manual snapshots:
- Retention: Indefinite (until manually deleted)
- Use case: Pre-upgrade snapshots, long-term retention
- Cost: $0.095/GB-month (same as automated backups after retention period)
Cross-region snapshot copy:
# Configure automated backups and log exports on the source instance
# (automated cross-region backup replication is enabled separately, from the
#  destination region, with start-db-instance-automated-backups-replication)
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 7 \
  --copy-tags-to-snapshot \
  --cloudwatch-logs-export-configuration '{"EnableLogTypes":["error","general","slowquery"]}'
# Manual cross-region copy
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:mydb-2024-11-15 \
--target-db-snapshot-identifier mydb-dr-2024-11-15 \
--source-region us-east-1 \
--region us-west-2 \
--kms-key-id arn:aws:kms:us-west-2:123456789012:key/abc-123
File System Backups
EBS snapshots:
- Frequency: Daily/hourly with Data Lifecycle Manager
- Retention: 7-30 days typical
- Incremental: Only changed blocks stored
- Cost: $0.05/GB-month
EFS backups (AWS Backup):
aws backup create-backup-plan \
--backup-plan '{
"BackupPlanName": "EFSDaily",
"Rules": [{
"RuleName": "DailyBackup",
"TargetBackupVaultName": "Default",
"ScheduleExpression": "cron(0 5 * * ? *)",
"StartWindowMinutes": 60,
"CompletionWindowMinutes": 120,
"Lifecycle": {
"DeleteAfterDays": 30
}
}]
}'
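Note that a backup plan protects nothing until resources are assigned to it. Below is a hedged boto3 sketch of the companion selection; the plan ID, IAM role ARN, and file-system ARN are placeholders.
import boto3

backup = boto3.client("backup", region_name="us-east-1")

# Assign the EFS file system to the plan created above; without a selection,
# the plan has no resources to back up. ARNs below are placeholders.
backup.create_backup_selection(
    BackupPlanId="<backup-plan-id-returned-by-create-backup-plan>",
    BackupSelection={
        "SelectionName": "efs-daily",
        "IamRoleArn": "arn:aws:iam::123456789012:role/aws-backup-service-role",
        "Resources": [
            "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-12345678"
        ],
    },
)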
Application Data Backups
S3 versioning + lifecycle policies:
{
"Rules": [
{
"Id": "Archive old versions",
"Status": "Enabled",
"Filter": {},
"NoncurrentVersionTransitions": [
{
"NoncurrentDays": 30,
"StorageClass": "STANDARD_IA"
},
{
"NoncurrentDays": 90,
"StorageClass": "GLACIER"
}
],
"NoncurrentVersionExpiration": {
"NoncurrentDays": 365
}
}
]
}
DynamoDB backups:
# Enable point-in-time recovery (PITR)
aws dynamodb update-continuous-backups \
--table-name users \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
# PITR provides 35-day recovery window
# Cost: billed per GB-month of the table's backed-up data (PITR is not free; check current pricing)
# On-demand backups for long-term retention
aws dynamodb create-backup \
--table-name users \
--backup-name users-backup-2024-11-15
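Restoring with PITR always creates a new table, which the application must then be repointed to. A minimal boto3 sketch, with example table names and timestamp:
import boto3
from datetime import datetime, timezone

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Restore the table to its state just before the accidental delete.
# PITR restores into a NEW table; repoint the application afterwards.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="users",
    TargetTableName="users-restored-2024-11-15",
    RestoreDateTime=datetime(2024, 11, 15, 13, 45, tzinfo=timezone.utc),
)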
Testing and Validation
Disaster Recovery Testing Cadence
Quarterly full failover tests:
- Fail over to DR region
- Verify RTO/RPO targets met
- Document actual recovery time vs target
- Identify gaps, update runbooks
Monthly component tests:
- Test database restore from backup
- Test AMI launch in DR region
- Test Route 53 failover routing
- Validate monitoring/alerting
Weekly backup validation:
- Verify automated backups completing
- Test restore of recent backup (non-production)
- Confirm cross-region replication working
DR Test Plan Template
DR Test: Quarterly Failover to us-west-2
Date: 2024-11-15
Participants: Ops team, Engineering leads, Security
Objectives:
- Verify RTO < 4 hours (target)
- Verify RPO < 1 hour (target)
- Test runbook accuracy
- Identify improvement opportunities
Pre-Test Checklist:
- [ ] Notify stakeholders (internal users, customers if needed)
- [ ] Verify backups current (< 24h old)
- [ ] Confirm DR region resources healthy
- [ ] Schedule test window (low-traffic period)
Test Steps:
1. [ ] T+0min: Simulate primary region failure (block health check)
2. [ ] T+3min: Verify Route 53 failover triggered
3. [ ] T+5min: Promote RDS read replica in us-west-2
4. [ ] T+15min: Scale up EC2 Auto Scaling Group
5. [ ] T+20min: Run smoke tests (critical user journeys)
6. [ ] T+30min: Verify application fully functional
7. [ ] T+60min: Monitor error rates, latency, throughput
8. [ ] T+240min: Failback to primary region
Success Criteria:
- Application operational in DR region within 4 hours
- Data loss < 1 hour (verify last transaction timestamp)
- Error rate < 1% during failover
- No data corruption detected
Post-Test:
- Document actual RTO/RPO
- Update runbooks with lessons learned
- Create tickets for identified issues
- Schedule next test
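To keep the post-test record objective, actual RTO/RPO can be computed from logged timestamps rather than estimated. A small sketch, assuming you capture the disaster, restoration, and last-replicated-transaction times:
from datetime import datetime

def measure_rto_rpo(disaster_at, service_restored_at, last_replicated_txn_at,
                    rto_target_hours=4.0, rpo_target_hours=1.0):
    """Compare measured recovery against RTO/RPO targets; inputs are datetimes."""
    rto_hours = (service_restored_at - disaster_at).total_seconds() / 3600
    rpo_hours = (disaster_at - last_replicated_txn_at).total_seconds() / 3600
    return {
        "rto_hours": round(rto_hours, 2),
        "rto_met": rto_hours <= rto_target_hours,
        "rpo_hours": round(rpo_hours, 2),
        "rpo_met": rpo_hours <= rpo_target_hours,
    }

print(measure_rto_rpo(
    disaster_at=datetime(2024, 11, 15, 10, 0),
    service_restored_at=datetime(2024, 11, 15, 12, 30),
    last_replicated_txn_at=datetime(2024, 11, 15, 9, 50),
))  # RTO 2.5h (met), RPO 0.17h (met)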
Automated DR Validation
AWS Backup audit:
# Check backup compliance
aws backup list-backup-jobs \
--by-state COMPLETED \
--max-results 100 \
--query 'BackupJobs[?CompletionDate>=`2024-11-01`]'
# Alert if backup older than 24 hours
LAST_BACKUP=$(aws rds describe-db-snapshots \
--db-instance-identifier mydb \
--query 'DBSnapshots | sort_by(@, &SnapshotCreateTime)[-1].SnapshotCreateTime' \
--output text)
HOURS_SINCE_BACKUP=$(( ($(date +%s) - $(date -d "$LAST_BACKUP" +%s)) / 3600 ))
if [ $HOURS_SINCE_BACKUP -gt 24 ]; then
echo "WARNING: Last backup > 24 hours old"
fi
Replication lag monitoring:
# Aurora Global DB replication lag
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name AuroraGlobalDBReplicationLag \
--dimensions Name=DBClusterIdentifier,Value=myapp-secondary \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Maximum
Cost Optimization
DR Cost Comparison
Baseline (single-region): $15,000/month
| Strategy | Monthly Cost | Increase | RTO | RPO |
|---|---|---|---|---|
| Backup & Restore | $15,440 | 3% | 12-24h | 1-24h |
| Pilot Light | $17,400 | 16% | 4-12h | 5-60min |
| Warm Standby | $22,300 | 49% | 1-4h | 5-15min |
| Multi-Site | $32,900 | 119% | <1h | <1min |
Optimization Techniques
1. Tiered storage for backups:
{
"Rules": [
{
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
]
}
]
}
Savings: Standard ($0.023/GB) → Deep Archive ($0.00099/GB) = 96% reduction
2. Fargate Spot for DR compute:
Pilot light DR region:
- Fargate On-Demand: 10 vCPUs × $0.04/hour = $288/month
- Fargate Spot: 10 vCPUs × $0.012/hour = $86/month
- Savings: $202/month (70%)
3. Aurora Serverless v2 for DR:
Warm standby with Aurora Serverless v2 (scales down when idle):
- Standard Aurora: $1,800/month (always provisioned)
- Aurora Serverless v2: $500/month average (scales 0.5-1 ACU most of time)
- Savings: $1,300/month (72%)
4. Scheduled scaling for DR region:
Scale down DR region nights/weekends (if acceptable):
- Standard: 30 instances × 24h/day × 30 days = 21,600 instance-hours
- Scheduled (12h/day weekdays only): 30 instances × 12h × 20 days = 7,200 instance-hours
- Savings: 67% compute cost reduction
5. Right-size DR instances:
Production uses m5.2xlarge; DR can run m5.large (about a quarter of the hourly cost) if reduced headroom is acceptable:
- Production: 100 × m5.2xlarge × $0.384/hour × 720h ≈ $27,650/month
- DR: 30 × m5.large × $0.096/hour × 720h ≈ $2,074/month (vs ≈ $8,294/month on m5.2xlarge)
- Savings: ≈ $6,220/month (75% of DR compute)
When to Use Which DR Strategy
Decision Framework
1. Determine acceptable RTO/RPO (business impact analysis):
- Financial loss per hour of downtime
- Regulatory requirements
- Customer expectations (SLA commitments)
2. Calculate DR strategy costs (from cost comparison table)
3. Compare cost vs risk:
Example:
- Downtime cost: $100,000/hour
- Expected outages: 1 per year (based on historical AWS availability)
- Downtime without DR: 24 hours = $2.4M loss
Backup & Restore: $440/month = $5,280/year
- Prevents: ~12 hours of downtime (still 12h outage) = Saves $1.2M
- ROI: ($1.2M - $5K) / $5K = 23,900% ROI
Warm Standby: $7,300/month = $87,600/year
- Prevents: ~22 hours of downtime (2h outage) = Saves $2.2M
- ROI: ($2.2M - $88K) / $88K = 2,400% ROI
Multi-Site: $17,900/month = $214,800/year
- Prevents: ~23.5 hours of downtime (30min outage) = Saves $2.35M
- ROI: ($2.35M - $215K) / $215K = 994% ROI
Recommendation: Warm Standby (strong ROI and near-maximal net savings at a fraction of the multi-site cost, given the 1 outage/year assumption)
Use Case Matrix
| Application Type | RTO Target | RPO Target | Recommended Strategy |
|---|---|---|---|
| Internal dev/test | Days | 24h | Backup & Restore |
| Internal tools | 12-24h | 4-24h | Backup & Restore |
| CRM, analytics | 4-12h | 1-4h | Pilot Light |
| E-commerce | 1-4h | 15min-1h | Warm Standby |
| Financial trading | <1h | <1min | Multi-Site Active |
| Payment processing | <30min | <1min | Multi-Site Active |
Common Pitfalls
1. Not Testing DR Regularly
Problem: DR plan looks good on paper, but never tested. During real disaster, discover RDS snapshot restore takes 8 hours (not 2), ASG launch config references deleted AMI.
Solution: Quarterly full failover tests, document actual RTO/RPO, update runbooks.
2. Forgetting Application Dependencies
Problem: Database and compute fail over to the DR region, but the application still depends on components pinned to us-east-1 endpoints (e.g., a Redis cache, a message queue, or a third-party API allow-listed to primary-region IPs). The application fails even though the failed-over infrastructure is healthy.
Solution: Inventory all dependencies (databases, caches, queues, external APIs), ensure DR plan includes all components.
3. Insufficient DR Capacity
Problem: Primary region handles 10,000 RPS, DR region provisioned for 3,000 RPS (30% capacity). During failover, DR region overwhelmed, crashes.
Solution: Load test DR region at expected production load, ensure capacity sufficient or ASG configured to scale rapidly.
4. Not Monitoring Replication Lag
Problem: Aurora Global DB normally <1s lag, but during peak load increases to 5 minutes. Failover during high lag = 5 minutes of data loss (exceeds RPO).
Solution: Monitor replication lag, alert if exceeds threshold, understand lag patterns (increases during peak writes).
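One hedged way to implement that alert with boto3 is an alarm on the Aurora Global Database lag metric (reported in milliseconds); the threshold, secondary cluster name, and SNS topic are assumptions to adjust to your RPO:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Alarm if Aurora Global Database replication lag exceeds the RPO budget.
# The metric is in milliseconds; 60000 ms = 1 minute (adjust to your RPO).
cloudwatch.put_metric_alarm(
    AlarmName="aurora-global-replication-lag-high",
    Namespace="AWS/RDS",
    MetricName="AuroraGlobalDBReplicationLag",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "myapp-secondary"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=60000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:dr-alerts"],
)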
5. Hardcoded Region in Application
Problem:
S3_BUCKET = "myapp-data-us-east-1" # ❌ Hardcoded
Application fails over to eu-west-1 but still reads from us-east-1 S3 bucket (cross-region latency or unavailable).
Solution: Use environment variables, detect region dynamically:
import os

REGION = os.environ['AWS_REGION']
S3_BUCKET = f"myapp-data-{REGION}"
6. Ignoring Data Consistency
Problem: Multi-site active-active with DynamoDB Global Tables. Concurrent writes to same item from us-east-1 and eu-west-1 cause last-writer-wins conflict, data loss.
Solution: Design for eventual consistency, use conflict-free data structures (CRDTs), or centralize writes to single region.
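Within a single region, optimistic locking with a version attribute prevents silent lost updates; note it does not resolve cross-region replication conflicts, which still need write partitioning or a single write region. A sketch, assuming a users table with a numeric version attribute:
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def update_profile(user_id, new_email, expected_version):
    """Optimistic locking: the write only succeeds if nobody else has
    modified the item since we read it."""
    try:
        dynamodb.update_item(
            TableName="users",
            Key={"user_id": {"S": user_id}},
            UpdateExpression="SET email = :email, version = :next",
            ConditionExpression="version = :expected",
            ExpressionAttributeValues={
                ":email": {"S": new_email},
                ":expected": {"N": str(expected_version)},
                ":next": {"N": str(expected_version + 1)},
            },
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise RuntimeError("Concurrent update detected; re-read and retry")
        raise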
7. No Runbook Automation
Problem: DR runbook is 50-page Word document. During real disaster, team scrambles to follow manual steps, makes mistakes, takes 6 hours instead of 2.
Solution: Automate DR failover with scripts/Lambda, regularly test automation, keep runbook as fallback for edge cases.
8. Underestimating DNS TTL Impact
Problem: Route 53 configured with 60s TTL. Health check detects failure in 30s, but clients cached DNS response for 60s. Actual failover time: 90s instead of 30s.
Solution: Lower TTL for critical records (10-30s), accept higher Route 53 query costs.
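For non-alias records, lowering the TTL is a one-line change; alias records have no TTL of their own. A boto3 sketch with placeholder hosted zone and IP:
import boto3

route53 = boto3.client("route53")

# Lower the TTL on a non-alias record so clients re-resolve sooner after failover.
route53.change_resource_record_sets(
    HostedZoneId="Z123456",                     # placeholder hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "TTL": 30,                      # was 60s; trade-off: more DNS queries
                "ResourceRecords": [{"Value": "203.0.113.10"}],  # placeholder IP
            },
        }]
    },
)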
Key Takeaways
DR Strategies:
- Backup & Restore: RTO 12-24h, RPO 1-24h, Cost baseline +3%, use for dev/test and non-critical apps
- Pilot Light: RTO 4-12h, RPO 5-60min, Cost +16%, minimal DR infrastructure with continuous DB replication
- Warm Standby: RTO 1-4h, RPO 5-15min, Cost +49%, scaled-down DR (20-50% capacity) ready to scale up
- Multi-Site Active-Active: RTO <1h, RPO <1min, Cost +119%, full capacity in 2+ regions serving traffic
RTO/RPO Fundamentals:
- RTO = Maximum acceptable downtime (time to recover)
- RPO = Maximum acceptable data loss (time since last backup)
- Aggressive RTO/RPO exponentially increases cost (1h RTO costs 10× vs 24h RTO)
Data Replication:
- RDS automated backups: Free, 7-35 day retention, 5-minute RPO with transaction logs
- Aurora Global Database: <1s replication lag, <1min failover, read-only secondaries
- DynamoDB Global Tables: Multi-master, <1s lag, last-writer-wins conflict resolution
- S3 CRR: <15min replication (99.99% SLA with RTC), $0.02-0.035/GB
Cost Optimization:
- Tiered storage for backups: Standard → Deep Archive = 96% reduction
- Fargate Spot for DR compute: 70% discount (accept interruption risk)
- Aurora Serverless v2 for DR: 72% savings (scales down when idle)
- Right-size DR instances: Use smaller instances in DR (m5.2xlarge → m5.large) ≈ 75% lower per-instance cost
Testing:
- Quarterly full failover tests (verify RTO/RPO targets met)
- Monthly component tests (database restore, AMI launch, Route 53 failover)
- Weekly backup validation (verify automated backups completing)
- Automate DR testing with scripts/Lambda (manual runbooks error-prone)
Common Pitfalls:
- Not testing DR regularly (discover issues during real disaster)
- Forgetting application dependencies (database fails over but Redis cache doesn’t)
- Insufficient DR capacity (DR region overwhelmed during failover)
- Hardcoding region in application code (cross-region calls after failover)
- Ignoring replication lag (failover during high lag exceeds RPO)
Decision Framework:
- Calculate downtime cost per hour (revenue loss + customer loss + reputation)
- Determine acceptable RTO/RPO (business impact analysis)
- Select DR strategy (cost vs risk analysis, ROI calculation)
- Test quarterly, document actual RTO/RPO vs target
- Continuously improve (update runbooks, automate, reduce costs)
When to Use:
- Backup & Restore: Dev/test, internal tools, can tolerate 12-24h RTO, cost-sensitive
- Pilot Light: Business-critical apps, 4-12h RTO acceptable, budget-conscious
- Warm Standby: Customer-facing apps, 1-4h RTO target, can justify 50% cost increase
- Multi-Site Active-Active: Mission-critical apps, cannot tolerate downtime, global users, 2× cost acceptable
The right DR strategy balances cost, complexity, and risk tolerance based on business requirements and acceptable downtime/data loss.