Disaster Recovery on AWS for System Architects
Table of Contents
- What Problems Disaster Recovery Solves
- RTO and RPO Fundamentals
- DR Strategy Selection Framework
- Backup and Restore
- Pilot Light
- Warm Standby
- Multi-Site Active-Active
- Data Backup Strategies
- Testing and Validation
- Cost Optimization
- When to Use Which DR Strategy
- Common Pitfalls
- Key Takeaways
What Problems Disaster Recovery Solves
Disasters that cause data loss or downtime:
- Regional failures: AWS region outages (rare but happen)
- Example: US-EAST-1 outage December 2021 (7+ hours)
- Impact: 100% downtime if single-region, no DR
- Data corruption: Application bugs, ransomware, accidental deletions
- Example: DELETE FROM users (oops, forgot the WHERE clause)
- Impact: Data loss, cannot restore without backups
- Natural disasters: Hurricanes, earthquakes, floods affecting data centers
- Example: Hurricane affecting entire AWS region
- Impact: Extended outage if data center physically damaged
- Human error: Misconfigurations, accidental resource deletions
- Example: Terraform destroy in production instead of staging
- Impact: Infrastructure destroyed, requires rebuild from backups
- Security incidents: Ransomware, data breach, malicious deletion
- Example: Attacker gains admin access, deletes all RDS databases
- Impact: Data loss, compliance violations, business continuity risk
DR solves:
- Business continuity: Continue operating during/after disaster
- Data protection: Restore data to point before corruption/loss
- Compliance: Meet regulatory requirements for data retention and recovery
- Risk mitigation: Reduce financial impact of downtime
RTO and RPO Fundamentals
Definitions
Recovery Time Objective (RTO): Maximum acceptable downtime
Disaster occurs → Recovery begins → Systems operational
|<-------------- RTO ------------->|
Example: “Our RTO is 4 hours” means systems must be operational within 4 hours of disaster.
Recovery Point Objective (RPO): Maximum acceptable data loss
Last successful backup → Disaster occurs
|<------- RPO ------->|
Data created in this window is lost
Example: “Our RPO is 1 hour” means we can tolerate losing up to 1 hour of data.
Cost-Complexity Spectrum
| RTO | RPO | DR Strategy | Complexity | Cost (relative to backup/restore baseline) |
|---|---|---|---|---|
| 24h | 24h | Backup & Restore | Low | 1× (baseline) |
| 4-12h | 1-4h | Pilot Light | Medium | 2-3× |
| 1-4h | 5-15min | Warm Standby | Medium-High | 3-5× |
| <1h | <1min | Multi-Site Active-Active | High | 6-10× |
Key insight: Aggressive RTO/RPO targets exponentially increase cost and complexity.
Business Impact Analysis
Determine appropriate RTO/RPO:
- Calculate downtime cost:
- Revenue per hour: $50,000/hour
- Lost customers per hour of downtime: 100
- Customer lifetime value: $1,000
- Reputation damage: Difficult to quantify but significant
Cost of 1-hour outage: $50,000 (revenue) + $100,000 (lost customers) = $150,000
- Evaluate data loss impact:
- Financial transactions: Cannot lose any (RPO near-zero)
- User-generated content: Can tolerate 15 minutes (RPO 15min)
- Logs and analytics: Can tolerate 24 hours (RPO 24h)
- Balance cost vs risk:
- Multi-site active-active: $30,000/month to prevent $150,000 loss
- ROI: $30K/month = $360K/year to protect against $150K per outage
- If outages happen 1/year: Not worth it ($360K > $150K)
- If outages happen 3+/year: Worth it ($360K < $450K)
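The break-even arithmetic is easy to script. Below is a minimal Python sketch using the illustrative figures from this section (revenue per hour, customers lost, customer lifetime value, DR spend); every number is an assumption to be replaced with your own business inputs.
def outage_cost(hours, revenue_per_hour=50_000, customers_lost_per_hour=100,
                customer_lifetime_value=1_000):
    """Estimated cost of a single outage of the given duration (illustrative inputs)."""
    return hours * (revenue_per_hour + customers_lost_per_hour * customer_lifetime_value)

def dr_breakeven(dr_monthly_cost, outage_cost_per_incident, outages_per_year):
    """True if annual DR spend is below the expected annual outage losses."""
    annual_dr_cost = dr_monthly_cost * 12
    expected_annual_loss = outage_cost_per_incident * outages_per_year
    return annual_dr_cost < expected_annual_loss, annual_dr_cost, expected_annual_loss

# Example: multi-site active-active at $30,000/month vs a $150,000 one-hour outage
worth_it, spend, loss = dr_breakeven(30_000, outage_cost(1), outages_per_year=1)
print(worth_it, spend, loss)  # False, 360000, 150000 -> not worth it at 1 outage/year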
DR Strategy Selection Framework
Decision Matrix
Step 1: Determine RTO/RPO requirements (from business impact analysis)
Step 2: Select DR strategy:
┌─────────────────────────────────────────────────────────────┐
│ DR Strategy Selection │
├─────────────────┬───────────┬──────────┬─────────────────────┤
│ Business Need │ RTO │ RPO │ Recommended Strategy│
├─────────────────┼───────────┼──────────┼─────────────────────┤
│ Dev/Test │ Days │ 24h │ Backup & Restore │
│ Internal Tools │ 12-24h │ 4-24h │ Backup & Restore │
│ Business Apps │ 4-12h │ 1-4h │ Pilot Light │
│ Customer-Facing │ 1-4h │ 5-15min │ Warm Standby │
│ Mission-Critical│ <1h │ <1min │ Multi-Site Active │
└─────────────────┴───────────┴──────────┴─────────────────────┘
Step 3: Validate cost-benefit:
- Estimated DR cost per month: $X
- Estimated outage cost: $Y per incident
- Expected outage frequency: Z per year
- Breakeven: $X × 12 months = $Y × Z incidents
If DR cost < outage cost × frequency: Invest in DR
If DR cost > outage cost × frequency: Accept risk or choose a cheaper DR strategy
Backup and Restore
Pattern: Regular backups stored offsite, restore when needed.
RTO: 12-24 hours (time to provision infrastructure + restore data)
RPO: 1-24 hours (time since last backup)
Cost: Baseline (1×) - only pay for storage
Architecture
Production (us-east-1)
├── EC2 instances → AMIs (daily snapshots)
├── RDS database → Automated backups (daily) + Manual snapshots (weekly)
├── S3 buckets → Versioning enabled, lifecycle policies
└── EBS volumes → Daily snapshots
Backups stored in:
├── Same region (different AZ) - Fast restore, same-region risk
└── Cross-region (us-west-2) - Slow restore, regional failure protection
Implementation
RDS automated backups:
# Enable automated backups (retention 7-35 days)
aws rds modify-db-instance \
--db-instance-identifier mydb \
--backup-retention-period 7 \
--preferred-backup-window "03:00-04:00" \
--region us-east-1
# Copy automated backup to another region
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snapshot \
--target-db-snapshot-identifier mydb-snapshot-dr \
--source-region us-east-1 \
--region us-west-2
EBS snapshots (automated with Data Lifecycle Manager):
{
"PolicyDetails": {
"ResourceTypes": ["VOLUME"],
"TargetTags": [
{"Key": "Backup", "Value": "true"}
],
"Schedules": [
{
"Name": "Daily Snapshots",
"CreateRule": {
"Interval": 24,
"IntervalUnit": "HOURS",
"Times": ["03:00"]
},
"RetainRule": {
"Count": 7
},
"CopyTags": true,
"CrossRegionCopyRules": [
{
"TargetRegion": "us-west-2",
"Encrypted": true,
"RetainRule": {
"Interval": 7,
"IntervalUnit": "DAYS"
}
}
]
}
]
}
}
S3 cross-region replication (for backup):
{
"Role": "arn:aws:iam::123456789012:role/s3-replication-role",
"Rules": [
{
"Status": "Enabled",
"Priority": 1,
"Filter": {},
"Destination": {
"Bucket": "arn:aws:s3:::myapp-backup-us-west-2",
"ReplicationTime": {
"Status": "Enabled",
"Time": {"Minutes": 15}
}
},
"DeleteMarkerReplication": {"Status": "Enabled"}
}
]
}
Infrastructure as Code (backup via Git):
# Store Terraform/CloudFormation in Git
git commit -am "Production infrastructure state"
git push origin main
# In DR region, recreate from IaC
git clone https://github.com/myorg/infrastructure
cd infrastructure
terraform init
terraform plan -var-file=dr-region.tfvars
terraform apply
Recovery Process
Scenario: US-EAST-1 region failure, recover to US-WEST-2
- Provision infrastructure (4-8 hours):
# Launch EC2 instances from AMI snapshots copied to us-west-2
aws ec2 run-instances \
  --image-id ami-backup-12345 \
  --instance-type m5.large \
  --count 10 \
  --region us-west-2
- Restore database (2-6 hours):
# Restore RDS from snapshot in us-west-2
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-restored \
  --db-snapshot-identifier mydb-snapshot-dr \
  --region us-west-2
- Update DNS (5-60 minutes):
# Update Route 53 to point to us-west-2 resources
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "AliasTarget": {
          "DNSName": "us-west-2-alb.amazonaws.com",
          "HostedZoneId": "Z789012",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
- Verify and test (1-2 hours):
- Smoke tests: Critical user journeys functional
- Data integrity: Recent backup data present
- Performance: Systems handling expected load
Total RTO: 8-17 hours (infrastructure + database + DNS + testing)
Cost
Storage costs:
- RDS backups: Free up to the provisioned database storage size (within the retention period)
- EBS snapshots: $0.05/GB-month (incremental)
- S3 CRR: $0.023/GB-month (Standard) + $0.02/GB replication
Example:
- RDS: 500GB database, 7-day retention = $0 (included)
- EBS: 1TB snapshots (incremental 20% change/day) = 200GB × $0.05 = $10/month
- S3: 10TB data replicated to DR region = 10TB × $0.023 = $230/month + 10TB × $0.02 = $200 replication
- Total: $440/month
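For reference, the same storage estimate expressed as a small Python helper; the per-GB rates are the illustrative prices quoted above, not authoritative pricing.
def backup_storage_cost(ebs_changed_gb, s3_replicated_gb,
                        ebs_snapshot_rate=0.05,    # $/GB-month EBS snapshots, illustrative
                        s3_storage_rate=0.023,     # $/GB-month S3 Standard, illustrative
                        s3_transfer_rate=0.02):    # $/GB cross-region replication, illustrative
    """Rough monthly backup storage cost for the example above."""
    ebs = ebs_changed_gb * ebs_snapshot_rate
    s3 = s3_replicated_gb * (s3_storage_rate + s3_transfer_rate)
    return ebs + s3

# 200 GB of changed EBS blocks + 10 TB replicated to the DR region
print(backup_storage_cost(200, 10_000))  # 440.0 ($/month), matching the example above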
When to use:
- Non-critical applications (tolerate 12-24h RTO)
- Infrequent disasters (cost of outage < cost of more expensive DR)
- Development/test environments
Pilot Light
Pattern: Minimal infrastructure running in DR region, scaled up during disaster.
RTO: 4-12 hours (time to scale up infrastructure + restore data)
RPO: 5-60 minutes (near-real-time data replication)
Cost: 2-3× backup/restore (DR region has minimal compute + data replication)
Architecture
Primary (us-east-1)
├── EC2 Auto Scaling Group (20 instances)
├── RDS Multi-AZ (db.r5.2xlarge)
├── ElastiCache (3-node cluster)
└── Application Load Balancer
DR Region (us-west-2) - "Pilot Light"
├── EC2 Auto Scaling Group (2 instances minimum, scaled down)
├── RDS Read Replica (db.r5.2xlarge) ← Continuous replication
├── ElastiCache (1-node cluster)
└── Application Load Balancer (unhealthy targets, not receiving traffic)
Route 53 Failover Routing
├── Primary: us-east-1 ALB (health check passing)
└── Secondary: us-west-2 ALB (only used if primary fails)
Key concept: Core infrastructure (database) continuously replicating, compute minimal but ready to scale.
Implementation
RDS cross-region read replica:
# Create read replica in DR region
aws rds create-db-instance-read-replica \
--db-instance-identifier mydb-replica-us-west-2 \
--source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:mydb \
--db-instance-class db.r5.2xlarge \
--region us-west-2
Auto Scaling Group (DR region, scaled down):
{
"AutoScalingGroupName": "app-asg-dr",
"MinSize": 2,
"MaxSize": 20,
"DesiredCapacity": 2,
"LaunchTemplate": {
"LaunchTemplateId": "lt-12345",
"Version": "$Latest"
},
"VPCZoneIdentifier": "subnet-dr-1,subnet-dr-2",
"TargetGroupARNs": [
"arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/app-tg-dr/abc123"
]
}
Route 53 failover routing:
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"HealthCheckId": "health-check-us-east-1",
"AliasTarget": {
"DNSName": "app-alb-us-east-1.amazonaws.com",
"HostedZoneId": "Z123456"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"DNSName": "app-alb-us-west-2.amazonaws.com",
"HostedZoneId": "Z789012"
}
}
]
}
Recovery Process
Scenario: US-EAST-1 region failure
- Promote read replica to standalone (5-10 minutes):
aws rds promote-read-replica \
  --db-instance-identifier mydb-replica-us-west-2 \
  --region us-west-2
- Scale up Auto Scaling Group (10-20 minutes):
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name app-asg-dr \
  --desired-capacity 20 \
  --region us-west-2
- Route 53 automatic failover (1-3 minutes):
- Health check detects primary failure
- Traffic automatically routed to secondary (us-west-2)
- Verify and test (30-60 minutes):
- Database promotion complete (writable)
- ASG scaled to full capacity
- Application functional
Total RTO: 45 minutes to 1.5 hours in this well-rehearsed example; plan against the 4-12 hour pilot-light range to allow for data validation and manual steps.
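The promotion and scale-up steps above lend themselves to scripting. A minimal boto3 sketch, assuming the resource names used in this example (mydb-replica-us-west-2, app-asg-dr) and a 20-instance production capacity:
import boto3

DR_REGION = "us-west-2"

def fail_over_pilot_light():
    rds = boto3.client("rds", region_name=DR_REGION)
    autoscaling = boto3.client("autoscaling", region_name=DR_REGION)

    # Step 1: promote the cross-region read replica to a standalone, writable instance
    rds.promote_read_replica(DBInstanceIdentifier="mydb-replica-us-west-2")

    # Step 2: scale the pilot-light Auto Scaling group to full production capacity
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="app-asg-dr",
        DesiredCapacity=20,
        HonorCooldown=False,
    )

    # Step 3 (Route 53 failover) happens automatically via the health check;
    # wait until the promoted instance is available before running smoke tests.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier="mydb-replica-us-west-2")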
Cost
Primary region: $15,000/month
DR region:
- EC2: 2 instances (vs 20) = $300/month
- RDS read replica: $1,500/month (same size as primary)
- ElastiCache: $100/month (1 node vs 3)
- Data replication: $500/month
- Total DR: $2,400/month
Total cost: $17,400/month (16% increase over single-region)
When to use:
- Business-critical applications (4-12h RTO acceptable)
- Budget-conscious (2× vs 6× for active-active)
- Infrequent failovers (quarterly testing acceptable)
Warm Standby
Pattern: Scaled-down but fully functional environment in DR region, scaled up during disaster.
RTO: 1-4 hours (time to scale up to full capacity)
RPO: 5-15 minutes (near-real-time replication)
Cost: 3-5× backup/restore (DR region running at 20-50% capacity)
Architecture
Primary (us-east-1) - 100% capacity
├── EC2 Auto Scaling Group (100 instances)
├── RDS Multi-AZ (db.r5.4xlarge)
├── ElastiCache (6-node cluster)
└── ALB (100% traffic)
DR Region (us-west-2) - 30% capacity
├── EC2 Auto Scaling Group (30 instances, scaled to 100 during failover)
├── Aurora Global Database (read replica, promoted to writer during failover)
├── ElastiCache (2-node cluster, scaled to 6 during failover)
└── ALB (receiving 0% traffic, but healthy and ready)
Route 53 Latency-Based Routing (or Failover)
├── us-east-1: 100% traffic (lower latency for US users)
└── us-west-2: 0% traffic (automatic failover if us-east-1 unhealthy)
Key differences from Pilot Light:
- Warm standby: DR region running at 30-50% capacity (not minimal)
- Pilot light: DR region minimal capacity (10% or less)
Implementation
Aurora Global Database (better suited to warm standby than RDS cross-region read replicas: storage-level replication with sub-second lag and fast regional promotion):
# Create global database
aws rds create-global-cluster \
--global-cluster-identifier myapp-global \
--engine aurora-postgresql \
--engine-version 14.6
# Primary cluster (us-east-1)
aws rds create-db-cluster \
--db-cluster-identifier myapp-primary \
--engine aurora-postgresql \
--global-cluster-identifier myapp-global \
--region us-east-1
# Secondary cluster (us-west-2)
aws rds create-db-cluster \
--db-cluster-identifier myapp-secondary \
--engine aurora-postgresql \
--global-cluster-identifier myapp-global \
--region us-west-2
Auto Scaling with scheduled scaling (DR region):
{
"ScheduledActions": [
{
"ScheduledActionName": "scale-up-for-failover-test",
"Schedule": "0 9 * * 1",
"MinSize": 100,
"MaxSize": 150,
"DesiredCapacity": 100
},
{
"ScheduledActionName": "scale-down-after-test",
"Schedule": "0 18 * * 1",
"MinSize": 30,
"MaxSize": 100,
"DesiredCapacity": 30
}
]
}
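The JSON above is a logical view of the schedule; with boto3 the equivalent scheduled actions could be created roughly as follows (group name and cron expressions are the assumptions from the example):
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# Scale the DR Auto Scaling group up for the Monday failover-test window...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="app-asg-dr",
    ScheduledActionName="scale-up-for-failover-test",
    Recurrence="0 9 * * 1",     # 09:00 UTC every Monday
    MinSize=100,
    MaxSize=150,
    DesiredCapacity=100,
)

# ...and back down to warm-standby capacity afterwards
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="app-asg-dr",
    ScheduledActionName="scale-down-after-test",
    Recurrence="0 18 * * 1",    # 18:00 UTC every Monday
    MinSize=30,
    MaxSize=100,
    DesiredCapacity=30,
)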
CloudWatch alarm to auto-scale during failover:
aws cloudwatch put-metric-alarm \
--alarm-name primary-region-down-scale-dr \
--alarm-description "Scale DR region when primary fails" \
--metric-name HealthCheckStatus \
--namespace AWS/Route53 \
--statistic Minimum \
--period 60 \
--threshold 0 \
--comparison-operator LessThanThreshold \
--datapoints-to-alarm 3 \
--evaluation-periods 3 \
--alarm-actions arn:aws:sns:us-west-2:123456789012:scale-dr-region
SNS topic triggers Lambda to scale DR:
import boto3
def lambda_handler(event, context):
autoscaling = boto3.client('autoscaling', region_name='us-west-2')
rds = boto3.client('rds', region_name='us-west-2')
# Scale EC2 Auto Scaling Group to full capacity
autoscaling.set_desired_capacity(
AutoScalingGroupName='app-asg-dr',
DesiredCapacity=100
)
# Promote Aurora secondary to primary (writable)
rds.failover_global_cluster(
GlobalClusterIdentifier='myapp-global',
TargetDbClusterIdentifier='myapp-secondary'
)
return {'statusCode': 200, 'body': 'DR failover initiated'}
Recovery Process
Automated failover (triggered by Route 53 health check failure):
- Route 53 detects primary failure (1-3 minutes)
- SNS triggers Lambda (30 seconds)
- Lambda promotes Aurora secondary (30-60 seconds)
- Lambda scales EC2 ASG (5-10 minutes to reach full capacity)
- Route 53 shifts traffic to us-west-2 (1-3 minutes DNS propagation)
Total RTO: 10-18 minutes (mostly ASG scale-up time)
Manual verification:
- Confirm Aurora promotion successful (now writable)
- Verify ASG reached desired capacity
- Run smoke tests on critical user journeys
Cost
Primary region: $15,000/month
DR region (30% capacity):
- EC2: 30 instances (vs 100) = $4,500/month
- Aurora Global DB: $1,800/month (same size as primary, continuous replication)
- ElastiCache: $200/month (2 nodes vs 6)
- Data replication: $800/month
- Total DR: $7,300/month
Total cost: $22,300/month (49% increase over single-region)
When to use:
- Customer-facing applications (1-4h RTO target)
- Can justify 50% cost increase for faster recovery
- Frequent failover testing (monthly) acceptable
Multi-Site Active-Active
Pattern: Full production capacity in multiple regions, all serving traffic simultaneously.
RTO: <1 hour (typically seconds to minutes, automatic failover)
RPO: <1 minute (continuous replication with minimal lag)
Cost: 6-10× backup/restore (full capacity in 2+ regions)
Architecture
Users (Global)
↓
Route 53 Geolocation / Latency-Based Routing
↓
┌──────────────────────────────┬──────────────────────────────┐
│ us-east-1 (50% traffic) │ eu-west-1 (50% traffic) │
│ ├── EC2 ASG (100 instances)│ ├── EC2 ASG (100 instances)│
│ ├── Aurora Global (writer) │ ├── Aurora Global (writer) │
│ ├── DynamoDB Global Table │ ├── DynamoDB Global Table │
│ ├── ElastiCache (6 nodes) │ ├── ElastiCache (6 nodes) │
│ └── ALB (healthy) │ └── ALB (healthy) │
└──────────────────────────────┴──────────────────────────────┘
↑ ↑
└───── Cross-Region Replication ───┘
(Aurora, DynamoDB, S3, ElastiCache Global Datastore)
Key characteristics:
- Both regions serve production traffic (not standby)
- Database writes accepted in both regions: DynamoDB Global Tables are multi-writer (last-writer-wins conflict resolution); Aurora Global Database keeps a single writer region, with write forwarding routing secondary-region writes to it
- Regional failure = traffic shifts to healthy region (automatic, seconds)
Implementation
DynamoDB Global Tables (multi-master):
aws dynamodb create-table \
--table-name users \
--attribute-definitions AttributeName=user_id,AttributeType=S \
--key-schema AttributeName=user_id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
--region us-east-1
aws dynamodb update-table \
--table-name users \
--replica-updates '[
{"Create": {"RegionName": "eu-west-1"}},
{"Create": {"RegionName": "ap-southeast-1"}}
]' \
--region us-east-1
Aurora Global Database write forwarding (lets secondary regions accept writes by forwarding them to the primary writer):
# Enable write forwarding on the secondary (read-only) cluster
aws rds modify-db-cluster \
  --db-cluster-identifier myapp-secondary \
  --enable-global-write-forwarding \
  --region eu-west-1
ElastiCache Global Datastore (Redis):
aws elasticache create-global-replication-group \
--global-replication-group-id-suffix myapp \
--primary-replication-group-id myapp-us-east-1 \
--region us-east-1
aws elasticache create-replication-group \
--replication-group-id myapp-eu-west-1 \
--global-replication-group-id myapp-global \
--replication-group-description "EU West 1 cache" \
--region eu-west-1
Route 53 health checks with automatic failover:
{
"RecordSets": [
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "US-EAST-1",
"GeoLocation": {"ContinentCode": "NA"},
"HealthCheckId": "health-check-us-east-1",
"AliasTarget": {
"DNSName": "app-alb-us-east-1.amazonaws.com"
}
},
{
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "EU-WEST-1",
"GeoLocation": {"ContinentCode": "EU"},
"HealthCheckId": "health-check-eu-west-1",
"AliasTarget": {
"DNSName": "app-alb-eu-west-1.amazonaws.com"
}
}
]
}
Recovery Process
Automatic failover (no manual intervention):
- Route 53 health check detects failure (30-90 seconds)
- Traffic automatically shifts to healthy region (DNS propagation 10-60 seconds)
- Users continue without interruption (may see brief latency increase)
Example: us-east-1 fails
- North America users (previously us-east-1): Automatically route to eu-west-1
- Latency increases from 50ms → 120ms, but service remains available
- Europe users (previously eu-west-1): No change
Manual verification:
- Confirm traffic shifted successfully (CloudWatch metrics)
- Verify database replication lag (should remain <1s)
- Monitor error rates (should remain low)
Total RTO: 1-3 minutes (health check + DNS propagation)
Cost
Region 1 (us-east-1): $15,000/month Region 2 (eu-west-1): $15,000/month Data replication:
- Aurora Global DB: $800/month
- DynamoDB Global Tables: $1,200/month (replicated writes)
- ElastiCache Global Datastore: $400/month
- S3 CRR: $500/month
- Total replication: $2,900/month
Total cost: $32,900/month (119% increase over single-region, 2.2× total)
When to use:
- Mission-critical applications (cannot tolerate downtime)
- Global user base (latency optimization + HA)
- Regulatory compliance (data residency in multiple regions)
- Can justify 2× cost for near-zero RTO/RPO
Data Backup Strategies
Database Backups
RDS automated backups:
- Retention: 7-35 days (configurable)
- Frequency: Daily full backup + transaction logs every 5 minutes
- RPO: 5 minutes (point-in-time recovery)
- Cost: Free (included in RDS pricing)
Manual snapshots:
- Retention: Indefinite (until manually deleted)
- Use case: Pre-upgrade snapshots, long-term retention
- Cost: $0.095/GB-month (same as automated backups after retention period)
Cross-region snapshot copy:
# Configure automated backups and log exports on the source instance
# (automated cross-region backup replication is enabled separately, from the
#  destination region, with start-db-instance-automated-backups-replication)
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 7 \
  --copy-tags-to-snapshot \
  --cloudwatch-logs-export-configuration '{"EnableLogTypes":["error","general","slowquery"]}'
# Manual cross-region copy
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:mydb-2024-11-15 \
--target-db-snapshot-identifier mydb-dr-2024-11-15 \
--source-region us-east-1 \
--region us-west-2 \
--kms-key-id arn:aws:kms:us-west-2:123456789012:key/abc-123
File System Backups
EBS snapshots:
- Frequency: Daily/hourly with Data Lifecycle Manager
- Retention: 7-30 days typical
- Incremental: Only changed blocks stored
- Cost: $0.05/GB-month
EFS backups (AWS Backup):
aws backup create-backup-plan \
--backup-plan '{
"BackupPlanName": "EFSDaily",
"Rules": [{
"RuleName": "DailyBackup",
"TargetBackupVaultName": "Default",
"ScheduleExpression": "cron(0 5 * * ? *)",
"StartWindowMinutes": 60,
"CompletionWindowMinutes": 120,
"Lifecycle": {
"DeleteAfterDays": 30
}
}]
}'
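Note that a backup plan protects nothing until resources are assigned to it. Below is a hedged boto3 sketch of the companion selection; the plan ID, IAM role ARN, and file-system ARN are placeholders.
import boto3

backup = boto3.client("backup", region_name="us-east-1")

# Assign the EFS file system to the plan created above; without a selection,
# the plan has no resources to back up. ARNs below are placeholders.
backup.create_backup_selection(
    BackupPlanId="<backup-plan-id-returned-by-create-backup-plan>",
    BackupSelection={
        "SelectionName": "efs-daily",
        "IamRoleArn": "arn:aws:iam::123456789012:role/aws-backup-service-role",
        "Resources": [
            "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-12345678"
        ],
    },
)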
Application Data Backups
S3 versioning + lifecycle policies:
{
"Rules": [
{
"Id": "Archive old versions",
"Status": "Enabled",
"Filter": {},
"NoncurrentVersionTransitions": [
{
"NoncurrentDays": 30,
"StorageClass": "STANDARD_IA"
},
{
"NoncurrentDays": 90,
"StorageClass": "GLACIER"
}
],
"NoncurrentVersionExpiration": {
"NoncurrentDays": 365
}
}
]
}
DynamoDB backups:
# Enable point-in-time recovery (PITR)
aws dynamodb update-continuous-backups \
--table-name users \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
# PITR provides 35-day recovery window
# Cost: billed per GB-month of the table's backed-up data (PITR is not free; check current pricing)
# On-demand backups for long-term retention
aws dynamodb create-backup \
--table-name users \
--backup-name users-backup-2024-11-15
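Restoring with PITR always creates a new table, which the application must then be repointed to. A minimal boto3 sketch, with example table names and timestamp:
import boto3
from datetime import datetime, timezone

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Restore the table to its state just before the accidental delete.
# PITR restores into a NEW table; repoint the application afterwards.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="users",
    TargetTableName="users-restored-2024-11-15",
    RestoreDateTime=datetime(2024, 11, 15, 13, 45, tzinfo=timezone.utc),
)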
Testing and Validation
Disaster Recovery Testing Cadence
Quarterly full failover tests:
- Fail over to DR region
- Verify RTO/RPO targets met
- Document actual recovery time vs target
- Identify gaps, update runbooks
Monthly component tests:
- Test database restore from backup
- Test AMI launch in DR region
- Test Route 53 failover routing
- Validate monitoring/alerting
Weekly backup validation:
- Verify automated backups completing
- Test restore of recent backup (non-production)
- Confirm cross-region replication working
DR Test Plan Template
DR Test: Quarterly Failover to us-west-2
Date: 2024-11-15
Participants: Ops team, Engineering leads, Security
Objectives:
- Verify RTO < 4 hours (target)
- Verify RPO < 1 hour (target)
- Test runbook accuracy
- Identify improvement opportunities
Pre-Test Checklist:
- [ ] Notify stakeholders (internal users, customers if needed)
- [ ] Verify backups current (< 24h old)
- [ ] Confirm DR region resources healthy
- [ ] Schedule test window (low-traffic period)
Test Steps:
1. [ ] T+0min: Simulate primary region failure (block health check)
2. [ ] T+3min: Verify Route 53 failover triggered
3. [ ] T+5min: Promote RDS read replica in us-west-2
4. [ ] T+15min: Scale up EC2 Auto Scaling Group
5. [ ] T+20min: Run smoke tests (critical user journeys)
6. [ ] T+30min: Verify application fully functional
7. [ ] T+60min: Monitor error rates, latency, throughput
8. [ ] T+240min: Failback to primary region
Success Criteria:
- Application operational in DR region within 4 hours
- Data loss < 1 hour (verify last transaction timestamp)
- Error rate < 1% during failover
- No data corruption detected
Post-Test:
- Document actual RTO/RPO
- Update runbooks with lessons learned
- Create tickets for identified issues
- Schedule next test
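To keep the post-test record objective, actual RTO/RPO can be computed from logged timestamps rather than estimated. A small sketch, assuming you capture the disaster, restoration, and last-replicated-transaction times:
from datetime import datetime

def measure_rto_rpo(disaster_at, service_restored_at, last_replicated_txn_at,
                    rto_target_hours=4.0, rpo_target_hours=1.0):
    """Compare measured recovery against RTO/RPO targets; inputs are datetimes."""
    rto_hours = (service_restored_at - disaster_at).total_seconds() / 3600
    rpo_hours = (disaster_at - last_replicated_txn_at).total_seconds() / 3600
    return {
        "rto_hours": round(rto_hours, 2),
        "rto_met": rto_hours <= rto_target_hours,
        "rpo_hours": round(rpo_hours, 2),
        "rpo_met": rpo_hours <= rpo_target_hours,
    }

print(measure_rto_rpo(
    disaster_at=datetime(2024, 11, 15, 10, 0),
    service_restored_at=datetime(2024, 11, 15, 12, 30),
    last_replicated_txn_at=datetime(2024, 11, 15, 9, 50),
))  # RTO 2.5h (met), RPO 0.17h (met)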
Automated DR Validation
AWS Backup audit:
# Check backup compliance
aws backup list-backup-jobs \
--by-state COMPLETED \
--max-results 100 \
--query 'BackupJobs[?CompletionDate>=`2024-11-01`]'
# Alert if backup older than 24 hours
LAST_BACKUP=$(aws rds describe-db-snapshots \
--db-instance-identifier mydb \
--query 'DBSnapshots | sort_by(@, &SnapshotCreateTime)[-1].SnapshotCreateTime' \
--output text)
HOURS_SINCE_BACKUP=$(( ($(date +%s) - $(date -d "$LAST_BACKUP" +%s)) / 3600 ))
if [ $HOURS_SINCE_BACKUP -gt 24 ]; then
echo "WARNING: Last backup > 24 hours old"
fi
Replication lag monitoring:
# Aurora Global DB replication lag
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name AuroraGlobalDBReplicationLag \
--dimensions Name=DBClusterIdentifier,Value=myapp-secondary \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Maximum
Cost Optimization
DR Cost Comparison
Baseline (single-region): $15,000/month
| Strategy | Monthly Cost | Increase | RTO | RPO |
|---|---|---|---|---|
| Backup & Restore | $15,440 | 3% | 12-24h | 1-24h |
| Pilot Light | $17,400 | 16% | 4-12h | 5-60min |
| Warm Standby | $22,300 | 49% | 1-4h | 5-15min |
| Multi-Site | $32,900 | 119% | <1h | <1min |
Optimization Techniques
1. Tiered storage for backups:
{
"Rules": [
{
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
]
}
]
}
Savings: Standard ($0.023/GB) → Deep Archive ($0.00099/GB) = 96% reduction
2. Fargate Spot for DR compute:
Pilot light DR region:
- Fargate On-Demand: 10 vCPUs × $0.04/hour = $288/month
- Fargate Spot: 10 vCPUs × $0.012/hour = $86/month
- Savings: $202/month (70%)
3. Aurora Serverless v2 for DR:
Warm standby with Aurora Serverless v2 (scales down when idle):
- Standard Aurora: $1,800/month (always provisioned)
- Aurora Serverless v2: $500/month average (scales 0.5-1 ACU most of time)
- Savings: $1,300/month (72%)
4. Scheduled scaling for DR region:
Scale down DR region nights/weekends (if acceptable):
- Standard: 30 instances × 24h/day × 30 days = 21,600 instance-hours
- Scheduled (12h/day weekdays only): 30 instances × 12h × 20 days = 7,200 instance-hours
- Savings: 67% compute cost reduction
5. Right-size DR instances:
Production uses m5.2xlarge; DR can run m5.large (about a quarter of the hourly cost) if reduced headroom is acceptable:
- Production: 100 × m5.2xlarge × $0.384/hour × 720h ≈ $27,650/month
- DR: 30 × m5.large × $0.096/hour × 720h ≈ $2,074/month (vs ≈ $8,294/month on m5.2xlarge)
- Savings: ≈ $6,220/month (75% of DR compute)
When to Use Which DR Strategy
Decision Framework
1. Determine acceptable RTO/RPO (business impact analysis):
- Financial loss per hour of downtime
- Regulatory requirements
- Customer expectations (SLA commitments)
2. Calculate DR strategy costs (from cost comparison table)
3. Compare cost vs risk:
Example:
- Downtime cost: $100,000/hour
- Expected outages: 1 per year (based on historical AWS availability)
- Downtime without DR: 24 hours = $2.4M loss
Backup & Restore: $440/month = $5,280/year
- Prevents: ~12 hours of downtime (still 12h outage) = Saves $1.2M
- ROI: ($1.2M - $5K) / $5K = 23,900% ROI
Warm Standby: $7,300/month = $87,600/year
- Prevents: ~22 hours of downtime (2h outage) = Saves $2.2M
- ROI: ($2.2M - $88K) / $88K = 2,400% ROI
Multi-Site: $17,900/month = $214,800/year
- Prevents: ~23.5 hours of downtime (30min outage) = Saves $2.35M
- ROI: ($2.35M - $215K) / $215K = 994% ROI
Recommendation: Warm Standby (strong ROI and near-maximal net savings at a fraction of the multi-site cost, given the 1 outage/year assumption)
Use Case Matrix
| Application Type | RTO Target | RPO Target | Recommended Strategy |
|---|---|---|---|
| Internal dev/test | Days | 24h | Backup & Restore |
| Internal tools | 12-24h | 4-24h | Backup & Restore |
| CRM, analytics | 4-12h | 1-4h | Pilot Light |
| E-commerce | 1-4h | 15min-1h | Warm Standby |
| Financial trading | <1h | <1min | Multi-Site Active |
| Payment processing | <30min | <1min | Multi-Site Active |
Common Pitfalls
1. Not Testing DR Regularly
Problem: DR plan looks good on paper, but never tested. During real disaster, discover RDS snapshot restore takes 8 hours (not 2), ASG launch config references deleted AMI.
Solution: Quarterly full failover tests, document actual RTO/RPO, update runbooks.
2. Forgetting Application Dependencies
Problem: Database and compute fail over to the DR region, but the application still depends on components pinned to us-east-1 endpoints (e.g., a Redis cache, a message queue, or a third-party API allow-listed to primary-region IPs). The application fails even though the failed-over infrastructure is healthy.
Solution: Inventory all dependencies (databases, caches, queues, external APIs), ensure DR plan includes all components.
3. Insufficient DR Capacity
Problem: Primary region handles 10,000 RPS, DR region provisioned for 3,000 RPS (30% capacity). During failover, DR region overwhelmed, crashes.
Solution: Load test DR region at expected production load, ensure capacity sufficient or ASG configured to scale rapidly.
4. Not Monitoring Replication Lag
Problem: Aurora Global DB normally <1s lag, but during peak load increases to 5 minutes. Failover during high lag = 5 minutes of data loss (exceeds RPO).
Solution: Monitor replication lag, alert if exceeds threshold, understand lag patterns (increases during peak writes).
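One hedged way to implement that alert with boto3 is an alarm on the Aurora Global Database lag metric (reported in milliseconds); the threshold, secondary cluster name, and SNS topic are assumptions to adjust to your RPO:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Alarm if Aurora Global Database replication lag exceeds the RPO budget.
# The metric is in milliseconds; 60000 ms = 1 minute (adjust to your RPO).
cloudwatch.put_metric_alarm(
    AlarmName="aurora-global-replication-lag-high",
    Namespace="AWS/RDS",
    MetricName="AuroraGlobalDBReplicationLag",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "myapp-secondary"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=60000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:dr-alerts"],
)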
5. Hardcoded Region in Application
Problem:
S3_BUCKET = "myapp-data-us-east-1" # ❌ Hardcoded
Application fails over to eu-west-1 but still reads from us-east-1 S3 bucket (cross-region latency or unavailable).
Solution: Use environment variables, detect region dynamically:
import os

REGION = os.environ['AWS_REGION']
S3_BUCKET = f"myapp-data-{REGION}"
6. Ignoring Data Consistency
Problem: Multi-site active-active with DynamoDB Global Tables. Concurrent writes to same item from us-east-1 and eu-west-1 cause last-writer-wins conflict, data loss.
Solution: Design for eventual consistency, use conflict-free data structures (CRDTs), or centralize writes to single region.
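Within a single region, optimistic locking with a version attribute prevents silent lost updates; note it does not resolve cross-region replication conflicts, which still need write partitioning or a single write region. A sketch, assuming a users table with a numeric version attribute:
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def update_profile(user_id, new_email, expected_version):
    """Optimistic locking: the write only succeeds if nobody else has
    modified the item since we read it."""
    try:
        dynamodb.update_item(
            TableName="users",
            Key={"user_id": {"S": user_id}},
            UpdateExpression="SET email = :email, version = :next",
            ConditionExpression="version = :expected",
            ExpressionAttributeValues={
                ":email": {"S": new_email},
                ":expected": {"N": str(expected_version)},
                ":next": {"N": str(expected_version + 1)},
            },
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise RuntimeError("Concurrent update detected; re-read and retry")
        raise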
7. No Runbook Automation
Problem: DR runbook is 50-page Word document. During real disaster, team scrambles to follow manual steps, makes mistakes, takes 6 hours instead of 2.
Solution: Automate DR failover with scripts/Lambda, regularly test automation, keep runbook as fallback for edge cases.
8. Underestimating DNS TTL Impact
Problem: Route 53 configured with 60s TTL. Health check detects failure in 30s, but clients cached DNS response for 60s. Actual failover time: 90s instead of 30s.
Solution: Lower TTL for critical records (10-30s), accept higher Route 53 query costs.
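For non-alias records, lowering the TTL is a one-line change; alias records have no TTL of their own. A boto3 sketch with placeholder hosted zone and IP:
import boto3

route53 = boto3.client("route53")

# Lower the TTL on a non-alias record so clients re-resolve sooner after failover.
route53.change_resource_record_sets(
    HostedZoneId="Z123456",                     # placeholder hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "TTL": 30,                      # was 60s; trade-off: more DNS queries
                "ResourceRecords": [{"Value": "203.0.113.10"}],  # placeholder IP
            },
        }]
    },
)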
Key Takeaways
DR Strategies:
- Backup & Restore: RTO 12-24h, RPO 1-24h, Cost baseline +3%, use for dev/test and non-critical apps
- Pilot Light: RTO 4-12h, RPO 5-60min, Cost +16%, minimal DR infrastructure with continuous DB replication
- Warm Standby: RTO 1-4h, RPO 5-15min, Cost +49%, scaled-down DR (20-50% capacity) ready to scale up
- Multi-Site Active-Active: RTO <1h, RPO <1min, Cost +119%, full capacity in 2+ regions serving traffic
RTO/RPO Fundamentals:
- RTO = Maximum acceptable downtime (time to recover)
- RPO = Maximum acceptable data loss (time since last backup)
- Aggressive RTO/RPO exponentially increases cost (1h RTO costs 10× vs 24h RTO)
Data Replication:
- RDS automated backups: Free, 7-35 day retention, 5-minute RPO with transaction logs
- Aurora Global Database: <1s replication lag, <1min failover, read-only secondaries
- DynamoDB Global Tables: Multi-master, <1s lag, last-writer-wins conflict resolution
- S3 CRR: <15min replication (99.99% SLA with RTC), $0.02-0.035/GB
Cost Optimization:
- Tiered storage for backups: Standard → Deep Archive = 96% reduction
- Fargate Spot for DR compute: 70% discount (accept interruption risk)
- Aurora Serverless v2 for DR: 72% savings (scales down when idle)
- Right-size DR instances: Use smaller instances in DR (m5.2xlarge → m5.large) ≈ 75% lower per-instance cost
Testing:
- Quarterly full failover tests (verify RTO/RPO targets met)
- Monthly component tests (database restore, AMI launch, Route 53 failover)
- Weekly backup validation (verify automated backups completing)
- Automate DR testing with scripts/Lambda (manual runbooks error-prone)
Common Pitfalls:
- Not testing DR regularly (discover issues during real disaster)
- Forgetting application dependencies (database fails over but Redis cache doesn’t)
- Insufficient DR capacity (DR region overwhelmed during failover)
- Hardcoding region in application code (cross-region calls after failover)
- Ignoring replication lag (failover during high lag exceeds RPO)
Decision Framework:
- Calculate downtime cost per hour (revenue loss + customer loss + reputation)
- Determine acceptable RTO/RPO (business impact analysis)
- Select DR strategy (cost vs risk analysis, ROI calculation)
- Test quarterly, document actual RTO/RPO vs target
- Continuously improve (update runbooks, automate, reduce costs)
When to Use:
- Backup & Restore: Dev/test, internal tools, can tolerate 12-24h RTO, cost-sensitive
- Pilot Light: Business-critical apps, 4-12h RTO acceptable, budget-conscious
- Warm Standby: Customer-facing apps, 1-4h RTO target, can justify 50% cost increase
- Multi-Site Active-Active: Mission-critical apps, cannot tolerate downtime, global users, 2× cost acceptable
The right DR strategy balances cost, complexity, and risk tolerance based on business requirements and acceptable downtime/data loss.