EC2 for System Architects
Table of Contents
- What Problems EC2 Solves
- Instance Type Families and Selection
- AWS Graviton Processors
- AWS Nitro System
- Cost Optimization Strategies
- Right-Sizing with Compute Optimizer
- Security Best Practices
- High Availability and Reliability
- Storage Options
- Service Comparison: EC2 vs Lambda vs Containers
- Common Pitfalls
- Key Takeaways
What Problems EC2 Solves
Amazon EC2 (Elastic Compute Cloud) provides resizable compute capacity in the cloud, addressing fundamental infrastructure challenges that organizations face:
Infrastructure challenges solved:
- Eliminates upfront hardware investment: No need to purchase and maintain physical servers, data center space, or supporting infrastructure
- Rapid scaling: Launch instances in minutes and scale capacity up or down as requirements change
- Pay-per-use model: Pay only for compute time consumed, billed per second (with a 60-second minimum per instance)
- Complete control: Full control over operating system, networking, storage configuration, and security settings
- Global availability: Deploy across multiple geographic regions and availability zones for high availability and low latency
- Flexibility: Wide variety of instance types optimized for different workload characteristics
EC2 provides the foundation for compute workloads on AWS, offering the control of traditional servers with the flexibility and economics of cloud computing.
Instance Type Families and Selection
Instance Naming Convention
EC2 instance names follow a structured convention: family + generation + processor family + capabilities + size
Example: m7g.xlarge
- m = instance family (general purpose)
- 7 = generation
- g = Graviton processor (ARM-based)
- xlarge = size
Instance Type Families
General Purpose (M, T series)
Balanced compute, memory, and networking resources suitable for most workloads.
M-series (e.g., M5, M6i, M7g):
- Consistent performance for steady-state workloads
- Balance of compute, memory, and network resources
- Good default choice when workload characteristics are unknown
T-series (e.g., T3, T4g):
- Burstable performance instances that accumulate CPU credits during idle periods
- Cost-effective for workloads with variable usage patterns
- Ideal for development environments, code repositories, microservices
When to use:
- Web servers and application servers
- Small-to-medium databases
- Development and testing environments
- Code repositories and build servers
- Microservices with variable load
Compute Optimized (C series)
High-performance processors with the highest CPU-to-memory ratio.
Examples: C5, C6i, C7g
When to use:
- Batch processing workloads
- Media transcoding and video encoding
- High-performance web servers
- Scientific modeling and simulations
- Machine learning inference
- Gaming servers
- Ad serving engines
Trade-offs: Higher compute capacity means higher cost per hour compared to general purpose instances with similar memory.
Memory Optimized (R, X, U series)
High memory-to-CPU ratio for memory-intensive applications.
R-series (e.g., R5, R6i, R7g):
- Standard memory-optimized instances
- Good for in-memory databases and caching
X-series (e.g., X2):
- Higher memory capacity (up to 4 TB per instance)
- Optimized for memory-intensive enterprise applications
U-series:
- Ultra-high memory (up to 24 TB per instance)
- Purpose-built for large in-memory databases like SAP HANA
When to use:
- In-memory databases (Redis, Memcached)
- Real-time big data analytics
- High-performance relational databases
- SAP HANA and other in-memory ERP systems
- Large-scale data processing and ETL workloads
Storage Optimized (D, I, Im, Is series)
High sequential read/write access to very large datasets on local storage.
D-series (e.g., D3):
- Dense HDD storage for distributed file systems
I-series (e.g., I4i):
- NVMe SSD storage for low-latency workloads
Im/Is-series (e.g., Im4gn, Is4gen):
- AWS Graviton processors with high-performance local storage
When to use:
- NoSQL databases (Cassandra, MongoDB, ScyllaDB)
- Data warehousing applications
- Distributed file systems (Hadoop HDFS, MapReduce)
- Log processing and analytics
- Elasticsearch and OpenSearch clusters
Important consideration: These instances use instance store (ephemeral) storage. Data is lost when the instance stops or terminates. Implement replication and backup strategies accordingly.
Accelerated Computing (P, G, Inf, Trn series)
Hardware accelerators (GPUs, FPGAs, inference processors) for specialized workloads.
P-series (e.g., P4, P5):
- NVIDIA GPUs for machine learning training and HPC
G-series (e.g., G5):
- NVIDIA GPUs for graphics-intensive applications and machine learning inference
Inf-series (e.g., Inf2):
- AWS Inferentia chips optimized for machine learning inference
Trn-series (e.g., Trn1):
- AWS Trainium chips optimized for deep learning training
When to use:
- Machine learning model training and inference
- High-performance computing (HPC) simulations
- Graphics rendering and video processing
- Financial modeling and risk analysis
- Genomics research and computational chemistry
Selection Criteria
Choose instance types based on these key factors:
1. Workload characteristics: Identify whether your application is CPU-bound, memory-bound, I/O-bound, or requires specialized accelerators.
2. Performance requirements: Consider throughput, latency, IOPS, and network bandwidth needs.
3. Cost constraints: Balance performance needs with budget. Higher-performance instances cost more per hour.
4. Operating system compatibility: Graviton (ARM) instances only support Linux operating systems. Windows workloads require x86-based instances.
5. Regional availability: Newer instance types may not be available in all AWS regions and availability zones.
Best practice: AWS recommends testing instance types with benchmark applications. With per-second billing, experimentation is cost-effective. Launch test instances, run representative workloads, and measure actual performance before committing to long-term purchases.
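Before benchmarking, it can help to confirm programmatically that a candidate type is actually offered where you plan to deploy (selection factor 5 above). A minimal boto3 sketch; the region and instance type are placeholder assumptions:

```python
import boto3

# Sketch: list the Availability Zones in a region that offer a
# candidate instance type before launching benchmark instances.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_instance_type_offerings(
    LocationType="availability-zone",
    Filters=[{"Name": "instance-type", "Values": ["m7g.xlarge"]}],
)

for offering in response["InstanceTypeOfferings"]:
    print(f'{offering["InstanceType"]} is offered in {offering["Location"]}')
```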
AWS Graviton Processors
What is Graviton?
AWS Graviton processors are custom 64-bit processors built on Arm Neoverse cores, designed by AWS to optimize price-performance for cloud workloads. Graviton instances are identifiable by the letter “g” in the processor-family position of the instance name (e.g., m7g, c7g, r7g).
Performance and Cost Benefits
AWS Graviton delivers significant advantages over comparable x86-based instances:
Cost and performance improvements:
- Up to 40% better price-performance compared to equivalent x86-based instances
- Up to 60% less energy consumption for equivalent workloads
- Graviton2 improvements: 7x more performance, 4x more compute cores, 5x faster memory, 2x larger caches compared to first-generation Graviton
- Graviton4 improvements (latest generation): Up to 40% performance enhancement and 29% price-performance improvement over Graviton3
Real-World Results
Organizations report substantial benefits from Graviton adoption:
- National Australia Bank: Saving an estimated $1 million per month
- Pix4D: Reduced infrastructure costs by 20%
- OpenSearch workloads: 38% improvement in indexing throughput, 50% reduction in indexing latency, 40% improvement in query performance
- T4g vs T3 instances: Up to 40% better price-performance
When to Use Graviton
Ideal for:
- Linux-based workloads (containers, web applications, microservices)
- Open-source software with ARM64 support
- Cost-sensitive workloads where 40% savings matter significantly
- Sustainability initiatives (60% lower energy consumption)
- Container-based architectures using multi-architecture images
Important considerations:
- Linux only: Graviton processors only support Linux operating systems (Amazon Linux 2, Ubuntu, RHEL, etc.)
- ARM compatibility required: Applications and dependencies must be ARM64-compatible
- Container images: Build multi-architecture container images (ARM64 and AMD64) for maximum flexibility
- Built on Nitro System: All Graviton instances leverage AWS Nitro System benefits
Best practice: For new Linux workloads, default to Graviton instances unless you have specific x86 dependencies. The cost savings and performance improvements typically outweigh any migration effort.
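To survey which Graviton (arm64) instance types a region offers before migrating, a hedged boto3 sketch; the region is an assumption:

```python
import boto3

# Sketch: enumerate current-generation arm64 (Graviton) instance types
# with their vCPU and memory so migration candidates can be shortlisted.
ec2 = boto3.client("ec2", region_name="us-east-1")

paginator = ec2.get_paginator("describe_instance_types")
pages = paginator.paginate(
    Filters=[
        {"Name": "processor-info.supported-architecture", "Values": ["arm64"]},
        {"Name": "current-generation", "Values": ["true"]},
    ]
)

for page in pages:
    for itype in page["InstanceTypes"]:
        mem_gib = itype["MemoryInfo"]["SizeInMiB"] / 1024
        vcpus = itype["VCpuInfo"]["DefaultVCpus"]
        print(f'{itype["InstanceType"]}: {vcpus} vCPU, {mem_gib:.0f} GiB')
```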
AWS Nitro System
What is Nitro?
The AWS Nitro System is a collection of building blocks that offloads traditional virtualization functions to dedicated hardware and software. Nitro provides the foundation for all modern EC2 instances.
Architecture Components
Nitro Cards:
- Dedicated hardware for VPC networking
- EBS storage connectivity
- Instance storage (NVMe)
- Controller functions
Nitro Security Chip:
- Hardware-based security
- Integrated into the motherboard
- Continuous monitoring and protection
Nitro Hypervisor:
- Lightweight hypervisor
- Memory and CPU allocation
- Minimal overhead
Key Benefits
Performance: Nitro delivers practically all compute and memory resources of the host hardware to instances. Traditional hypervisors reserve 10-20% of resources for virtualization overhead. Nitro reduces this to near zero.
Security:
- Minimized attack surface: Offloading functions to dedicated hardware reduces the hypervisor attack surface
- Locked-down security model: Prohibits all administrative access, including by AWS employees
- Hardware-based protection: Sharply reduces the risk of human error, misconfiguration, and insider threats
- Continuous attestation: Nitro Security Chip continuously validates system integrity
Networking: High-speed networking (up to 400 Gbps) and EBS bandwidth (up to 80 Gbps) delivered via dedicated Nitro Cards.
Enhanced capabilities: Support for significantly more IP addresses per instance, larger instance sizes, and bare-metal instances.
AWS Nitro Enclaves
Nitro Enclaves provide isolated compute environments within EC2 instances for processing highly sensitive data.
Characteristics:
- Fully isolated: Separate virtual machines with no persistent storage, no interactive access (no SSH/RDP), no external networking
- Cryptographic attestation: Verify enclave identity and that only authorized code runs inside
- AWS KMS integration: Decrypt data inside the enclave using KMS with attestation-based access
- Low-latency communication: High-throughput communication between parent instance and enclave via local vsock sockets
- No additional cost: Only pay for the EC2 instance and associated AWS services
Use cases:
- Processing personally identifiable information (PII)
- Healthcare data processing (HIPAA compliance)
- Financial data and payment card information (PCI DSS)
- Intellectual property protection
- Digital rights management (DRM)
- Multi-party computation scenarios
How it works: Nitro Enclaves allocate CPU cores and memory from the parent EC2 instance, creating an isolated environment. Even root or administrator users on the parent instance cannot access the enclave. The enclave uses cryptographic attestation to prove to AWS KMS (or other services) that only authorized code is running, enabling secure key access.
Best practice: Use Nitro Enclaves when processing sensitive data that requires cryptographic proof of isolation. The attestation mechanism provides stronger security guarantees than traditional instance-level isolation.
Cost Optimization Strategies
Pricing Models Overview
AWS offers four distinct pricing models, each tailored to different usage patterns:
| Pricing Model | Discount | Commitment | Flexibility | Best For |
|---|---|---|---|---|
| On-Demand | 0% (baseline) | None | Maximum | Unpredictable workloads, testing |
| Savings Plans | Up to 72% | 1 or 3 years | High | Mixed workloads, evolving architectures |
| Reserved Instances | Up to 72% | 1 or 3 years | Moderate | Predictable workloads, specific configurations |
| Spot Instances | Up to 90% | None | Low (can be interrupted) | Fault-tolerant, flexible workloads |
Savings Plans (AWS Recommended)
AWS recommends Savings Plans over Reserved Instances as “the easiest and most flexible way to save money on AWS compute.”
Compute Savings Plans
Discount: Up to 66% off On-Demand pricing
Flexibility: Applies automatically to:
- EC2 instances (any family, size, region, OS, tenancy)
- AWS Fargate
- AWS Lambda
- Dedicated Hosts
How it works: Commit to a specific dollar amount per hour (e.g., $10/hour) for 1 or 3 years. AWS automatically applies the discount to your compute usage up to the commitment amount.
When to use: Best for organizations with workloads that vary across services, instance types, or regions. Ideal when your architecture is evolving and you need maximum flexibility.
EC2 Instance Savings Plans
Discount: Up to 72% off On-Demand pricing
Flexibility: Tied to a specific instance family within a region, but flexible on:
- Instance size within the family
- Operating system
- Tenancy (shared or dedicated)
Example: Commit to M5 instances in us-east-1. You can switch between m5.large, m5.xlarge, m5.2xlarge, and change between Linux and Windows.
When to use: Best for predictable EC2 workloads where you know the instance family but may need to adjust sizes or operating systems.
Reserved Instances
Standard Reserved Instances
Discount: Up to 72% off On-Demand pricing (matching EC2 Instance Savings Plans)
Flexibility:
- Can modify instance size within the same family
- Cannot change instance family, region, or OS
- Can be sold on AWS Reserved Instance Marketplace
When to use: Best for fixed, high-utilization workloads where resale flexibility on the marketplace matters. Also useful when you need instance reservations for capacity planning in specific availability zones.
Convertible Reserved Instances
Discount: Up to 66% off On-Demand pricing (lower than Standard RIs)
Flexibility:
- Can be exchanged for different instance types, families, OS, or tenancy
- Cannot be sold on the marketplace
When to use: Best for predictable workloads that may need configuration changes over the commitment period. Trade lower discount for exchange flexibility.
Spot Instances
Discount: Up to 90% off On-Demand pricing
How it works: Use spare AWS capacity that can be reclaimed with a 2-minute warning when AWS needs the capacity back.
Reliability: AWS data from March 2024 shows only 5% of Spot instances were interrupted in the previous three months. Interruptions are infrequent, not constant.
When to use:
- Fault-tolerant applications that can handle interruptions
- Batch processing workloads
- Data analysis and ETL jobs
- CI/CD build systems
- Containerized workloads with orchestration (Kubernetes, ECS)
- Stateless web services behind load balancers
When NOT to use:
- Databases or stateful applications without replication
- Real-time applications requiring guaranteed availability
- Workloads that cannot checkpoint and resume
Best practice: Combine Spot instances with On-Demand instances in Auto Scaling Groups. Use Spot for the baseline capacity and On-Demand for peak demand, or vice versa depending on your risk tolerance.
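A fault-tolerant Spot workload typically watches for the two-minute warning itself. A minimal standard-library sketch that polls the instance metadata service; the spot/instance-action document returns 404 while no interruption is scheduled, and the checkpoint logic is a placeholder:

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2: obtain a session token via PUT (TTL up to 6 hours)
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def interruption_pending(token: str) -> bool:
    # The document exists only while a reclaim is scheduled; 404 otherwise
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

while True:
    if interruption_pending(imds_token()):
        # Placeholder: checkpoint state, drain connections, deregister
        break
    time.sleep(5)
```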
Industry Insights and Recommendations
2024 adoption data:
- 53% of organizations use no commitment discounts, leaving significant savings on the table
- 38% use Savings Plans vs. 18% use Reserved Instances (2x adoption rate)
- Organizations increasingly prefer Savings Plans for flexibility over Reserved Instances
Cost optimization strategy:
- Analyze before committing: Use AWS Cost Explorer and Compute Optimizer to understand your usage patterns
- Right-size first: Ensure instances are appropriately sized before purchasing commitments
- Layer pricing models:
- Compute Savings Plans: Cover baseline compute across all services (EC2, Lambda, Fargate)
- EC2 Instance Savings Plans or RIs: Cover predictable EC2 workloads in specific families
- Spot Instances: Use for fault-tolerant, flexible workloads
- On-Demand: Fill the gap for unpredictable demand
- Start conservative: Purchase commitments covering 60-70% of baseline usage, not 100%
- Leverage Consolidated Billing: Apply Savings Plans across multiple accounts in AWS Organizations
- Review quarterly: Adjust commitments as usage patterns change
Example strategy for a typical organization:
- 60% coverage with Compute Savings Plans (flexibility across services)
- 20% coverage with EC2 Instance Savings Plans for known workloads
- 15% using Spot Instances for batch processing and testing
- 5% On-Demand for unpredictable bursts
This approach balances cost savings (up to 72% on committed capacity, up to 90% on Spot) with flexibility to adapt to changing needs.
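To sanity-check what such a mix yields overall, a small Python sketch; the per-model discount rates are illustrative assumptions, not AWS quotes:

```python
# Back-of-envelope blended discount for the example mix above.
mix = [
    (0.60, 0.50),  # Compute Savings Plans: 60% of usage at ~50% discount
    (0.20, 0.65),  # EC2 Instance Savings Plans: 20% at ~65% discount
    (0.15, 0.80),  # Spot Instances: 15% at ~80% discount
    (0.05, 0.00),  # On-Demand: 5% at list price
]

blended = sum(share * discount for share, discount in mix)
print(f"Blended discount vs. all On-Demand: {blended:.0%}")  # ~55%
```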
Right-Sizing with Compute Optimizer
What is AWS Compute Optimizer?
AWS Compute Optimizer is a free service that analyzes CloudWatch metrics to generate rightsizing recommendations for EC2 instances, Auto Scaling Groups, and EBS volumes.
How it works:
- Analyzes CloudWatch metrics over the last 14 days
- Uses machine learning to identify optimization opportunities
- Generates recommendations for instance type/size changes
- Refreshes recommendations daily
- Supports organizational, account, or regional-level configuration
March 2024 Enhancement: Memory Utilization
AWS added customizable EC2 rightsizing recommendations based on memory utilization, not just CPU metrics.
Why it matters: Previous recommendations only considered CPU, network, and disk metrics. Many applications are memory-bound, not CPU-bound. Memory-aware recommendations prevent undersizing memory-intensive workloads.
How to enable:
- Install CloudWatch Agent on EC2 instances to collect memory metrics
- Alternatively, integrate third-party observability tools (Datadog, New Relic, Dynatrace)
- Configure Compute Optimizer to use memory utilization in recommendations
Customization Options
Compute Optimizer offers four preset recommendation preferences:
| Preset | CPU Threshold | CPU Headroom | Memory Headroom | Use Case |
|---|---|---|---|---|
| AWS default | P99.5 | 10% | 10% | Balanced approach |
| AWS optimized | P90 | 15% | 15% | Prioritize stability |
| Maximum savings | P90 | 0% | 10% | Minimize cost |
| Custom | Configurable | Configurable | Configurable | Specific needs |
Headroom: Reserved capacity above observed utilization to handle traffic spikes. 10% memory headroom means if your instance uses 8 GB on average, Compute Optimizer recommends at least 8.8 GB capacity.
Right-Sizing Best Practices
1. Right-size before purchasing commitments: Avoid locking in oversized instances with Reserved Instances or Savings Plans. Right-size first, then commit.
2. Enable memory metrics: Install CloudWatch Agent to get memory-aware recommendations, not just CPU-based suggestions.
3. Use Auto Scaling Groups even for single instances: ASGs provide free monitoring, automatic replacement on failure, and easier management. Compute Optimizer provides recommendations for ASG configurations.
4. Review recommendations regularly: Check Compute Optimizer weekly or monthly. Usage patterns change as applications evolve.
5. Test before implementing: Recommendations are suggestions, not guarantees. Test recommended instance types with production workloads before making changes.
6. Consider fewer, larger instances: In Kubernetes and containerized environments, fewer larger instances reduce scheduler and API server overhead compared to many small instances.
7. Tag resources properly: Use consistent tagging to group related resources and track cost allocation. Makes it easier to identify candidates for rightsizing.
8. Set up cost alerts: Configure AWS Cost Anomaly Detection and budget alerts to catch unexpected cost increases from oversized instances.
Example Scenario
Current state: Running 10 x m5.2xlarge instances (8 vCPUs, 32 GB RAM each) for a web application.
Observed metrics: Average CPU utilization 15%, average memory utilization 40%.
Compute Optimizer recommendation: Downsize to m5.xlarge (4 vCPUs, 16 GB RAM). An m5.large would be too small on memory: 40% of 32 GB is roughly 12.8 GB in use, which exceeds the 8 GB an m5.large provides.
Analysis:
- m5.2xlarge: $0.384/hour × 10 instances = $3.84/hour ≈ $2,765/month
- m5.xlarge: $0.192/hour × 10 instances = $1.92/hour ≈ $1,382/month
- Savings: ~$1,383/month (50% reduction) while maintaining adequate CPU and memory headroom
Action: Gradually migrate instances to m5.xlarge, monitor performance, and adjust if needed. Once validated, purchase Savings Plans or Reserved Instances for the right-sized configuration.
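Recommendations like this can also be retrieved programmatically for review. A boto3 sketch, assuming the account is opted in to Compute Optimizer; the instance ARN is a placeholder:

```python
import boto3

optimizer = boto3.client("compute-optimizer", region_name="us-east-1")

response = optimizer.get_ec2_instance_recommendations(
    instanceArns=[
        "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123def4567890"
    ]
)

for rec in response["instanceRecommendations"]:
    print(f'{rec["currentInstanceType"]}: finding = {rec["finding"]}')
    for option in rec["recommendationOptions"]:
        print(f'  candidate {option["instanceType"]}, '
              f'performance risk {option.get("performanceRisk")}')
```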
Security Best Practices
IMDSv2 (Instance Metadata Service Version 2)
Why IMDSv2 is Critical
The Instance Metadata Service (IMDS) provides EC2 instances with configuration information, including temporary IAM role credentials. IMDSv1 uses a simple HTTP request that is vulnerable to Server-Side Request Forgery (SSRF) attacks.
The vulnerability: If your application has an SSRF vulnerability, an attacker can trick the application into making a request to http://169.254.169.254/latest/meta-data/iam/security-credentials/role-name and steal IAM credentials.
IMDSv2 solution: Requires a session token for metadata requests, preventing SSRF exploitation.
How IMDSv2 Works
Session-oriented security:
- Request a session token with a PUT request including a TTL header
- Use the session token in subsequent metadata requests
- Tokens have limited duration (1 second to 6 hours)
- PUT requests cannot be forwarded by typical proxy or web application layers
Why this prevents SSRF: Most SSRF attacks can only control the URL in a GET request. IMDSv2 requires a PUT request first, which is not possible in typical SSRF scenarios.
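To make the two-step flow concrete, a minimal Python sketch using only the standard library. It must run on an EC2 instance, since 169.254.169.254 is a link-local address:

```python
import urllib.request

# Step 1: PUT request for a session token. This is the step a typical
# SSRF primitive (which usually controls only a GET URL) cannot perform.
token_req = urllib.request.Request(
    "http://169.254.169.254/latest/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},  # 6-hour TTL
)
token = urllib.request.urlopen(token_req, timeout=2).read().decode()

# Step 2: present the token on every subsequent metadata read
meta_req = urllib.request.Request(
    "http://169.254.169.254/latest/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
)
print(urllib.request.urlopen(meta_req, timeout=2).read().decode())
```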
How to Enforce IMDSv2
In launch templates (recommended approach):
{
"MetadataOptions": {
"HttpTokens": "required",
"HttpPutResponseHopLimit": 1,
"HttpEndpoint": "enabled"
}
}
Key settings:
- HttpTokens: required – Enforces IMDSv2, disables IMDSv1
- HttpPutResponseHopLimit: 1 – Limits the token to a single network hop (prevents forwarding)
- HttpEndpoint: enabled – Keeps IMDS available
In CloudFormation:
MetadataOptions:
HttpTokens: required
HttpPutResponseHopLimit: 1
Monitoring and Compliance
Use AWS Config rule ec2-imdsv2-check to verify all instances enforce IMDSv2:
- Rule returns NON_COMPLIANT if HttpTokens is set to optional (allows IMDSv1)
- Rule returns COMPLIANT if HttpTokens is set to required (enforces IMDSv2)
Best practice: Implement organizational policy requiring IMDSv2 for all new instances. Use AWS Config for continuous compliance monitoring and automated remediation.
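Remediation can also be scripted. A boto3 sketch that flips instances still allowing IMDSv1 over to token-required; the instance ID is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for instance_id in ["i-0abc123def4567890"]:  # placeholder non-compliant IDs
    ec2.modify_instance_metadata_options(
        InstanceId=instance_id,
        HttpTokens="required",       # enforce IMDSv2, disable IMDSv1
        HttpPutResponseHopLimit=1,   # keep tokens from being forwarded
        HttpEndpoint="enabled",      # leave the metadata service reachable
    )
```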
IAM Roles for EC2
Use IAM Roles Instead of Access Keys
Never store AWS access keys on EC2 instances. Use IAM roles instead.
Why IAM roles are better:
- Temporary credentials: Automatically rotated every few hours
- No credential storage: No risk of credentials in code, configuration files, or AMIs
- Automatic retrieval: AWS SDK automatically retrieves credentials from IMDS
- Easier management: Change permissions by updating the role, not by rotating keys across instances
Apply Least Privilege
Grant only the minimum permissions required for the instance to function.
Bad practice:
{
"Effect": "Allow",
"Action": "*",
"Resource": "*"
}
Good practice:
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-app-bucket/*"
}
Best practice: Start with no permissions. Add only what the application needs as you discover requirements. Use AWS CloudTrail and IAM Access Analyzer to identify unused permissions.
Restrict Credential Usage
Use AWS global condition context keys to restrict where IAM role credentials can be used:
Restrict to specific VPC:
{
"Condition": {
"StringEquals": {
"aws:EC2InstanceSourceVPC": "vpc-12345678"
}
}
}
Restrict to specific private IP addresses:
{
"Condition": {
"IpAddress": {
"aws:EC2InstanceSourcePrivateIPv4": ["10.0.1.0/24"]
}
}
}
Why this matters: If an attacker exfiltrates temporary credentials from an EC2 instance, these conditions prevent the credentials from being used outside the issuing instance or VPC. Stolen credentials become useless anywhere else.
Best practice: Combine IMDSv2 (prevent credential theft) with VPC/IP restrictions (limit damage if credentials are stolen).
Security Groups vs. NACLs
Security Groups (Instance-Level, Stateful)
Characteristics:
- Operate at the instance/ENI level
- Stateful: Return traffic automatically allowed
- Only allow rules (implicit deny all)
- All rules evaluated together (not ordered)
Best practices:
- Create custom security groups with least permissive rules
- Use descriptive names indicating purpose (e.g., “web-server-sg”, “database-sg”)
- Reference other security groups in rules instead of IP addresses when possible
- Regularly audit rules to remove unnecessary access
- Never allow 0.0.0.0/0 (all internet) for databases or internal services
Example architecture:
- Web tier security group: Allow 443 from 0.0.0.0/0, allow 22 from bastion security group
- App tier security group: Allow 8080 from web tier security group
- Database security group: Allow 3306 from app tier security group
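A boto3 sketch of the chaining pattern above, admitting app-tier traffic from the web tier's security group rather than from IP ranges; the group IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0app00000000000001",  # app tier security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        # Source is the web tier's security group, not a CIDR block
        "UserIdGroupPairs": [{"GroupId": "sg-0web00000000000001"}],
    }],
)
```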
Network ACLs (Subnet-Level, Stateless)
Characteristics:
- Operate at the subnet level
- Stateless: Must explicitly allow both inbound and outbound traffic
- Support allow and deny rules
- Processed in order by rule number (lowest first)
When to use NACLs:
- Block specific IP addresses at the subnet boundary (security groups only support allow rules)
- Add defense-in-depth at the subnet level
- Implement subnet-level isolation between environments
Best practice: Use security groups as the primary access control mechanism (instance-specific, easier to manage). Use NACLs for subnet-level defense-in-depth and explicit deny requirements.
Additional Security Best Practices
Encrypt Data
EBS volumes:
- Enable EBS encryption by default at the account level
- Use AWS KMS for key management
- Encrypted snapshots create encrypted volumes
- Minimal performance impact
Data in transit:
- Use TLS/SSL for all network communication
- Terminate TLS at load balancers or on instances
- Use AWS Certificate Manager for free, auto-renewing certificates
Patch Management
Use AWS Systems Manager Patch Manager:
- Automate operating system and application patching
- Define maintenance windows for updates
- Track patch compliance across instances
- Reduces window of vulnerability exposure
Best practice: Configure automated patching for security updates during maintenance windows. Test patches in non-production environments first.
Secret Management
Never store secrets in plain text:
- Use AWS Secrets Manager for database credentials, API keys, and other secrets
- Use AWS Systems Manager Parameter Store for configuration data
- Never hardcode secrets in application code, configuration files, or AMIs
- Use IAM roles to grant instances access to specific secrets
Example: Application retrieves database credentials from Secrets Manager at startup using the instance’s IAM role, instead of storing credentials in a config file.
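A minimal boto3 sketch of that startup pattern; the secret name and its JSON fields are assumptions:

```python
import json
import boto3

# The instance's IAM role must allow secretsmanager:GetSecretValue
# on this specific secret.
secrets = boto3.client("secretsmanager", region_name="us-east-1")

secret = secrets.get_secret_value(SecretId="prod/my-app/db")
creds = json.loads(secret["SecretString"])

db_user, db_password = creds["username"], creds["password"]
```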
AMI Management
Security considerations:
- Regularly audit AMI permissions (ensure AMIs aren’t public unless intended)
- Encrypt volumes before creating AMIs to prevent data exposure
- Remove sensitive data before creating AMIs
- Patch base AMIs regularly and replace instances with updated AMIs
- Use golden AMI pipelines with automated security scanning
Monitoring and Auditing
Enable comprehensive logging:
- AWS CloudTrail: Log all API calls for auditing
- VPC Flow Logs: Capture network traffic for analysis
- CloudWatch Logs: Centralize application and system logs
- AWS GuardDuty: Automated threat detection using machine learning
Best practice: Store logs in a separate security account with restricted access. Use automated analysis for anomaly detection.
High Availability and Reliability
Multi-AZ Deployments
Understanding Availability Zones
Availability Zones (AZs) are distinct locations within an AWS Region:
- Engineered to be isolated from failures in other AZs
- Connected via low-latency, high-throughput, redundant networking
- Each Region has multiple AZs (minimum 3, typically 3-6)
- AZ failures are rare but happen (power outages, network issues, natural disasters)
Why Multi-AZ Matters
Single AZ risk: If your application runs in a single AZ and that AZ becomes unavailable, your entire application goes down.
Multi-AZ benefits:
- Protection against AZ-level failures
- Enables zero-downtime maintenance (move traffic away from AZ during updates)
- Required for Route 53 Health Check DNS failover
- Better load distribution
Best practice: Always deploy across at least two Availability Zones. For critical workloads, use three AZs for additional resilience.
Auto Scaling Groups
Why Use Auto Scaling Groups
Auto Scaling Groups (ASGs) provide automated instance management, not just scaling:
Key benefits:
- Health monitoring: Continuously checks instance health
- Automatic replacement: Terminates unhealthy instances and launches replacements
- Distribution: Automatically distributes instances across multiple AZs
- Scaling: Adjusts capacity based on demand, schedules, or predictive patterns
- Load balancer integration: Registers/deregisters instances automatically
Best practice: Every EC2 instance should be in an Auto Scaling Group, even single instances. ASGs are free, and the operational benefits (automatic replacement, easier management) far outweigh any complexity.
Launch Templates (Required for New Accounts)
Important change: Accounts created after October 1, 2024 cannot create launch configurations. Launch templates are now mandatory.
Launch Templates vs. Launch Configurations:
| Feature | Launch Templates | Launch Configurations |
|---|---|---|
| Versioning | Yes | No |
| New instance types | Supported | Limited |
| Mixed instance types | Yes | No |
| Spot + On-Demand | Yes | No |
| Override per ASG | Yes | No |
| Status | Current standard | Deprecated |
Launch template benefits:
- Version control: Track changes and roll back if needed
- Support for all new EC2 features and instance types
- Mix instance types and purchase options in a single ASG
- Override specific settings per Auto Scaling Group
- JSON or YAML configuration
Migration recommendation: If you’re using launch configurations, migrate to launch templates now. Use the AWS CLI or console to create templates from existing configurations.
Auto Scaling Strategies
1. Predictive Scaling (recommended for regular patterns):
- Uses machine learning to analyze historical traffic patterns
- Proactively schedules instances before demand spikes
- Ideal for workloads with daily, weekly, or seasonal patterns
- Reduces latency by having capacity ready before needed
Example: E-commerce site sees traffic spike every day at 9 AM. Predictive scaling launches instances at 8:45 AM.
2. Dynamic Scaling:
Target tracking (recommended for most workloads):
- Maintains a specific metric target (e.g., 50% CPU utilization)
- Automatically calculates how many instances needed
- Simple to configure and manage (see the sketch after this list)
Step scaling:
- Adds/removes capacity based on CloudWatch alarm thresholds
- Different scaling amounts for different threshold breaches
- More control than target tracking
Simple scaling:
- Single adjustment based on alarm state
- Waits for cooldown period before additional scaling
- Less responsive than step scaling
3. Scheduled Scaling:
- Scales based on known time-based patterns
- Useful for predictable daily/weekly traffic changes
- Can combine with dynamic scaling
Example: Scale up Monday-Friday 8 AM - 6 PM, scale down at night and weekends.
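The target-tracking sketch referenced above, using boto3; the ASG name is a placeholder:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep average CPU across the group near 50%; Auto Scaling computes
# the instance count needed to hold the target.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```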
Auto Scaling Best Practices
Enable detailed monitoring: Use 1-minute CloudWatch metrics instead of 5-minute metrics for faster scaling response to demand changes.
Combine multiple policies: Use predictive scaling for baseline capacity, target tracking for unexpected spikes, and scheduled scaling for known patterns.
Use placement groups: Use cluster placement groups for latency-sensitive applications that need low inter-instance network latency; use partition or spread placement groups to limit correlated hardware failures.
Implement warm pools: For applications with long startup times (several minutes), use warm pools to maintain pre-initialized instances that can join the ASG quickly.
Use instance refresh: When updating launch templates, use instance refresh to gradually replace instances across the ASG, maintaining availability during updates.
Set appropriate cooldown periods: Prevent thrashing (rapid scaling up and down) by setting cooldown periods. Default is 300 seconds (5 minutes).
Load Balancer Integration
Why Integrate ASGs with Load Balancers
Load balancers distribute traffic across healthy instances, providing:
- Automatic traffic distribution
- Health checking
- SSL/TLS termination
- Connection draining during scale-in
- Cross-AZ load balancing
Load Balancer Types
Application Load Balancer (ALB):
- Operates at Layer 7 (HTTP/HTTPS)
- Advanced routing (path-based, host-based, header-based)
- WebSocket and HTTP/2 support
- Best for web applications and microservices
Network Load Balancer (NLB):
- Operates at Layer 4 (TCP/UDP/TLS)
- Ultra-low latency (microseconds)
- Static IP addresses and Elastic IPs
- Millions of requests per second
- Best for TCP/UDP workloads, gaming, IoT
Gateway Load Balancer (GWLB):
- Operates at Layer 3 (Network layer)
- Distributes traffic to third-party virtual appliances
- Best for firewalls, intrusion detection, deep packet inspection
Health Check Best Practices
Implement deep health checks, not shallow health checks.
Shallow health check problem:
# Bad: Only checks if web server responds
Health check: GET / => returns 200 OK
If the web server is running but the database is unavailable, the health check passes but the application returns errors to users. The load balancer continues sending traffic to a failing instance.
Deep health check solution (a minimal endpoint sketch appears after the configuration notes below):
# Good: Checks critical dependencies
Health check: GET /health => checks:
- Database connectivity
- Cache connectivity
- Critical API dependencies
- Returns 200 only if all dependencies are healthy
Deep health check benefits:
- Detects gray failures (partial functionality)
- Removes instances from rotation when dependencies fail
- Prevents routing traffic to instances that will return errors
- Combined with ASG, triggers automatic replacement
Health check configuration:
- Set appropriate interval (default 30 seconds)
- Set healthy/unhealthy thresholds (how many consecutive checks before changing state)
- Configure timeout less than interval
- Monitor health check failures in CloudWatch
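The minimal endpoint sketch mentioned above, in Python (Flask); the dependency probes are hypothetical placeholders to be replaced with real checks:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # Placeholder probe: replace with e.g. "SELECT 1" on the connection pool
    return True

def check_cache() -> bool:
    # Placeholder probe: replace with e.g. a cache PING
    return True

@app.route("/health")
def health():
    checks = {"database": check_database(), "cache": check_cache()}
    healthy = all(checks.values())
    # 200 only when every critical dependency is up; 503 takes the
    # instance out of rotation and can trigger ASG replacement
    status_code = 200 if healthy else 503
    return jsonify(status="ok" if healthy else "degraded", checks=checks), status_code
```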
DNS Failover with Route 53
When integrated with Route 53 Health Checks:
- Route 53 monitors load balancer endpoint health in each AZ
- If all targets in an AZ fail, Route 53 marks that AZ’s load balancer IP as unhealthy
- Removes unhealthy IPs from DNS responses
- Routes traffic only to healthy AZs
- Provides automatic regional failover
Best practice: Configure Route 53 health checks for load balancers in multi-AZ deployments. This provides DNS-level failover in addition to load balancer-level traffic distribution.
Backup and Recovery
EBS Snapshots
What are snapshots:
- Point-in-time backups of EBS volumes
- Incremental (only changed blocks stored after first snapshot)
- Stored in Amazon S3 with high durability
- Can create new volumes from snapshots in any AZ within the same Region
Best practices:
- Automate snapshots using AWS Backup or lifecycle policies
- Take snapshots during low-usage periods to minimize performance impact
- Tag snapshots with metadata (application, environment, date)
- Test restore procedures regularly
- Encrypt snapshots for sensitive data
- Use cross-region snapshot copying for disaster recovery
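For one-off cases outside AWS Backup or Data Lifecycle Manager, a boto3 sketch of a tagged snapshot; the volume ID is a placeholder:

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0abc123def4567890",
    Description=f"nightly backup {datetime.now(timezone.utc):%Y-%m-%d}",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [
            {"Key": "application", "Value": "my-app"},
            {"Key": "environment", "Value": "production"},
        ],
    }],
)
print(snapshot["SnapshotId"])
```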
Amazon Machine Images (AMIs)
What are AMIs:
- Templates including boot volume, launch configuration, and block device mapping
- Saves entire instance configuration
- Faster recovery than rebuilding from scratch
Best practices:
- Create golden AMIs with pre-configured software and security hardening
- Build AMI pipelines with automated security scanning
- Remove sensitive data before creating AMIs
- Test AMI launches regularly
- Use AMI encryption to prevent data exposure
- Deprecate old AMIs to prevent accidental use of outdated configurations
Disaster Recovery Strategies
Multi-AZ (high availability, same region):
- Protects against AZ-level failures
- RTO: Minutes, RPO: Near-zero
- Cost: Moderate (running resources in multiple AZs)
Multi-Region (disaster recovery, different regions):
- Protects against region-level failures
- RTO: Minutes to hours, RPO: Minutes
- Cost: Higher (duplicated resources across regions)
- Requires cross-region replication for data
Best practice: For critical workloads, implement multi-AZ for high availability and multi-region for disaster recovery. Use automated failover mechanisms to minimize recovery time.
Storage Options
EC2 instances can use three types of storage, each with distinct characteristics and use cases.
EBS (Elastic Block Store)
Characteristics
- Block-level storage volumes that behave like physical hard drives
- Persistent storage (data survives instance stop/termination when configured)
- Automatically replicated within its Availability Zone for durability
- Can only be attached to one instance at a time (except io2 multi-attach volumes)
- Must be in the same AZ as the EC2 instance
- Can be detached from one instance and attached to another
EBS Volume Types
General Purpose SSD (gp3, gp2):
- Balanced price and performance
- gp3: 3,000 IOPS baseline, up to 16,000 IOPS, 125-1,000 MB/s throughput
- gp2: IOPS scale with volume size (3 IOPS per GB, up to 16,000 IOPS)
- Best for: Boot volumes, development/test, small-to-medium databases, virtual desktops
Provisioned IOPS SSD (io2 Block Express, io2, io1):
- Highest performance SSD for mission-critical workloads
- io2 Block Express: Up to 256,000 IOPS and 4,000 MB/s throughput
- io2: Up to 64,000 IOPS, 99.999% durability
- io1: Up to 64,000 IOPS, 99.8-99.9% durability
- Best for: Large relational/NoSQL databases, latency-sensitive workloads, business-critical applications
Throughput Optimized HDD (st1):
- Low-cost HDD for frequently accessed, throughput-intensive workloads
- Up to 500 MB/s throughput
- Cannot be used as boot volumes
- Best for: Big data, data warehouses, log processing, streaming workloads
Cold HDD (sc1):
- Lowest-cost HDD for less frequently accessed workloads
- Up to 250 MB/s throughput
- Cannot be used as boot volumes
- Best for: Infrequently accessed data, archival storage, cold data requiring sequential reads
When to Use EBS
Ideal for:
- Boot volumes for EC2 instances
- Relational and NoSQL databases requiring persistent storage
- Applications requiring consistent, low-latency storage
- Workloads needing snapshots for backup/recovery
- Block-level storage for single instances
- Data that must survive instance stop/termination
Performance considerations:
- Use EBS-optimized instances for dedicated bandwidth between instance and EBS
- Choose appropriate volume type based on IOPS/throughput requirements
- Use io2 for latency-sensitive, high-IOPS workloads (databases)
- Use gp3 for general-purpose workloads (better price-performance than gp2)
EBS Best Practices
Encryption: Enable EBS encryption by default at the account level. Minimal performance impact.
Backups: Automate snapshots using AWS Backup or Data Lifecycle Manager.
Monitoring: Track VolumeReadOps, VolumeWriteOps, VolumeQueueLength, and BurstBalance (gp2 only) in CloudWatch.
Right-sizing: Monitor IOPS and throughput utilization. Don’t overprovision. gp3 allows independent IOPS and throughput configuration.
Multi-attach (io2 only): Enable up to 16 Nitro instances to concurrently attach to the same volume for clustered applications.
Instance Store (Ephemeral Storage)
Characteristics
- Temporary block-level storage physically attached to the host machine
- Sub-millisecond latency (fastest storage option)
- Data is lost when instance stops, terminates, hibernates, or underlying hardware fails
- Free (included with instance type)
- Size and type determined by instance type (not configurable)
- Cannot be detached and reattached to another instance
Performance
Instance store provides the lowest latency storage available on AWS:
- Direct attachment to host hardware (no network overhead)
- NVMe SSD storage on modern instance types
- Very high IOPS and throughput
- Ideal for applications requiring ultra-low latency
When to Use Instance Store
Ideal for:
- Temporary storage of constantly changing data (buffers, caches, scratch data)
- Data replicated across multiple instances (stateless web servers, load-balanced fleets)
- High-performance computing (HPC) requiring extremely low latency
- Storage-optimized workloads (I-series) for NoSQL databases with replication
- Applications that checkpoint frequently and can recover from instance loss
When NOT to use:
- Critical data that must survive instance termination
- Databases without replication to other nodes
- Data requiring backups
- Long-term persistent storage
Best Practices
Never store irreplaceable data: Treat instance store as disposable. Replicate critical data to EBS, EFS, or S3.
Use for replicated data: Configure applications (Cassandra, Elasticsearch) to replicate data across multiple instances with instance store.
RAID configurations: Use software RAID to combine multiple instance store volumes for increased performance or redundancy.
Checkpoint frequently: For long-running computations, save checkpoints to persistent storage (EBS, S3) periodically.
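A sketch of the checkpointing pattern, assuming an instance store volume mounted at /mnt/nvme and a hypothetical S3 bucket:

```python
import boto3

s3 = boto3.client("s3")

def save_checkpoint(step: int) -> None:
    # Copy the latest checkpoint off the ephemeral volume so the job can
    # resume on a new instance after interruption or hardware failure.
    s3.upload_file(
        Filename="/mnt/nvme/checkpoint.bin",  # instance store path
        Bucket="my-app-checkpoints",          # placeholder bucket
        Key=f"job-42/step-{step:08d}.bin",
    )
```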
EFS (Elastic File System)
Characteristics
- Fully managed Network File System (NFS)
- Automatically grows and shrinks based on usage (no capacity planning)
- Can be mounted concurrently by thousands of EC2 instances
- Accessible across multiple Availability Zones within a region
- Accessible from on-premises via AWS Direct Connect or VPN
- File-level storage (not block-level)
- Pay only for storage used (no pre-provisioning)
- Supports NFS v4.0 and v4.1 protocols
Performance Modes
General Purpose (default):
- Lowest latency
- Up to 7,000 file operations per second
- Best for most workloads
Max I/O:
- Higher aggregate throughput and IOPS
- Slightly higher latency
- Best for thousands of concurrent instances
Throughput Modes
Bursting (default):
- Throughput scales with storage size
- 50 MB/s per TB of storage, burst to 100 MB/s
- Credit system similar to T-series instances
Elastic (recommended for most):
- Automatically scales throughput based on workload
- No need to provision
- Pay only for throughput used
Provisioned:
- Specify throughput independent of storage size
- Useful when throughput requirements exceed bursting capacity
When to Use EFS
Ideal for:
- Shared file storage across multiple EC2 instances
- Web serving and content management systems (WordPress, Drupal)
- Development and testing environments (shared code repositories)
- Big data analytics requiring shared access
- Media processing workflows (video transcoding, rendering)
- Home directories for users
- Container storage (persistent volumes for ECS, EKS)
When NOT to use:
- Single-instance, high-IOPS databases (use EBS instead)
- Latency-sensitive applications requiring sub-millisecond access (use instance store)
- Windows workloads (EFS only supports Linux/NFS; use FSx for Windows)
EFS Best Practices
Lifecycle management: Enable EFS Lifecycle Management to automatically transition infrequently accessed files to lower-cost storage classes (Infrequent Access, Archive).
Storage classes:
- Standard: Frequently accessed files
- Infrequent Access (IA): Files not accessed for 7, 14, 30, 60, or 90 days (configurable)
- Archive: Files not accessed for 90 days (lowest cost)
Encryption: Enable encryption at rest using AWS KMS. Encryption in transit available via TLS.
Access points: Use EFS Access Points to enforce user identity and permissions, simplifying application access.
VPC security: Use security groups to control which instances can mount the file system.
Monitoring: Track ClientConnections, DataReadIOBytes, DataWriteIOBytes, and PercentIOLimit in CloudWatch.
Storage Comparison
| Feature | EBS | Instance Store | EFS |
|---|---|---|---|
| Persistence | Persistent | Ephemeral | Persistent |
| Latency | Low (milliseconds) | Lowest (sub-millisecond) | Moderate (network latency) |
| Attachment | Single instance* | Single instance | Multiple instances |
| Scope | Single AZ | Single instance | Regional (multi-AZ) |
| Capacity | 1 GiB - 16 TiB (64 TiB for io2 Block Express) | Fixed per instance | Effectively unlimited (automatic scaling) |
| Use case | Databases, boot volumes | Caches, temp data, HPC | Shared file storage |
| Cost model | Pay for provisioned capacity | Included with instance | Pay for usage |
| Replication | Within AZ only | None | Across AZs automatically |
| Snapshots | Yes (incremental) | No | AWS Backup supported |
*io2 supports multi-attach up to 16 instances
Choosing the Right Storage
Decision framework:
Does data need to survive instance termination?
- No → Instance Store
- Yes → EBS or EFS
Do multiple instances need concurrent access?
- Yes → EFS
- No → EBS or Instance Store
What are the latency requirements?
- Sub-millisecond → Instance Store
- Low (milliseconds) → EBS
- Moderate (network latency acceptable) → EFS
What’s the primary access pattern?
- Block-level, database → EBS
- Ultra-fast temporary → Instance Store
- File-level, shared → EFS
Example architectures:
- Web application: EBS for boot volume and application files, Instance Store for session caching, EFS for shared uploads/media
- Database: EBS io2 for data volumes (high IOPS), EBS gp3 for backups
- Analytics cluster: Instance Store for temporary processing, EFS for shared datasets, S3 for final results
- Container orchestration: EBS for node boot volumes, EFS for persistent volumes shared across pods
Service Comparison: EC2 vs Lambda vs Containers
AWS offers multiple compute services, each with distinct trade-offs. Understanding when to use each service is critical for system architects.
When to Choose EC2
Best for:
- Applications requiring full control over the server environment (OS, kernel modules, network configuration, storage)
- Consistently high or predictable workloads that run continuously
- Long-running tasks with variable or unpredictable execution times
- Legacy applications requiring traditional server environments
- Workloads needing specific hardware (GPUs, Graviton processors, high-memory instances)
- Applications with strict compliance requirements (HIPAA, PCI DSS) requiring dedicated infrastructure
- Monolithic applications difficult to decompose into smaller services
Advantages:
- Maximum control and flexibility
- No execution time limits
- Predictable costs for consistent workloads
- Wide variety of instance types for specific needs
- Support for any OS or runtime
Trade-offs:
- Higher operational overhead (provisioning, scaling, patching, monitoring)
- Pay for instance hours even if underutilized
- Responsibility for OS security updates and configuration management
- Manual or semi-automated scaling
Cost model: Pay per second for instance runtime (minimum 60 seconds).
When to Choose Lambda (Serverless)
Best for:
- Event-driven architectures (HTTP requests via API Gateway, S3 file uploads, DynamoDB changes, SQS messages)
- Variable or infrequent workloads with idle periods
- Short-lived functions (maximum 15 minutes / 900 seconds execution time)
- Microservices requiring minimal operational overhead
- Rapid deployment of small, on-demand applications
- Cost-sensitive intermittent workloads (pay only for execution time)
- Backend for mobile and web applications
Advantages:
- Zero server management (fully managed)
- Automatic scaling (handles 1 to 10,000+ concurrent executions)
- Pay only for actual execution time (millisecond granularity)
- Built-in high availability and fault tolerance
- Integrated with AWS services (S3, DynamoDB, Kinesis, SQS, etc.)
Limitations:
- 15-minute maximum execution time (hard limit)
- Cold start latency (100ms - 1s+) when function invoked after idle period
- Limited runtime environment customization
- Maximum deployment package size (250 MB unzipped)
- Memory allocation limits (128 MB - 10 GB)
- Difficult to debug and test locally
- Higher cost for high-frequency, long-running workloads
Cost model: Pay per request ($0.20 per 1M requests) + compute time ($0.0000166667 per GB-second). Free tier includes 1M requests and 400,000 GB-seconds per month.
Cost comparison example:
- EC2 t3.small running 24/7: ~$15/month
- Lambda equivalent (~2 GB memory): If running 24/7, ~$50/month
- Lambda intermittent (1 hour/day): ~$2/month
Lambda is cost-effective for intermittent use, expensive for continuous operation.
When to Choose Containers (ECS/EKS)
Amazon ECS (Elastic Container Service)
Best for:
- Docker container orchestration without Kubernetes complexity
- Tasks or batch jobs running longer than 15 minutes (avoids Lambda timeout)
- Teams wanting container benefits without Kubernetes learning curve
- AWS-native architectures (good integration with ALB, CloudWatch, IAM, Secrets Manager)
- Microservices requiring isolation and portability
- Mixed workloads (long-running services + batch jobs)
Advantages:
- Simpler than Kubernetes (less operational complexity)
- Deep AWS integration
- Supports both Fargate (serverless) and EC2 launch types
- No control plane management fees (unlike EKS)
- Task definitions version-controlled
ECS Launch Types:
- Fargate: Serverless container execution (no EC2 management), pay per vCPU/memory per second
- EC2: Run containers on self-managed EC2 instances, more control, lower cost for consistent workloads
Amazon EKS (Elastic Kubernetes Service)
Best for:
- Organizations already using Kubernetes or planning adoption
- Complex orchestration needs (StatefulSets, DaemonSets, custom controllers)
- Multi-cloud or hybrid cloud strategies (Kubernetes portability)
- Large-scale microservices architectures
- Advanced deployment patterns (blue/green, canary, A/B testing) using service mesh
- Need for Kubernetes ecosystem tools (Helm, Operators, Prometheus, Istio)
Advantages:
- Industry-standard Kubernetes API and tooling
- Portability across cloud providers and on-premises
- Large ecosystem of tools and community support
- Advanced scheduling and orchestration features
Trade-offs:
- Higher complexity (Kubernetes learning curve)
- Control plane costs ($0.10/hour = ~$73/month per cluster)
- More operational overhead than ECS
- Requires Kubernetes expertise
Containers General Benefits
Why choose containers (ECS or EKS)?
- Portability across environments (dev/test/prod consistency)
- Isolation with less overhead than VMs
- Efficient resource utilization (pack multiple containers per instance)
- Faster deployment than EC2 (container images vs AMI builds)
- Immutable infrastructure (version-controlled container images)
- CI/CD integration (automated builds and deployments)
- Microservices architectures with independent scaling
Decision Framework
Control Requirements
Maximum control needed (OS, kernel, hardware):
- Choose: EC2
Moderate control (application and dependencies):
- Choose: Containers (ECS/EKS)
Minimal control (just code):
- Choose: Lambda
Operational Overhead Tolerance
Want minimal operational overhead:
- Choose: Lambda (fully managed) or ECS on Fargate (serverless containers)
Willing to manage orchestration:
- Choose: ECS on EC2 or EKS
Want full infrastructure control:
- Choose: EC2
Execution Time Requirements
Under 15 minutes, event-driven:
- Consider: Lambda
Over 15 minutes or continuous operation:
- Choose: EC2 or Containers
Workload Predictability
Unpredictable, intermittent, event-driven:
- Choose: Lambda (cost-effective, auto-scales)
Predictable baseline with occasional spikes:
- Choose: Containers with auto-scaling or EC2 with Auto Scaling Groups
Consistently high utilization:
- Choose: EC2 (most cost-effective with Reserved Instances/Savings Plans)
Application Architecture
Monolithic application:
- Choose: EC2 (easier to lift-and-shift)
Microservices:
- Choose: Containers (ECS/EKS) or Lambda (for small services)
Mixed (microservices + monolith):
- Choose: Containers for flexibility
Team Expertise
Team knows Kubernetes:
- Consider: EKS
Team wants simplicity:
- Choose: Lambda (events) or ECS (containers)
Team has traditional ops experience:
- Choose: EC2 (familiar model)
Example Scenarios
E-commerce website:
- Frontend: Lambda + API Gateway (handles variable traffic)
- Product catalog: ECS on Fargate (microservices, moderate scaling)
- Order processing: ECS on EC2 (predictable load, cost-optimized)
- Database: EC2 with EBS io2 (full control, high IOPS)
Data processing pipeline:
- Ingestion: Lambda triggered by S3 uploads
- Processing: ECS tasks on Fargate (jobs run 30+ minutes)
- Analytics: EC2 with instance store (high-performance computing)
Enterprise application migration:
- Phase 1: Lift-and-shift to EC2 (minimal changes)
- Phase 2: Containerize components to ECS (modernization)
- Phase 3: Decompose to microservices, move functions to Lambda (cloud-native)
Key Insight
There’s overlap between EC2, containers, and Lambda, but each service has limitations. The right choice depends on your specific requirements for control, operational overhead, execution time, cost model, and team expertise. Many architectures use multiple compute services, selecting the best fit for each component.
Common Pitfalls
Security Misconfigurations
Storing Secrets in Plain Text
Problem: Hardcoding credentials in application code, configuration files, or AMIs exposes secrets if code is leaked or AMIs are shared.
Fix:
- Use AWS Secrets Manager for database credentials, API keys, and secrets
- Use Systems Manager Parameter Store for configuration data
- Grant instances IAM role permissions to retrieve specific secrets
- Never hardcode secrets in code, config files, or AMIs
- Rotate secrets regularly using Secrets Manager automatic rotation
Overly Permissive Security Group Rules
Problem: Security groups allowing 0.0.0.0/0 (all internet) access to databases, internal services, or SSH/RDP expose infrastructure to attacks.
Fix:
- Implement least permissive rules (only allow required sources)
- Create custom security groups for each application tier
- Use security group chaining (reference other security groups in rules)
- Regularly audit security group rules using AWS Firewall Manager
- Never allow 0.0.0.0/0 for databases or internal services
- Restrict SSH/RDP to specific bastion host security groups or IP ranges
Not Enforcing IMDSv2
Problem: IMDSv1 is vulnerable to SSRF attacks, allowing attackers to steal IAM role credentials from metadata service.
Fix:
- Enforce IMDSv2 in all launch templates with HttpTokens: required
- Disable IMDSv1 across infrastructure
- Use AWS Config rule ec2-imdsv2-check for compliance monitoring
- Implement organizational policy requiring IMDSv2 for new instances
Public AMIs with Sensitive Data
Problem: Accidentally making AMIs public exposes data baked into the AMI (credentials, configuration, proprietary software).
Fix:
- Regularly audit AMI permissions
- Enable EBS encryption by default to prevent unencrypted AMI creation
- Remove sensitive data before creating AMIs
- Use golden AMI pipelines with automated security scanning
- Review permissions before sharing AMIs
Infrastructure Management Mistakes
Manual Infrastructure Management (ClickOps)
Problem: Managing infrastructure through the console is error-prone, not reproducible, lacks version control, and doesn’t scale.
Fix:
- Use Infrastructure as Code (CloudFormation, Terraform, CDK)
- Version control infrastructure definitions in Git
- Make infrastructure reproducible and documented
- Use CI/CD pipelines for infrastructure changes
- Code review infrastructure changes before deployment
Not Using Auto Scaling Groups
Problem: Managing individual instances manually means no automatic health monitoring, no automatic replacement on failure, and difficult scaling.
Fix:
- Launch every EC2 instance inside an Auto Scaling Group, even single instances
- ASGs provide free monitoring and automatic replacement
- ASGs simplify updates using instance refresh
- ASGs enable easier scaling when requirements change
- Use launch templates for version-controlled configuration
Using Deprecated Launch Configurations
Problem: Launch configurations are deprecated and don’t support new instance types or features. Accounts created after October 1, 2024 cannot create launch configurations.
Fix:
- Migrate to launch templates immediately
- Launch templates offer versioning, mixed instance types, Spot+On-Demand mixing
- Use CloudFormation or Terraform to manage launch templates as code
- Test launch template versions before updating ASGs
Cost Management Failures
Running Underutilized Instances
Problem: Overprovisioned instances waste money. Common scenario: m5.2xlarge instance running at 10% CPU utilization.
Fix:
- Monitor CloudWatch metrics regularly (CPU, memory, network, disk)
- Use AWS Compute Optimizer for rightsizing recommendations
- Enable CloudWatch Agent for memory metrics
- Right-size before purchasing Reserved Instances or Savings Plans
- Set up cost anomaly detection alerts
Forgetting to Stop/Terminate Unused Instances
Problem: Development, testing, or POC instances left running 24/7 accumulate significant costs.
Fix:
- Implement comprehensive tagging strategy (owner, environment, purpose, expiration)
- Use AWS Instance Scheduler for dev/test environments (auto-stop nights/weekends)
- Set up cost alerts and budget notifications
- Use AWS Cost Explorer to identify unused or idle resources
- Implement auto-shutdown policies for non-production environments (see the sketch after this list)
- Regular cost review meetings to identify optimization opportunities
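A scheduled job (for example, EventBridge invoking Lambda each evening) can stop opted-in instances overnight. A minimal boto3 sketch, assuming an `auto-stop` tag as the opt-in convention (the tag name is illustrative):

```python
import boto3

ec2 = boto3.client("ec2")

# Stop every running instance that has explicitly opted in via tag.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:auto-stop", "Values": ["true"]},  # illustrative tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    inst["InstanceId"] for r in reservations for inst in r["Instances"]
]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
```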
Not Leveraging Commitment Discounts
Problem: 53% of organizations use no commitment discounts, leaving significant savings (up to 72%) on the table.
Fix:
- Analyze workload patterns using AWS Cost Explorer (see the sketch after this list)
- Purchase Savings Plans or Reserved Instances for predictable workloads
- Start conservative (cover 60-70% of baseline usage)
- Mix pricing models: Savings Plans (flexibility) + RIs (specific workloads) + Spot (fault-tolerant)
- Review and adjust quarterly as usage patterns change
- Use Consolidated Billing to apply commitments across multiple accounts
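Cost Explorer can also recommend a Savings Plans commitment sized to recent usage. A minimal boto3 sketch, assuming Cost Explorer is enabled for the account (the term, payment option, and lookback window are illustrative):

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Ask what Compute Savings Plan the last 30 days of usage would
# justify, before committing real money.
rec = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)
summary = rec["SavingsPlansPurchaseRecommendation"][
    "SavingsPlansPurchaseRecommendationSummary"
]
print("Estimated monthly savings:", summary["EstimatedMonthlySavingsAmount"])
```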
Ignoring Graviton Instances
Problem: Not evaluating Graviton instances for Linux workloads misses 40% better price-performance opportunity.
Fix:
- Evaluate Graviton instances for all new Linux workloads
- Test ARM64 compatibility for existing applications
- Build multi-architecture container images (ARM64 + AMD64)
- Start with development/testing environments to validate compatibility
- Migrate production workloads after validation
Availability and Reliability Issues
Single Availability Zone Deployments
Problem: Deploying in a single AZ creates a single point of failure. AZ outages are rare, but they do happen.
Fix:
- Always deploy across at least two Availability Zones
- Use Auto Scaling Groups with multi-AZ distribution
- Configure load balancers across multiple AZs
- Use Route 53 health checks for DNS failover
- Test failover procedures regularly
Shallow Health Checks
Problem: Health checks that only verify instance responsiveness don’t detect dependency failures (database down, API unavailable). The load balancer keeps routing traffic to instances that return errors.
Fix:
- Implement deep health checks that verify critical dependencies (see the sketch after this list)
- Don’t just check whether the instance responds; verify application health
- Health check endpoint should test database connectivity, cache availability, API dependencies
- Return 200 OK only if all critical dependencies are healthy
- Ensure health checks detect gray failures (partial functionality)
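A deep health endpoint aggregates dependency checks and fails loudly when any of them break. A minimal Flask sketch; `check_database` and `check_cache` are placeholders for your own connectivity tests:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    """Placeholder: replace with a cheap real query (e.g. SELECT 1) and a short timeout."""
    return True

def check_cache() -> bool:
    """Placeholder: replace with a PING against the cache cluster."""
    return True

@app.route("/health")
def health():
    checks = {"database": check_database(), "cache": check_cache()}
    healthy = all(checks.values())
    # Return 200 only when every critical dependency is reachable, so the
    # load balancer stops routing to instances with broken dependencies.
    status_code = 200 if healthy else 503
    return jsonify(status="ok" if healthy else "degraded", checks=checks), status_code
```

Point the load balancer's health check at this endpoint rather than at `/`.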
Not Backing Up EBS Volumes
Problem: Without backups, data loss from instance termination, volume corruption, or accidental deletion is permanent.
Fix:
- Automate EBS snapshots using AWS Backup or Data Lifecycle Manager (see the sketch after this list)
- Define retention policies aligned with recovery objectives
- Create AMIs from instances to save entire configurations
- Test restore procedures regularly (backups you can’t restore are useless)
- Use cross-region snapshot copying for disaster recovery
- Tag snapshots with metadata for easier management
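Data Lifecycle Manager can run the snapshot schedule for you once volumes are tagged. A minimal boto3 sketch, assuming a DLM execution role already exists (the role ARN and the tag are illustrative):

```python
import boto3

dlm = boto3.client("dlm")

# Snapshot every volume tagged Backup=true daily; keep 14 restore points.
dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::111122223333:role/dlm-role",  # illustrative
    Description="Daily EBS snapshots, 14-day retention",
    State="ENABLED",
    PolicyDetails={
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "Backup", "Value": "true"}],
        "Schedules": [{
            "Name": "daily",
            "CreateRule": {
                "Interval": 24,
                "IntervalUnit": "HOURS",
                "Times": ["03:00"],
            },
            "RetainRule": {"Count": 14},
        }],
    },
)
```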
Monitoring and Operations Gaps
Neglecting Patch Management
Problem: Unpatched systems are vulnerable to known exploits. Manual patching is inconsistent and error-prone.
Fix:
- Use AWS Systems Manager Patch Manager for automated patching (see the sketch after this list)
- Define maintenance windows for updates
- Test patches in non-production environments first
- Track patch compliance across instances
- Subscribe to AWS security bulletins for critical updates
- Replace instances regularly using golden AMI pipelines
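A minimal boto3 sketch that scans tagged instances against their patch baseline and then reads back compliance (the tag and instance ID are illustrative):

```python
import boto3

ssm = boto3.client("ssm")

# Scan (not install) against the assigned patch baseline; targeting by
# tag means newly launched instances are covered automatically.
ssm.send_command(
    Targets=[{"Key": "tag:environment", "Values": ["production"]}],
    DocumentName="AWS-RunPatchBaseline",
    Parameters={"Operation": ["Scan"]},
)

# Review compliance afterwards (illustrative instance ID).
states = ssm.describe_instance_patch_states(
    InstanceIds=["i-0123456789abcdef0"]
)["InstancePatchStates"]
for state in states:
    print(state["InstanceId"], state.get("MissingCount", 0), "missing patches")
```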
Insufficient Monitoring
Problem: Not monitoring key metrics means you don’t know when performance degrades, costs spike, or failures occur until users complain.
Fix:
- Monitor CPU, memory, network, and disk metrics in CloudWatch
- Set CloudWatch alarms for anomalies and threshold breaches (see the sketch after this list)
- Enable detailed monitoring (1-minute intervals) for critical workloads
- Use CloudWatch Logs for centralized log aggregation
- Implement distributed tracing (AWS X-Ray) for microservices
- Create dashboards for real-time visibility
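A minimal boto3 sketch of a threshold alarm that notifies an SNS topic (the ASG name, threshold, and topic ARN are illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Notify the on-call topic when average CPU stays above 80% for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="web-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=300,              # 5-minute periods
    EvaluationPeriods=3,     # 3 consecutive breaches
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall"],  # illustrative
)
```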
Lack of Cost Visibility
Problem: Without granular cost tracking, you can’t identify which applications, teams, or environments drive costs.
Fix:
- Implement comprehensive tagging strategy (application, environment, owner, cost-center)
- Enable Cost Allocation Tags in billing settings
- Use AWS Cost Explorer for analysis and trends (see the sketch after this list)
- Enable AWS Cost Anomaly Detection for unexpected spikes
- Create budget alerts for proactive notifications
- Regular cost review meetings with stakeholders
- Implement showback or chargeback for accountability
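Once a tag is activated as a cost allocation tag, Cost Explorer can break spend down by it. A minimal boto3 sketch, assuming a `cost-center` tag has been activated in billing settings (the dates and tag key are illustrative):

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# One month of cost, grouped by the cost-center allocation tag.
result = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},  # illustrative
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "cost-center"}],
)
for group in result["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(tag_value, amount)
```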
Key Takeaways
1. Instance selection drives cost and performance: Choose the right instance family based on workload characteristics (general purpose, compute optimized, memory optimized, storage optimized, accelerated computing). Use Graviton for 40% better price-performance on Linux workloads. Test with representative workloads before committing.
2. AWS Nitro System provides the foundation: All modern EC2 instances are built on Nitro, delivering near-zero virtualization overhead, hardware-based security, and high-performance networking. Leverage Nitro Enclaves for processing highly sensitive data with cryptographic attestation.
3. Cost optimization requires layered strategy: Mix Savings Plans (flexibility), Reserved Instances (predictable workloads), and Spot Instances (fault-tolerant workloads) to maximize savings (up to 72% for commitments, up to 90% for Spot). Right-size using Compute Optimizer before committing. 53% of organizations leave money on the table by not using commitments.
4. Security starts with IMDSv2: Enforce IMDSv2 and disable IMDSv1 to prevent SSRF attacks on instance metadata. Use IAM roles instead of long-term access keys. Apply least privilege permissions. Restrict credential usage to specific VPCs/IPs. Layer security groups (instance-level) with NACLs (subnet-level).
5. High availability requires multi-AZ architecture: Deploy across at least two Availability Zones. Use Auto Scaling Groups for every instance, even single instances (free monitoring and automatic replacement). Integrate with load balancers using deep health checks. Launch templates are now mandatory (launch configurations deprecated for new accounts).
6. Choose the right compute service: EC2 for full control and long-running workloads, Lambda for event-driven intermittent workloads under 15 minutes, containers (ECS for simplicity, EKS for Kubernetes) for microservices and portability. Many architectures use multiple compute services, selecting the best fit for each component.
7. Storage depends on use case and access patterns: EBS for persistent single-instance high-performance (databases, boot volumes), Instance Store for temporary ultra-low-latency (caches, HPC), EFS for shared multi-instance file access (web content, development environments). Choose based on persistence needs, latency requirements, and concurrent access patterns.
8. Avoid common pitfalls: Use Infrastructure as Code (not ClickOps). Monitor costs and metrics continuously. Implement comprehensive tagging. Right-size continuously. Automate patching. Use ASGs everywhere. Encrypt data at rest and in transit. Never store secrets in plain text. Implement deep health checks.
9. Integration amplifies capabilities: Leverage IAM roles for secure service access, VPC segmentation for network isolation, CloudWatch for monitoring and alerting, Systems Manager for automation and compliance, Route 53 for DNS failover, and load balancer health checks for robust architectures.
10. Testing is cost-effective with per-second billing: Test instance types, configurations, and architectures with real workloads before committing to long-term purchases. The ability to experiment cheaply is one of EC2’s greatest advantages—use it to find optimal configurations.