AWS Well-Architected Framework


Table of Contents

  1. What is the AWS Well-Architected Framework
  2. The Six Pillars
  3. How to Use the Framework
  4. Trade-Offs Between Pillars
  5. Applying the Framework to Service Selection
  6. Key Takeaways

What is the AWS Well-Architected Framework

The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing designs that scale with your application needs. It describes key concepts, design principles, and architectural best practices for designing and operating workloads in the cloud.

Purpose

The framework helps architects:

  • Evaluate trade-offs between different architectural approaches
  • Make informed decisions about service selection and design patterns
  • Identify areas for improvement in existing architectures
  • Understand the business impact of architectural decisions

Not a Checklist

The framework provides guiding principles, not mandatory requirements. Every architectural decision involves trade-offs. The framework helps you understand what you’re optimizing for and what you’re sacrificing.


The Six Pillars

The framework organizes best practices into six pillars. Each pillar represents a different aspect of building cloud systems.

1. Operational Excellence

Definition: The ability to run and monitor systems to deliver business value and continually improve supporting processes and procedures.

Core Principles:

  • Perform operations as code
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure
  • Learn from all operational events and failures

What This Means in Practice:

Infrastructure as code, automated deployments, comprehensive logging and monitoring, and blameless post-mortems. You treat operations like software development: version controlled, tested, and continuously improved.
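
For example, a CloudWatch alarm can be defined in code so it is reviewable, repeatable, and version controlled rather than clicked together in the console. A minimal sketch with boto3 (alarm name, threshold, Auto Scaling group, and SNS topic ARN are placeholders; in practice the same definition would usually live in CloudFormation or Terraform):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on sustained high CPU across a (hypothetical) Auto Scaling group.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "orders-api-asg"}],
    Statistic="Average",
    Period=300,                                # 5-minute periods
    EvaluationPeriods=3,                       # must breach for 15 minutes before alarming
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```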

Example Decisions:

  • Using CloudFormation or Terraform instead of manual console clicks
  • Implementing CI/CD pipelines with automated testing
  • Deploying blue-green or canary releases instead of big-bang deployments
  • Building dashboards and alerts in CloudWatch
  • Using Step Functions for visible, auditable workflows

Questions to Ask:

  • How do you manage changes to your infrastructure?
  • How do you monitor your workload to ensure it’s operating as expected?
  • How do you respond to unplanned operational events?
  • How do you evolve operations over time?

2. Security

Definition: The ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Core Principles:

  • Implement a strong identity foundation (principle of least privilege)
  • Enable traceability (logging and auditing)
  • Apply security at all layers (defense in depth)
  • Automate security best practices
  • Protect data in transit and at rest
  • Keep people away from data (reduce manual access)
  • Prepare for security events

What This Means in Practice:

Every AWS service interaction goes through IAM. Data is encrypted at rest and in transit. CloudTrail logs every API call. Security groups and NACLs provide network-level protection. Secrets never appear in code or logs.
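
For instance, rather than embedding a database password in code or configuration, an application can fetch it at runtime from Secrets Manager under an IAM role that allows secretsmanager:GetSecretValue. A minimal sketch with boto3 (the secret name and its JSON shape are assumptions):

```python
import json

import boto3

def get_db_credentials(secret_id: str = "prod/orders/db") -> dict:
    """Fetch credentials at runtime so nothing sensitive lives in code, config, or logs."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret is stored as a JSON string such as {"username": "...", "password": "..."}.
    return json.loads(response["SecretString"])

creds = get_db_credentials()
# Hand creds["username"] / creds["password"] to the database driver; never print or log them.
```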

Example Decisions:

  • Using IAM Identity Center for workforce access (not IAM users with access keys)
  • Using IAM roles for workloads (EC2, Lambda, ECS) instead of long-lived access keys
  • Enabling MFA for human users (mandatory for root users as of 2024)
  • Encrypting S3 buckets with KMS
  • Storing database credentials in Secrets Manager
  • Using VPC security groups, NACLs, and PrivateLink for defense in depth
  • Enabling GuardDuty for threat detection
  • Implementing WAF rules on CloudFront and ALB

Questions to Ask:

  • How do you control access to your AWS resources?
  • How do you protect your data at rest and in transit?
  • How do you detect and respond to security events?
  • How do you keep your workload secure over time?

3. Reliability

Definition: The ability of a workload to perform its intended function correctly and consistently when expected, including recovering from failures.

Core Principles:

  • Automatically recover from failure
  • Test recovery procedures
  • Scale horizontally to increase aggregate workload availability
  • Stop guessing capacity (use auto-scaling)
  • Manage change through automation

What This Means in Practice:

Systems are designed to survive failures of individual components. Multi-AZ deployments prevent single availability zone failures from causing outages. Auto Scaling handles traffic spikes. Automated backups enable point-in-time recovery. Health checks automatically replace failed instances.
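
As a sketch of what this looks like in code, the boto3 call below provisions a PostgreSQL instance with a synchronous standby in a second Availability Zone and automated backups. Identifiers, instance class, and sizes are placeholders, and in practice this would typically be declared in CloudFormation or Terraform:

```python
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",     # placeholder identifier
    Engine="postgres",
    DBInstanceClass="db.r6g.large",       # placeholder instance class
    AllocatedStorage=100,                 # GiB
    MultiAZ=True,                         # synchronous standby in a second AZ, automatic failover
    BackupRetentionPeriod=7,              # keep automated daily backups for 7 days
    StorageEncrypted=True,
    MasterUsername="app_admin",
    ManageMasterUserPassword=True,        # RDS generates the password and stores it in Secrets Manager
)
```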

Example Decisions:

  • Deploying RDS with Multi-AZ enabled
  • Using Auto Scaling groups across multiple availability zones
  • Implementing health checks and automatic failover with ALB
  • Taking automated backups with retention policies
  • Using Route 53 health checks and failover routing
  • Designing for eventual consistency where appropriate
  • Testing chaos engineering scenarios (intentionally breaking things)

Questions to Ask:

  • How do you handle failures in your workload?
  • How do you design your workload to meet availability targets?
  • How do you test reliability?
  • How do you recover from failures?

4. Performance Efficiency

Definition: The ability to use computing resources efficiently to meet system requirements and maintain efficiency as demand changes and technologies evolve.

Core Principles:

  • Democratize advanced technologies (use managed services)
  • Go global in minutes (deploy to multiple regions)
  • Use serverless architectures
  • Experiment more often
  • Consider mechanical sympathy (understand how services work)

What This Means in Practice:

Choose the right compute type for the workload. Use caching to reduce latency. Leverage CDNs for global content delivery. Monitor performance metrics and optimize based on data, not assumptions. Take advantage of managed services that automatically scale and optimize.
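
As a minimal illustration of caching, the cache-aside sketch below serves repeated reads from memory and only falls back to the datastore on a miss. A plain dictionary keeps the example self-contained; in production the cache would usually be ElastiCache or DynamoDB DAX so it is shared across instances, and the fetch function here is a hypothetical stand-in:

```python
import time

def fetch_product_from_database(product_id: str) -> dict:
    """Stand-in for the real query against RDS or DynamoDB."""
    time.sleep(0.05)  # simulate a ~50 ms round trip
    return {"product_id": product_id, "name": "example"}

_cache: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 300  # expire cached entries after 5 minutes

def get_product(product_id: str) -> dict:
    """Cache-aside read: serve hot keys from the cache, refresh from the datastore on a miss."""
    now = time.time()
    hit = _cache.get(product_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                       # cache hit: no database round trip
    product = fetch_product_from_database(product_id)
    _cache[product_id] = (now, product)
    return product
```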

Example Decisions:

  • Using Lambda for event-driven workloads instead of always-on EC2 instances
  • Choosing the right EC2 instance type (compute-optimized, memory-optimized, etc.)
  • Implementing CloudFront for global content delivery
  • Using DynamoDB DAX or ElastiCache for caching
  • Selecting RDS vs. DynamoDB based on access patterns
  • Using S3 Transfer Acceleration for large file uploads
  • Enabling Aurora Auto Scaling for read replicas

Questions to Ask:

  • How do you select the best performing architecture?
  • How do you monitor performance over time?
  • How do you use advanced technologies to improve performance?
  • How do you evolve your workload to take advantage of new AWS services?

5. Cost Optimization

Definition: The ability to run systems to deliver business value at the lowest price point while avoiding unnecessary costs.

Core Principles:

  • Implement cloud financial management
  • Adopt a consumption model (pay only for what you use)
  • Measure overall efficiency
  • Stop spending money on undifferentiated heavy lifting (use managed services)
  • Analyze and attribute expenditure

What This Means in Practice:

Right-size resources based on actual usage. Use Reserved Instances or Savings Plans for predictable workloads. Shut down non-production environments outside business hours. Leverage spot instances for fault-tolerant workloads. Monitor costs with AWS Cost Explorer and set budgets.
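
As a sketch of measuring and attributing spend, the boto3 call below pulls one month of unblended cost grouped by service from the Cost Explorer API (the date range is a placeholder, and each Cost Explorer API request incurs a small per-request charge):

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},  # placeholder month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print spend per service for the month, highest first.
groups = response["ResultsByTime"][0]["Groups"]
for group in sorted(groups, key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True):
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{group['Keys'][0]}: ${amount:,.2f}")
```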

Example Decisions:

  • Purchasing Savings Plans for steady-state workloads
  • Using Lambda instead of always-on EC2 for infrequent tasks
  • Implementing auto-scaling to scale down during low traffic
  • Moving infrequently accessed data to S3 Glacier
  • Using spot instances for batch processing
  • Right-sizing EC2 instances based on CloudWatch metrics
  • Enabling S3 Intelligent-Tiering for automated cost optimization
  • Deleting unused EBS volumes and snapshots

Questions to Ask:

  • How do you govern usage and manage costs?
  • How do you monitor and control spending?
  • How do you select the most cost-effective resources?
  • How do you optimize over time?

6. Sustainability

Definition: The ability to continually reduce the environmental impact of running a workload by lowering energy consumption and increasing efficiency across all of its components.

Core Principles:

  • Understand your impact (measure carbon footprint)
  • Establish sustainability goals
  • Maximize utilization (reduce idle resources)
  • Anticipate and adopt more efficient hardware and software
  • Use managed services (AWS optimizes infrastructure efficiency)
  • Reduce downstream impact (efficient data transfer and storage)

What This Means in Practice:

Choose regions powered by renewable energy. Use auto-scaling to avoid idle capacity. Leverage serverless and managed services that share infrastructure efficiently. Optimize data storage and transfer to reduce energy consumption.
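
For example, an S3 lifecycle rule can move cold objects to denser archive storage and eventually delete data nobody reads any more. A sketch with boto3 (bucket name, prefix, and retention periods are placeholders):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-data",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-raw-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw-logs/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},  # move cold data to archive storage
                ],
                "Expiration": {"Days": 730},                  # delete it once it is no longer needed
            }
        ]
    },
)
```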

Example Decisions:

  • Using Lambda or Fargate instead of over-provisioned EC2 instances
  • Implementing S3 lifecycle policies to move data to colder storage tiers
  • Using Graviton-based instances (more energy efficient)
  • Selecting regions with renewable energy commitments
  • Implementing caching to reduce repeated computations
  • Archiving or deleting unused data and resources
  • Choosing efficient data formats (Parquet instead of CSV)

Questions to Ask:

  • How do you minimize unused resources?
  • How do you optimize geographic placement based on sustainability goals?
  • How do you take advantage of more efficient hardware and software?
  • How do you reduce downstream sustainability impacts?

How to Use the Framework

1. Review Phase

Evaluate your architecture against the six pillars using the AWS Well-Architected Tool or manual review. For each pillar, ask the framework’s questions and identify areas where best practices aren’t followed.

2. Prioritize Improvements

Not every issue needs immediate attention. Prioritize based on:

  • Business impact: Which risks affect business outcomes most?
  • Current pain points: What’s causing operational problems today?
  • Compliance requirements: What must be addressed for regulatory reasons?
  • Technical debt: What will become harder to fix later?

3. Implement Changes

Make incremental improvements. Small, reversible changes reduce risk and allow learning. Document decisions and their rationale.

4. Measure and Iterate

Track metrics for each pillar. As the workload evolves and AWS releases new services, re-evaluate the architecture.


Trade-Offs Between Pillars

Every architectural decision involves trade-offs. Optimizing for one pillar often means compromising another.

Common Trade-Offs

  • Use managed services (RDS instead of a self-managed database on EC2)
    Gain: Operational Excellence (automated backups, patching, scaling); Security (built-in encryption and access controls); Reliability (Multi-AZ deployments)
    Trade-off: Cost (higher price than self-managed); Performance (less control over tuning)

  • Enable Multi-AZ deployments
    Gain: Reliability (survive AZ failures); Security (data replicated securely)
    Trade-off: Cost (pay for resources in multiple AZs); Performance (slight latency from synchronous replication)

  • Use serverless (Lambda) instead of always-on EC2
    Gain: Cost Optimization (pay only for execution time); Operational Excellence (no server management); Sustainability (no idle capacity)
    Trade-off: Performance (cold start latency); Reliability (execution time limits, 15 min max)

  • Implement caching (CloudFront, ElastiCache)
    Gain: Performance (lower latency, faster response times); Cost (reduced origin load)
    Trade-off: Complexity (cache invalidation strategies); Operational Excellence (more components to monitor)

  • Use Reserved Instances or Savings Plans
    Gain: Cost Optimization (40-75% savings vs. on-demand)
    Trade-off: Flexibility (committed capacity may not match changing needs)

  • Deploy to multiple regions
    Gain: Reliability (survive entire region failures); Performance (lower latency for global users)
    Trade-off: Cost (resources in multiple regions); Operational Excellence (more complex deployments and data synchronization)

Making Intentional Trade-Offs

The framework doesn’t prescribe solutions. Instead, it helps you make intentional trade-offs based on your specific requirements:

  1. Identify requirements: What does the business need? (SLAs, compliance, budget constraints)
  2. Evaluate options: How does each service or pattern align with the six pillars?
  3. Make explicit trade-offs: Document what you’re optimizing for and what you’re accepting as a compromise
  4. Revisit over time: As requirements change or new AWS services launch, re-evaluate decisions

Example: A startup might prioritize cost optimization and operational excellence (use managed services, avoid over-engineering) while accepting lower reliability (single region, minimal redundancy) initially. As the business grows and SLAs become critical, they shift toward reliability even if it increases cost.


Applying the Framework to Service Selection

The framework guides service selection by mapping requirements to pillars.

Example: Choosing a Database

Scenario: You need to store user profile data for a web application.

Options: RDS (managed relational), DynamoDB (managed NoSQL), self-managed database on EC2

Framework Analysis:

  • RDS (Aurora)
    Operational Excellence: ✅ Managed backups, patching, Multi-AZ; ⚠️ manual schema migrations
    Security: ✅ Encryption, IAM, VPC
    Reliability: ✅ Multi-AZ, automated failover, read replicas
    Performance: ✅ High performance for complex queries
    Cost: ⚠️ More expensive than DynamoDB for simple lookups
    Sustainability: ✅ Shared infrastructure

  • DynamoDB
    Operational Excellence: ✅ Fully managed, auto-scaling; ✅ no schema migrations
    Security: ✅ Encryption, IAM, VPC endpoints
    Reliability: ✅ Multi-AZ by default, global tables
    Performance: ✅ Single-digit millisecond latency for key-value; ⚠️ no complex joins
    Cost: ✅ Pay-per-request option; ✅ lower cost for simple access patterns
    Sustainability: ✅ Serverless, no idle capacity

  • Self-managed on EC2
    Operational Excellence: ❌ Manual backups, patching, scaling; ✅ full control
    Security: ⚠️ Must implement encryption and access controls yourself
    Reliability: ⚠️ Must design Multi-AZ yourself
    Performance: ✅ Full tuning control
    Cost: ⚠️ Lower base cost but high operational cost
    Sustainability: ❌ Must manage idle capacity

Decision:

  • If you need complex queries and joins: RDS
  • If you need simple key-value lookups with massive scale: DynamoDB (see the sketch below)
  • If you need specific database features not available in managed services: Self-managed on EC2 (but reconsider if the operational burden is worth it)
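
For the key-value case above, the entire read path is a single lookup by partition key, which is what keeps DynamoDB simple and fast at scale. A minimal sketch with boto3 (table and attribute names are assumptions):

```python
import boto3

# Hypothetical table with "user_id" as the partition key.
profiles = boto3.resource("dynamodb").Table("UserProfiles")

profiles.put_item(Item={"user_id": "u-123", "name": "Ada", "plan": "pro"})

# One key lookup, no joins: single-digit millisecond latency regardless of table size.
profile = profiles.get_item(Key={"user_id": "u-123"}).get("Item")
```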

Example: Choosing a Compute Service

Scenario: You need to process images uploaded by users.

Options: Lambda, ECS/Fargate, EC2

Framework Analysis:

  • Lambda
    Operational Excellence: ✅ No servers to manage; ⚠️ 15-minute execution limit
    Security: ✅ Isolated execution environments
    Reliability: ✅ Automatic scaling, built-in retry
    Performance: ⚠️ Cold starts; ⚠️ limited memory (10 GB max)
    Cost: ✅ Pay per invocation
    Use Case Fit: ✅ Good for short tasks (<15 min)

  • ECS/Fargate
    Operational Excellence: ✅ No EC2 management; ⚠️ must manage container images
    Security: ✅ VPC isolation, IAM task roles
    Reliability: ✅ Service auto-scaling
    Performance: ✅ No execution time limits
    Cost: ⚠️ Pay for running time, not invocations
    Use Case Fit: ✅ Good for long-running processes

  • EC2
    Operational Excellence: ❌ Manage instances and patching; ✅ full control
    Security: ⚠️ Must configure security groups and patching
    Reliability: ⚠️ Must implement auto-scaling and health checks
    Performance: ✅ Dedicated resources
    Cost: ⚠️ Always-on cost or scaling complexity
    Use Case Fit: ⚠️ Over-engineered for simple tasks

Decision:

  • If image processing takes <15 minutes and is event-driven: Lambda (best cost and operational simplicity; see the handler sketch below)
  • If image processing is long-running or requires more than 10GB memory: ECS/Fargate
  • If you need GPU processing or very specific instance configurations: EC2
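
For the event-driven Lambda option, a handler wired to S3 ObjectCreated notifications might look like the sketch below. The bucket layout and thumbnails/ prefix are assumptions, and the actual image transformation (e.g. with Pillow) is elided:

```python
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Invoked with a batch of S3 ObjectCreated notification records."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded in the event

        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        processed = original  # placeholder: resize/transform the image here (e.g. with Pillow)

        # Write the result under a separate prefix (assumed layout; scope the S3 notification
        # to exclude this prefix so the function does not re-trigger itself).
        s3.put_object(Bucket=bucket, Key=f"thumbnails/{key}", Body=processed)
```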

Key Takeaways

The framework is a lens for decision-making, not a compliance checklist.

  1. Every architectural decision involves trade-offs across the six pillars. There is no perfect solution, only intentional choices based on your specific requirements.

  2. Start with Operational Excellence and Security as foundational. Without these, the other pillars become difficult to achieve. Automate infrastructure management and enforce least-privilege access from day one.

  3. Use the framework to evaluate AWS service options. When choosing between RDS and DynamoDB, or Lambda and EC2, map each option to the six pillars and see which aligns best with your requirements.

  4. Prioritize the pillars based on business needs. A startup might prioritize Cost Optimization and Operational Excellence (lean operations, managed services). An enterprise might prioritize Security and Reliability (compliance, SLAs).

  5. Revisit architectural decisions as requirements evolve. What made sense at launch may not make sense at scale. As traffic grows, you might trade Cost Optimization for Performance Efficiency. As the team grows and gains operational capacity, you might trade some Operational Excellence (fewer managed services) for Cost Optimization (more self-managed infrastructure).

  6. Document trade-offs explicitly. Future teams need to understand why you chose Lambda over EC2, or DynamoDB over RDS. Without that context, they’ll assume incompetence rather than recognizing intentional trade-offs.

  7. Use the AWS Well-Architected Tool for structured reviews. The tool provides guided questions and generates reports highlighting risks.

The framework doesn’t tell you what to build. It helps you understand what you’re optimizing for and what you’re accepting as a compromise.
