AWS Well-Architected Framework


Table of Contents

  1. What is the AWS Well-Architected Framework
  2. The Six Pillars
  3. How to Use the Framework
  4. Trade-Offs Between Pillars
  5. Applying the Framework to Service Selection
  6. Key Takeaways

What is the AWS Well-Architected Framework

The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing designs that scale with your application needs. It describes key concepts, design principles, and architectural best practices for designing and operating workloads in the cloud.

Purpose

The framework helps architects:

  • Evaluate trade-offs between different architectural approaches
  • Make informed decisions about service selection and design patterns
  • Identify areas for improvement in existing architectures
  • Understand the business impact of architectural decisions

Not a Checklist

The framework provides guiding principles, not mandatory requirements. Every architectural decision involves trade-offs. The framework helps you understand what you’re optimizing for and what you’re sacrificing.


The Six Pillars

The framework organizes best practices into six pillars. Each pillar represents a different aspect of building cloud systems.

1. Operational Excellence

Definition: The ability to run and monitor systems to deliver business value and continually improve supporting processes and procedures.

Core Principles:

  • Perform operations as code
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure
  • Learn from all operational events and failures

What This Means in Practice:

Infrastructure as code, automated deployments, comprehensive logging and monitoring, and blameless post-mortems. You treat operations like software development: version controlled, tested, and continuously improved.
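
For example, a CloudWatch alarm can be defined in code so it is reviewable, repeatable, and version controlled rather than clicked together in the console. A minimal sketch with boto3 (alarm name, threshold, Auto Scaling group, and SNS topic ARN are placeholders; in practice the same definition would usually live in CloudFormation or Terraform):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on sustained high CPU across a (hypothetical) Auto Scaling group.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "orders-api-asg"}],
    Statistic="Average",
    Period=300,                                # 5-minute periods
    EvaluationPeriods=3,                       # must breach for 15 minutes before alarming
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```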

Example Decisions:

  • Using CloudFormation or Terraform instead of manual console clicks
  • Implementing CI/CD pipelines with automated testing
  • Deploying blue-green or canary releases instead of big-bang deployments
  • Building dashboards and alerts in CloudWatch
  • Using Step Functions for visible, auditable workflows

Questions to Ask:

  • How do you manage changes to your infrastructure?
  • How do you monitor your workload to ensure it’s operating as expected?
  • How do you respond to unplanned operational events?
  • How do you evolve operations over time?

2. Security

Definition: The ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Core Principles:

  • Implement a strong identity foundation (principle of least privilege)
  • Enable traceability (logging and auditing)
  • Apply security at all layers (defense in depth)
  • Automate security best practices
  • Protect data in transit and at rest
  • Keep people away from data (reduce manual access)
  • Prepare for security events

What This Means in Practice:

Every AWS service interaction goes through IAM. Data is encrypted at rest and in transit. CloudTrail logs every API call. Security groups and NACLs provide network-level protection. Secrets never appear in code or logs.
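
For instance, rather than embedding a database password in code or configuration, an application can fetch it at runtime from Secrets Manager under an IAM role that allows secretsmanager:GetSecretValue. A minimal sketch with boto3 (the secret name and its JSON shape are assumptions):

```python
import json

import boto3

def get_db_credentials(secret_id: str = "prod/orders/db") -> dict:
    """Fetch credentials at runtime so nothing sensitive lives in code, config, or logs."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret is stored as a JSON string such as {"username": "...", "password": "..."}.
    return json.loads(response["SecretString"])

creds = get_db_credentials()
# Hand creds["username"] / creds["password"] to the database driver; never print or log them.
```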

Example Decisions:

  • Using IAM Identity Center for workforce access (not IAM users with access keys)
  • Using IAM roles for workloads (EC2, Lambda, ECS) instead of long-lived access keys
  • Enabling MFA for human users (mandatory for root users as of 2024)
  • Encrypting S3 buckets with KMS
  • Storing database credentials in Secrets Manager
  • Using VPC security groups, NACLs, and PrivateLink for defense in depth
  • Enabling GuardDuty for threat detection
  • Implementing WAF rules on CloudFront and ALB

Questions to Ask:

  • How do you control access to your AWS resources?
  • How do you protect your data at rest and in transit?
  • How do you detect and respond to security events?
  • How do you keep your workload secure over time?

3. Reliability

Definition: The ability of a workload to perform its intended function correctly and consistently when expected, including recovering from failures.

Core Principles:

  • Automatically recover from failure
  • Test recovery procedures
  • Scale horizontally to increase aggregate workload availability
  • Stop guessing capacity (use auto-scaling)
  • Manage change through automation

What This Means in Practice:

Systems are designed to survive failures of individual components. Multi-AZ deployments prevent single availability zone failures from causing outages. Auto Scaling handles traffic spikes. Automated backups enable point-in-time recovery. Health checks automatically replace failed instances.
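
As a sketch of what this looks like in code, the boto3 call below provisions a PostgreSQL instance with a synchronous standby in a second Availability Zone and automated backups. Identifiers, instance class, and sizes are placeholders, and in practice this would typically be declared in CloudFormation or Terraform:

```python
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",     # placeholder identifier
    Engine="postgres",
    DBInstanceClass="db.r6g.large",       # placeholder instance class
    AllocatedStorage=100,                 # GiB
    MultiAZ=True,                         # synchronous standby in a second AZ, automatic failover
    BackupRetentionPeriod=7,              # keep automated daily backups for 7 days
    StorageEncrypted=True,
    MasterUsername="app_admin",
    ManageMasterUserPassword=True,        # RDS generates the password and stores it in Secrets Manager
)
```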

Example Decisions:

  • Deploying RDS with Multi-AZ enabled
  • Using Auto Scaling groups across multiple availability zones
  • Implementing health checks and automatic failover with ALB
  • Taking automated backups with retention policies
  • Using Route 53 health checks and failover routing
  • Designing for eventual consistency where appropriate
  • Testing chaos engineering scenarios (intentionally breaking things)

Questions to Ask:

  • How do you handle failures in your workload?
  • How do you design your workload to meet availability targets?
  • How do you test reliability?
  • How do you recover from failures?

4. Performance Efficiency

Definition: The ability to use computing resources efficiently to meet system requirements and maintain efficiency as demand changes and technologies evolve.

Core Principles:

  • Democratize advanced technologies (use managed services)
  • Go global in minutes (deploy to multiple regions)
  • Use serverless architectures
  • Experiment more often
  • Consider mechanical sympathy (understand how services work)

What This Means in Practice:

Choose the right compute type for the workload. Use caching to reduce latency. Leverage CDNs for global content delivery. Monitor performance metrics and optimize based on data, not assumptions. Take advantage of managed services that automatically scale and optimize.
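
As a minimal illustration of caching, the cache-aside sketch below serves repeated reads from memory and only falls back to the datastore on a miss. A plain dictionary keeps the example self-contained; in production the cache would usually be ElastiCache or DynamoDB DAX so it is shared across instances, and the fetch function here is a hypothetical stand-in:

```python
import time

def fetch_product_from_database(product_id: str) -> dict:
    """Stand-in for the real query against RDS or DynamoDB."""
    time.sleep(0.05)  # simulate a ~50 ms round trip
    return {"product_id": product_id, "name": "example"}

_cache: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 300  # expire cached entries after 5 minutes

def get_product(product_id: str) -> dict:
    """Cache-aside read: serve hot keys from the cache, refresh from the datastore on a miss."""
    now = time.time()
    hit = _cache.get(product_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                       # cache hit: no database round trip
    product = fetch_product_from_database(product_id)
    _cache[product_id] = (now, product)
    return product
```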

Example Decisions:

  • Using Lambda for event-driven workloads instead of always-on EC2 instances
  • Choosing the right EC2 instance type (compute-optimized, memory-optimized, etc.)
  • Implementing CloudFront for global content delivery
  • Using DynamoDB DAX or ElastiCache for caching
  • Selecting RDS vs. DynamoDB based on access patterns
  • Using S3 Transfer Acceleration for large file uploads
  • Enabling Aurora Auto Scaling for read replicas

Questions to Ask:

  • How do you select the best performing architecture?
  • How do you monitor performance over time?
  • How do you use advanced technologies to improve performance?
  • How do you evolve your workload to take advantage of new AWS services?

5. Cost Optimization

Definition: The ability to run systems to deliver business value at the lowest price point while avoiding unnecessary costs.

Core Principles:

  • Implement cloud financial management
  • Adopt a consumption model (pay only for what you use)
  • Measure overall efficiency
  • Stop spending money on undifferentiated heavy lifting (use managed services)
  • Analyze and attribute expenditure

What This Means in Practice:

Right-size resources based on actual usage. Use Reserved Instances or Savings Plans for predictable workloads. Shut down non-production environments outside business hours. Leverage spot instances for fault-tolerant workloads. Monitor costs with AWS Cost Explorer and set budgets.
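
As a sketch of measuring and attributing spend, the boto3 call below pulls one month of unblended cost grouped by service from the Cost Explorer API (the date range is a placeholder, and each Cost Explorer API request incurs a small per-request charge):

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},  # placeholder month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print spend per service for the month, highest first.
groups = response["ResultsByTime"][0]["Groups"]
for group in sorted(groups, key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True):
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{group['Keys'][0]}: ${amount:,.2f}")
```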

Example Decisions:

  • Purchasing Savings Plans for steady-state workloads
  • Using Lambda instead of always-on EC2 for infrequent tasks
  • Implementing auto-scaling to scale down during low traffic
  • Moving infrequently accessed data to S3 Glacier
  • Using spot instances for batch processing
  • Right-sizing EC2 instances based on CloudWatch metrics
  • Enabling S3 Intelligent-Tiering for automated cost optimization
  • Deleting unused EBS volumes and snapshots

Questions to Ask:

  • How do you govern usage and manage costs?
  • How do you monitor and control spending?
  • How do you select the most cost-effective resources?
  • How do you optimize over time?

6. Sustainability

Definition: The ability to continually reduce the environmental impact of running a workload by lowering energy consumption and increasing efficiency across all of its components.

Core Principles:

  • Understand your impact (measure carbon footprint)
  • Establish sustainability goals
  • Maximize utilization (reduce idle resources)
  • Anticipate and adopt more efficient hardware and software
  • Use managed services (AWS optimizes infrastructure efficiency)
  • Reduce downstream impact (efficient data transfer and storage)

What This Means in Practice:

Choose regions powered by renewable energy. Use auto-scaling to avoid idle capacity. Leverage serverless and managed services that share infrastructure efficiently. Optimize data storage and transfer to reduce energy consumption.
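
For example, an S3 lifecycle rule can move cold objects to denser archive storage and eventually delete data nobody reads any more. A sketch with boto3 (bucket name, prefix, and retention periods are placeholders):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-data",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-raw-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw-logs/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},  # move cold data to archive storage
                ],
                "Expiration": {"Days": 730},                  # delete it once it is no longer needed
            }
        ]
    },
)
```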

Example Decisions:

  • Using Lambda or Fargate instead of over-provisioned EC2 instances
  • Implementing S3 lifecycle policies to move data to colder storage tiers
  • Using Graviton-based instances (more energy efficient)
  • Selecting regions with renewable energy commitments
  • Implementing caching to reduce repeated computations
  • Archiving or deleting unused data and resources
  • Choosing efficient data formats (Parquet instead of CSV)

Questions to Ask:

  • How do you minimize unused resources?
  • How do you optimize geographic placement based on sustainability goals?
  • How do you take advantage of more efficient hardware and software?
  • How do you reduce downstream sustainability impacts?

How to Use the Framework

1. Review Phase

Evaluate your architecture against the six pillars using the AWS Well-Architected Tool or manual review. For each pillar, ask the framework’s questions and identify areas where best practices aren’t followed.

2. Prioritize Improvements

Not every issue needs immediate attention. Prioritize based on:

  • Business impact: Which risks affect business outcomes most?
  • Current pain points: What’s causing operational problems today?
  • Compliance requirements: What must be addressed for regulatory reasons?
  • Technical debt: What will become harder to fix later?

3. Implement Changes

Make incremental improvements. Small, reversible changes reduce risk and allow learning. Document decisions and their rationale.

4. Measure and Iterate

Track metrics for each pillar. As the workload evolves and AWS releases new services, re-evaluate the architecture.


Trade-Offs Between Pillars

Every architectural decision involves trade-offs. Optimizing for one pillar often means compromising another.

Common Trade-Offs

  • Use managed services (RDS instead of a self-managed database on EC2)
    Gain: Operational Excellence (automated backups, patching, scaling); Security (built-in encryption and access controls); Reliability (Multi-AZ deployments)
    Trade-off: Cost (higher price than self-managed); Performance (less control over tuning)

  • Enable Multi-AZ deployments
    Gain: Reliability (survive AZ failures); Security (data replicated securely)
    Trade-off: Cost (pay for resources in multiple AZs); Performance (slight latency from synchronous replication)

  • Use serverless (Lambda) instead of always-on EC2
    Gain: Cost Optimization (pay only for execution time); Operational Excellence (no server management); Sustainability (no idle capacity)
    Trade-off: Performance (cold start latency); Reliability (execution time limits, 15 min max)

  • Implement caching (CloudFront, ElastiCache)
    Gain: Performance (lower latency, faster response times); Cost (reduced origin load)
    Trade-off: Complexity (cache invalidation strategies); Operational Excellence (more components to monitor)

  • Use Reserved Instances or Savings Plans
    Gain: Cost Optimization (40-75% savings vs. on-demand)
    Trade-off: Flexibility (committed capacity may not match changing needs)

  • Deploy to multiple regions
    Gain: Reliability (survive entire region failures); Performance (lower latency for global users)
    Trade-off: Cost (resources in multiple regions); Operational Excellence (more complex deployments and data synchronization)

Making Intentional Trade-Offs

The framework doesn’t prescribe solutions. Instead, it helps you make intentional trade-offs based on your specific requirements:

  1. Identify requirements: What does the business need? (SLAs, compliance, budget constraints)
  2. Evaluate options: How does each service or pattern align with the six pillars?
  3. Make explicit trade-offs: Document what you’re optimizing for and what you’re accepting as a compromise
  4. Revisit over time: As requirements change or new AWS services launch, re-evaluate decisions

Example: A startup might prioritize cost optimization and operational excellence (use managed services, avoid over-engineering) while accepting lower reliability (single region, minimal redundancy) initially. As the business grows and SLAs become critical, they shift toward reliability even if it increases cost.


Applying the Framework to Service Selection

The framework guides service selection by mapping requirements to pillars.

Example: Choosing a Database

Scenario: You need to store user profile data for a web application.

Options: RDS (managed relational), DynamoDB (managed NoSQL), self-managed database on EC2

Framework Analysis:

  • RDS (Aurora)
    Operational Excellence: ✅ Managed backups, patching, Multi-AZ; ⚠️ manual schema migrations
    Security: ✅ Encryption, IAM, VPC
    Reliability: ✅ Multi-AZ, automated failover, read replicas
    Performance: ✅ High performance for complex queries
    Cost: ⚠️ More expensive than DynamoDB for simple lookups
    Sustainability: ✅ Shared infrastructure

  • DynamoDB
    Operational Excellence: ✅ Fully managed, auto-scaling; ✅ no schema migrations
    Security: ✅ Encryption, IAM, VPC endpoints
    Reliability: ✅ Multi-AZ by default, global tables
    Performance: ✅ Single-digit millisecond latency for key-value; ⚠️ no complex joins
    Cost: ✅ Pay-per-request option; ✅ lower cost for simple access patterns
    Sustainability: ✅ Serverless, no idle capacity

  • Self-managed on EC2
    Operational Excellence: ❌ Manual backups, patching, scaling; ✅ full control
    Security: ⚠️ Must implement encryption and access controls yourself
    Reliability: ⚠️ Must design Multi-AZ yourself
    Performance: ✅ Full tuning control
    Cost: ⚠️ Lower base cost but high operational cost
    Sustainability: ❌ Must manage idle capacity

Decision:

  • If you need complex queries and joins: RDS
  • If you need simple key-value lookups with massive scale: DynamoDB (see the sketch below)
  • If you need specific database features not available in managed services: Self-managed on EC2 (but reconsider if the operational burden is worth it)
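
For the key-value case above, the entire read path is a single lookup by partition key, which is what keeps DynamoDB simple and fast at scale. A minimal sketch with boto3 (table and attribute names are assumptions):

```python
import boto3

# Hypothetical table with "user_id" as the partition key.
profiles = boto3.resource("dynamodb").Table("UserProfiles")

profiles.put_item(Item={"user_id": "u-123", "name": "Ada", "plan": "pro"})

# One key lookup, no joins: single-digit millisecond latency regardless of table size.
profile = profiles.get_item(Key={"user_id": "u-123"}).get("Item")
```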

Example: Choosing a Compute Service

Scenario: You need to process images uploaded by users.

Options: Lambda, ECS/Fargate, EC2

Framework Analysis:

  • Lambda
    Operational Excellence: ✅ No servers to manage; ⚠️ 15-minute execution limit
    Security: ✅ Isolated execution environments
    Reliability: ✅ Automatic scaling, built-in retry
    Performance: ⚠️ Cold starts; ⚠️ limited memory (10 GB max)
    Cost: ✅ Pay per invocation
    Use Case Fit: ✅ Good for short tasks (<15 min)

  • ECS/Fargate
    Operational Excellence: ✅ No EC2 management; ⚠️ must manage container images
    Security: ✅ VPC isolation, IAM task roles
    Reliability: ✅ Service auto-scaling
    Performance: ✅ No execution time limits
    Cost: ⚠️ Pay for running time, not invocations
    Use Case Fit: ✅ Good for long-running processes

  • EC2
    Operational Excellence: ❌ Manage instances and patching; ✅ full control
    Security: ⚠️ Must configure security groups and patching
    Reliability: ⚠️ Must implement auto-scaling and health checks
    Performance: ✅ Dedicated resources
    Cost: ⚠️ Always-on cost or scaling complexity
    Use Case Fit: ⚠️ Over-engineered for simple tasks

Decision:

  • If image processing takes <15 minutes and is event-driven: Lambda (best cost and operational simplicity; see the handler sketch below)
  • If image processing is long-running or requires more than 10GB memory: ECS/Fargate
  • If you need GPU processing or very specific instance configurations: EC2
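
For the event-driven Lambda option, a handler wired to S3 ObjectCreated notifications might look like the sketch below. The bucket layout and thumbnails/ prefix are assumptions, and the actual image transformation (e.g. with Pillow) is elided:

```python
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Invoked with a batch of S3 ObjectCreated notification records."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded in the event

        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        processed = original  # placeholder: resize/transform the image here (e.g. with Pillow)

        # Write the result under a separate prefix (assumed layout; scope the S3 notification
        # to exclude this prefix so the function does not re-trigger itself).
        s3.put_object(Bucket=bucket, Key=f"thumbnails/{key}", Body=processed)
```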

Key Takeaways

The framework is a lens for decision-making, not a compliance checklist.

  1. Every architectural decision involves trade-offs across the six pillars. There is no perfect solution, only intentional choices based on your specific requirements.

  2. Start with Operational Excellence and Security as foundational. Without these, the other pillars become difficult to achieve. Automate infrastructure management and enforce least-privilege access from day one.

  3. Use the framework to evaluate AWS service options. When choosing between RDS and DynamoDB, or Lambda and EC2, map each option to the six pillars and see which aligns best with your requirements.

  4. Prioritize the pillars based on business needs. A startup might prioritize Cost Optimization and Operational Excellence (lean operations, managed services). An enterprise might prioritize Security and Reliability (compliance, SLAs).

  5. Revisit architectural decisions as requirements evolve. What made sense at launch may not make sense at scale. As traffic grows, you might trade Cost Optimization for Performance Efficiency. As the team grows and gains operational capacity, you might trade some Operational Excellence (fewer managed services) for Cost Optimization (more self-managed infrastructure).

  6. Document trade-offs explicitly. Future teams need to understand why you chose Lambda over EC2, or DynamoDB over RDS. Without that context, they’ll assume incompetence rather than recognizing intentional trade-offs.

  7. Use the AWS Well-Architected Tool for structured reviews. The tool provides guided questions and generates reports highlighting risks.

The framework doesn’t tell you what to build. It helps you understand what you’re optimizing for and what you’re accepting as a compromise.
