AWS Container Services: ECS, EKS, and Fargate
Table of Contents
- What Problems Containers Solve
- Amazon ECS (Elastic Container Service)
- AWS Fargate
- Amazon EKS (Elastic Kubernetes Service)
- Container Networking
- Storage for Containers
- Security Best Practices
- Observability
- Deployment Patterns
- Cost Optimization
- Service Selection Framework
- Common Pitfalls
- Key Takeaways
What Problems Containers Solve
Containers address several key challenges compared to traditional EC2 instances and serverless Lambda functions.
vs. EC2 Instances
Consistency across environments: Containers package applications with all dependencies, eliminating “works on my machine” problems. The same container runs identically in development, staging, and production.
Resource efficiency: Multiple containers share the host OS kernel, using less memory and CPU than separate virtual machines. A single EC2 instance can run dozens of containers.
Faster deployment: Container startup takes seconds vs. minutes for EC2 instances. Immutable container images enable rapid rollbacks.
Portability: Containers abstract infrastructure details. Move workloads between ECS, EKS, on-premises, and other cloud providers without code changes.
vs. Lambda
Longer execution times: Lambda has a 15-minute maximum timeout. Containers run indefinitely for long-running services.
Language/framework flexibility: Lambda supports specific runtimes. Containers support any language or framework that runs on Linux/Windows.
State management: Containers handle stateful applications and long-running services better than Lambda’s ephemeral execution model.
Larger package sizes: Lambda limits deployment packages to 250 MB unzipped. Container images can be several GB.
When to Use Containers
Choose containers (ECS/EKS) when:
- Running long-running applications (web servers, microservices, APIs)
- Need specific languages/frameworks not well-supported by Lambda
- Application runs longer than 15 minutes
- Require full control over runtime environment
- Migrating existing Docker-based applications
- Running stateful applications (databases, message queues)
Choose Lambda when:
- Event-driven workloads (S3 uploads, API Gateway requests)
- Execution time under 15 minutes
- Sporadic or variable traffic patterns
- Want zero infrastructure management
Choose EC2 when:
- Need full infrastructure control (custom OS, kernel modules)
- Running specialized databases or data stores
- GPU-intensive workloads requiring specific drivers
- Migrating legacy applications requiring specific server configurations
Amazon ECS (Elastic Container Service)
ECS Architecture
ECS is AWS’s native container orchestration service, designed for simplicity and deep AWS integration.
Core components:
Cluster: Logical grouping of services and tasks. Region-specific. Can contain EC2 instances, Fargate capacity, or both.
Task Definition: JSON blueprint describing 1-10 containers that comprise your application. Specifies container images, CPU/memory requirements, port mappings, network modes, IAM roles, environment variables, and secrets.
Task: Instantiation of a task definition—smallest unit of execution in ECS. Can contain one or more containers running together on the same host.
Service: Manages desired number of tasks, ensuring they keep running. Handles scheduling, load balancer integration, auto scaling, and rolling deployments. Use services for long-running applications.
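For orientation, here is a minimal task definition sketch; the image URI, role ARNs, and names are placeholders, and real definitions typically add logging, environment variables, and health checks.
{
  "family": "my-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::account:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::account:role/myAppTaskRole",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "account.dkr.ecr.us-east-1.amazonaws.com/my-app:1.0.0",
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
      "essential": true
    }
  ]
}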
Launch Types: EC2 vs. Fargate
ECS on EC2:
- Control: Full control over instance types, OS, networking, storage
- Cost: Most cost-effective for steady-state workloads (up to 3x cheaper than Fargate with Reserved Instances)
- Responsibility: You manage patching, scaling, capacity planning
- Use case: High-volume production workloads with predictable traffic
ECS on Fargate:
- Management: AWS handles all infrastructure—no servers to manage
- Cost: Pay only for vCPU and memory used by tasks (per-second billing)
- Pricing: $0.04048 per vCPU-hour, $0.004492 per GB-memory-hour (us-east-1, 2024)
- Use case: Variable workloads, dev/test environments, microservices
ECS Managed Instances (2024 Recommendation):
- AWS fully manages EC2 instances (provisioning, patching, scaling)
- Best combination of performance, cost optimization, and operational simplicity
- Recommended for new workloads requiring EC2 launch type
Service Discovery and Load Balancing
AWS Cloud Map Service Discovery:
- Defines custom DNS names for services
- Maintains updated locations of dynamically changing resources
- DNS-based discovery with configurable TTL
- Simpler but slower failover (depends on DNS TTL)
ECS Service Connect (2024 Feature):
- Built on Cloud Map with Envoy-based sidecar proxy
- API-based discovery (faster than DNS)
- Automatic failover detection and traffic routing
- Built-in observability (logs, metrics)
- Limitation: Cannot use CodeDeploy (no blue/green deployments with CodeDeploy)
- Cost: Additional resources for sidecar containers
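As a rough sketch, Service Connect is configured per service; the fragment below (namespace, port name, and DNS alias are placeholders) would sit in the create-service request alongside the launch type and network settings.
"serviceConnectConfiguration": {
  "enabled": true,
  "namespace": "internal",
  "services": [
    {
      "portName": "http",
      "clientAliases": [{ "port": 80, "dnsName": "orders" }]
    }
  ]
}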
Load Balancer Integration:
- Application Load Balancer (ALB): Best for HTTP/HTTPS, advanced routing (path-based, host-based, header-based)
- Network Load Balancer (NLB): Best for TCP/UDP, ultra-low latency, high throughput, static IPs
- awsvpc network mode: Use “ip” target type when tasks have elastic network interfaces
Auto Scaling
Target Tracking Scaling (Recommended):
- Set target value for a metric (e.g., 70% CPU utilization)
- ECS automatically creates CloudWatch alarms and adjusts task count
- Metrics: CPU utilization, memory utilization, ALB request count per target
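A hedged sketch of a target tracking policy, expressed as the Application Auto Scaling PutScalingPolicy input for an ECS service (cluster and service names are placeholders; cooldowns are illustrative):
{
  "PolicyName": "cpu-target-tracking",
  "ServiceNamespace": "ecs",
  "ResourceId": "service/my-cluster/my-service",
  "ScalableDimension": "ecs:service:DesiredCount",
  "PolicyType": "TargetTrackingScaling",
  "TargetTrackingScalingPolicyConfiguration": {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 120
  }
}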
Step Scaling:
- Define specific thresholds and scaling actions
- React quickly to demand spikes
- Multiple steps for different alarm severity
- Example: Add 2 tasks at 70% CPU, add 5 tasks at 85% CPU
Predictive Scaling (November 2024):
- Uses machine learning to analyze historical patterns
- Scales proactively before demand spikes
- Combines with target tracking for real-time adjustments
When to Use ECS vs. EKS
Choose ECS when:
- Team has little/no Kubernetes experience
- Deploying AWS-centric workloads
- Want minimal operational overhead
- Running simple to moderate complexity microservices
- Prioritizing ease of use and deep AWS integration
- Cost efficiency critical (no control plane costs)
Choose EKS when:
- Team already has Kubernetes expertise
- Need multi-cloud or hybrid deployments
- Require fine-grained control over orchestration
- Want access to Kubernetes ecosystem (Helm, operators, CNCF tools)
- Portability is important
- Complex distributed systems requiring advanced orchestration
Not a binary decision: Both services can coexist in the same AWS account. Containers ensure portability between them.
AWS Fargate
Serverless Container Execution
Fargate is a serverless compute engine that runs containers without managing servers. You define CPU, memory, and networking requirements; AWS handles provisioning, scaling, and patching.
Fargate vs. EC2 Launch Type
| Factor | Fargate | ECS on EC2 |
|---|---|---|
| Management | Zero infrastructure management | Manage instances, patching, capacity |
| Cost (steady-state) | 3-9x more than EC2 Reserved Instances | Most cost-effective with Reserved Instances |
| Cost (variable) | Pay-per-second for task duration | Pay for instances even if underutilized |
| Startup time | ~30-60 seconds | Seconds (if instances already running) |
| Control | Limited (AWS-managed) | Full control over instances |
| Best for | Variable workloads, batch jobs, dev/test | High-volume production, GPU workloads |
Real-world cost comparison:
- Fargate: $0.04048/vCPU-hour + $0.004492/GB-hour (us-east-1)
- EC2 on-demand: ~3x cheaper
- EC2 Reserved Instances (1-year): ~6x cheaper than Fargate
- EC2 Reserved Instances (3-year): ~9x cheaper than Fargate
Fargate Pricing and Resource Configurations
Pricing (2024):
- vCPU: $0.04048 per vCPU-hour (us-east-1)
- Memory: $0.004492 per GB-hour
- Storage: 20 GB ephemeral storage included; $0.000111 per GB-hour for additional storage (up to 200 GB)
- Billing: Per-second billing with 1-minute minimum
Valid CPU/Memory Configurations:
- 0.25 vCPU: 0.5 GB, 1 GB, 2 GB memory
- 0.5 vCPU: 1 GB to 4 GB (increments of 1 GB)
- 1 vCPU: 2 GB to 8 GB
- 2 vCPU: 4 GB to 16 GB
- 4 vCPU: 8 GB to 30 GB
- 8 vCPU: 16 GB to 60 GB
- 16 vCPU: 32 GB to 120 GB
Fargate Spot
Discount: Up to 70% off Fargate on-demand pricing
Interruption: 2-minute warning before termination when AWS needs capacity back
Availability: Capacity not guaranteed
Use cases:
- Fault-tolerant workloads
- Batch processing
- Stateless services with built-in resilience
- CI/CD pipelines
Combining optimizations:
- Graviton + Spot: Up to 76% savings (20% from Graviton, 70% from Spot compounded)
- Graviton pricing: ~20% cheaper than x86 (e.g., eu-west-1: $0.03238 vs. $0.04048 per vCPU-hour)
- Graviton + Spot announcement: September 2024—Fargate Spot now supports Arm-based Graviton processors
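To run a Fargate task on Graviton, the task definition declares an ARM64 runtime platform; a fragment with illustrative sizes (1 vCPU, 2 GB, 50 GB ephemeral storage):
"cpu": "1024",
"memory": "2048",
"runtimePlatform": {
  "cpuArchitecture": "ARM64",
  "operatingSystemFamily": "LINUX"
},
"ephemeralStorage": { "sizeInGiB": 50 }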
Amazon EKS (Elastic Kubernetes Service)
Kubernetes Fundamentals on AWS
EKS runs upstream Kubernetes, ensuring compatibility with standard Kubernetes tooling and APIs. AWS manages the Kubernetes control plane (high availability, patching, upgrades) across multiple Availability Zones.
EKS Architecture
Control Plane (AWS-Managed):
- Kubernetes API server, etcd, scheduler, controller manager
- Automatically scaled and distributed across 3 AZs
- AWS handles patching, upgrades, high availability
- Cost: $0.10 per cluster-hour (~$73/month per cluster)
Worker Nodes (Customer-Managed Options):
Self-Managed Nodes:
- Full control over EC2 instances
- Manual scaling, patching, upgrades
- Most flexible but highest operational burden
Managed Node Groups (Recommended):
- AWS handles provisioning, scaling, patching
- No extra cost (only pay for EC2 instances)
- Automated updates with single operation
- Integrates with Auto Scaling Groups
Fargate:
- Serverless compute for pods
- No node management
- Per-pod pricing
- Limited features (no DaemonSets, no hostPort)
Karpenter (Advanced):
- Group-less autoscaling—works directly with EC2 Fleet API
- Responds to workload demands in under 1 minute
- Optimizes instance selection based on pod requirements
- More flexible and faster than Cluster Autoscaler
EKS Auto Mode (December 2024):
- Fully automates Kubernetes cluster management
- Handles compute, storage, networking with a single click
- Built on Karpenter
- One-click migration from Managed Node Groups or Fargate
EKS vs. ECS Decision Framework
| Factor | ECS | EKS |
|---|---|---|
| Learning curve | Low | High (requires Kubernetes knowledge) |
| Operational complexity | Minimal | Moderate to high |
| Control plane cost | Free | $73/month per cluster |
| AWS integration | Deep (native AWS service) | Standard Kubernetes integration |
| Ecosystem | Limited | Rich (Helm, operators, CNCF tools) |
| Portability | AWS-only | Multi-cloud, hybrid, on-premises |
| Use case | AWS-centric microservices | Complex distributed systems, multi-cloud |
When Kubernetes Complexity is Justified
Use EKS when:
- Existing Kubernetes expertise in the team
- Need for multi-cloud or hybrid deployments (EKS Hybrid Nodes GA December 2024)
- Require advanced orchestration features (custom controllers, operators, StatefulSets)
- Want vibrant ecosystem and community support (Helm, Prometheus, Istio)
- Migrating from on-premises Kubernetes
- Running complex stateful applications requiring persistent volumes
Example: A company running 100+ microservices with complex service mesh requirements, custom operators, and plans to migrate workloads to on-premises data centers benefits from EKS. A startup deploying 5 microservices exclusively on AWS is better served by ECS.
Container Networking
VPC Networking Modes
awsvpc (Recommended):
- Each task/pod gets its own elastic network interface (ENI)
- Tasks have their own private IP address
- Full VPC networking features (security groups, NACLs)
- Required for Fargate
- Limitation: ENI limits per instance (e.g., an m5.large supports 3 ENIs by default, or up to 10 task ENIs with ENI trunking enabled)
bridge (Docker Default):
- Uses Docker’s virtual network bridge
- Port mapping required (host port → container port)
- Reduced security isolation
- Not available on Fargate
host:
- Container uses host’s network directly
- No port mapping needed
- Least isolation
- Not available on Fargate
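With awsvpc mode, the service or run-task request supplies the subnets and security groups attached to each task ENI; a sketch with placeholder IDs:
"networkConfiguration": {
  "awsvpcConfiguration": {
    "subnets": ["subnet-0abc1234", "subnet-0def5678"],
    "securityGroups": ["sg-0123456789abcdef0"],
    "assignPublicIp": "DISABLED"
  }
}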
Service Mesh
ECS Service Connect (2024):
- Managed Envoy sidecar
- Faster failover than DNS-based service discovery
- Observability built-in (CloudWatch Logs, metrics)
- Only for ECS-to-ECS communication
- Incompatible with blue/green deployments using CodeDeploy
Amazon VPC Lattice (general-purpose service mesh, 2024):
- Eliminates sidecar proxies
- Works across ECS, EKS, Lambda, EC2
- Simplified application networking with consistent connectivity, security, and monitoring
- Preferred for cross-service communication
AWS App Mesh (Legacy):
- New onboarding stopped September 2024
- Migration paths: ECS Service Connect (ECS) or VPC Lattice (general)
Load Balancing
Application Load Balancer (ALB):
- Layer 7 (HTTP/HTTPS)
- Advanced routing (path-based, host-based, query string, header-based)
- WebSocket support
- SSL/TLS termination
- Native integration with ECS/EKS
Network Load Balancer (NLB):
- Layer 4 (TCP/UDP)
- Ultra-low latency (microseconds)
- Static IP addresses
- Millions of requests per second
- Preserves source IP
Multiple Target Groups (2024):
- ECS services can attach to multiple target groups
- Example: Internal NLB for private traffic + internet-facing ALB for public traffic
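A sketch of a create-service fragment registering the same container with two target groups (ARNs and names are placeholders):
"loadBalancers": [
  {
    "targetGroupArn": "arn:aws:elasticloadbalancing:region:account:targetgroup/public-alb-tg/abc123",
    "containerName": "web",
    "containerPort": 8080
  },
  {
    "targetGroupArn": "arn:aws:elasticloadbalancing:region:account:targetgroup/internal-nlb-tg/def456",
    "containerName": "web",
    "containerPort": 8080
  }
]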
Storage for Containers
Ephemeral Storage
ECS on Fargate:
- Default: 20 GB included (free)
- Maximum: 200 GB configurable
- Cost: $0.000111 per GB-hour for additional storage
- Encryption: AES-256 for tasks launched on platform version 1.4.0+ (May 28, 2020 or later)
EKS on Fargate:
- Default: 20 GB
- Maximum: 175 GB per pod
ECS on EC2:
- Depends on instance storage (typically 10-30 GB root volume)
EBS Volumes for ECS Tasks
Major 2024 Update: ECS now supports native EBS volume integration (announced January 2024).
Use cases:
- Data-intensive workloads requiring high performance, low latency
- Block storage within a single Availability Zone
- Applications needing persistent storage that doesn’t span tasks
Availability: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Stockholm)
EFS for Shared Persistent Storage
Amazon EFS (Elastic File System):
- Use case: Applications spanning many tasks needing concurrent access
- Availability: Multi-AZ Regional availability
- Access modes: ReadWriteMany (multiple pods/tasks can mount simultaneously)
- Supported on: ECS (EC2 and Fargate), EKS
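A sketch of an EFS mount in an ECS task definition (file system ID and paths are placeholders): the volume is declared at the task level, and an abbreviated container definition references it with a mount point.
"volumes": [
  {
    "name": "shared-data",
    "efsVolumeConfiguration": {
      "fileSystemId": "fs-0abc123456789def0",
      "transitEncryption": "ENABLED"
    }
  }
],
"containerDefinitions": [
  {
    "name": "web",
    "mountPoints": [
      { "sourceVolume": "shared-data", "containerPath": "/mnt/shared" }
    ]
  }
]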
EBS vs. EFS:
- EBS: Single-AZ, ReadWriteOnce, lower latency, higher IOPS
- EFS: Multi-AZ, ReadWriteMany, regional availability, shared access
Best practice for EKS: Deploy Amazon EBS CSI driver or Amazon EFS CSI driver via EKS add-ons for security and efficiency.
Security Best Practices
IAM Roles for Tasks and Pods
ECS Task Roles:
- Task Execution Role: Grants ECS agent permission to pull images from ECR and write logs to CloudWatch
- Task IAM Role: Grants application code access to AWS services
- Best practice: Separate roles for execution vs. application; apply least privilege
EKS IRSA (IAM Roles for Service Accounts):
- Assigns IAM roles to Kubernetes service accounts
- Pod-level permissions without sharing credentials
- Leverages AWS STS for temporary credentials (auto-rotated)
- 2024 Update: Continues to be supported alongside EKS Pod Identity
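Under the hood, IRSA lets an IAM role trust the cluster's OIDC provider and restricts it to one Kubernetes service account; a sketch of the trust policy (account ID, OIDC issuer ID, namespace, and service account name are placeholders):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234567890"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234567890:sub": "system:serviceaccount:my-namespace:my-service-account"
        }
      }
    }
  ]
}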
EKS Pod Identity (2023+):
- Assigns IAM roles directly to pods (decoupled from service accounts)
- Simpler trust management than IRSA
- More fine-grained control
Security warning: Pods can still inherit instance profile permissions. Always block access to instance metadata when using IRSA or Pod Identity.
Secrets Management
AWS Secrets Manager and Parameter Store:
- Store sensitive data (database passwords, API keys)
- Reference in task definitions via ARN
- Required permission: secretsmanager:GetSecretValue in the task execution role
Example (ECS Task Definition):
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:region:account:secret:db-password"
}
]
Best practices:
- Never hardcode secrets in container images or environment variables
- Use Secrets Manager for secrets requiring rotation
- Use Parameter Store (SecureString) for static configuration
- Grant minimal IAM permissions for secret access
Image Scanning
Amazon Inspector (ECR Integration):
- Automatically scans images on push
- Detects vulnerabilities in OS packages and application dependencies
- Maps images to running containers (ECS tasks, EKS pods)
- Prioritizes vulnerabilities based on whether images are currently running
Best practices:
- Enable image tag immutability to prevent malicious overwrites
- Use EventBridge to trigger actions (delete insecure images, trigger rebuilds)
- Scan on every push
- Block deployment of images with critical vulnerabilities
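A sketch of an ECR create-repository input that applies immutability and scan-on-push (repository name is a placeholder); with Amazon Inspector enhanced scanning enabled, scanning can also be configured at the registry level.
{
  "repositoryName": "my-app",
  "imageTagMutability": "IMMUTABLE",
  "imageScanningConfiguration": { "scanOnPush": true },
  "encryptionConfiguration": { "encryptionType": "AES256" }
}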
Network Security
Security Groups:
- awsvpc mode: Assign security groups directly to tasks/pods
- Control inbound/outbound traffic at task level
- Stateful (return traffic automatically allowed)
NACLs (Network Access Control Lists):
- Subnet-level firewall rules
- Stateless (must configure inbound and outbound separately)
- Defense-in-depth layer
GuardDuty Runtime Monitoring (2023):
- Detects runtime security threats in ECS (EC2 and Fargate) and EKS
- Identifies suspicious activity, malware, unauthorized access
Runtime Security
Pod Security Standards (EKS):
- Kubernetes-native security policies (Restricted, Baseline, Privileged)
- Enforce via admission controllers (OPA Gatekeeper, Kyverno)
- Limit privileged containers (needed for system components like VPC CNI, but not application pods)
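Pod Security Standards are commonly enforced by labeling namespaces for the built-in Pod Security Admission controller; a minimal sketch as a Kubernetes JSON manifest (namespace name is a placeholder):
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "payments",
    "labels": {
      "pod-security.kubernetes.io/enforce": "restricted",
      "pod-security.kubernetes.io/warn": "baseline"
    }
  }
}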
Container-Optimized OS:
- Bottlerocket: AWS-managed, immutable, minimal attack surface
- Automatically patched via managed node groups
CIS Benchmark Compliance:
- Verify EKS/ECS configurations against CIS benchmarks
- Tools: AWS Security Hub, third-party scanners
Observability
CloudWatch Container Insights
Features:
- Collects, aggregates, and summarizes metrics and logs
- Instance-level, cluster-level, and task/pod-level metrics
- Pre-built dashboards (CPU, memory, network, disk)
Enhanced ECS Observability (December 2024):
- Granular visibility into container workloads
- Proactive monitoring and faster troubleshooting
Requirements:
- ECS on EC2: Container agent 1.4.0+ (latest recommended)
- EKS: Deploy CloudWatch agent via DaemonSet or Fargate logging
Logging
awslogs Log Driver:
- Forwards stdout/stderr to CloudWatch Logs
- Simple configuration in task definition
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/my-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
FireLens (Fluent Bit/Fluentd):
- Routes logs to third-party services (Datadog, Splunk, Elasticsearch)
- Flexible log transformation and routing
- Sidecar container pattern
Distributed Tracing
AWS X-Ray:
- Traces requests across microservices
- Identifies performance bottlenecks, errors
- Integrates with ECS and EKS via sidecar container or daemon
ADOT (AWS Distro for OpenTelemetry):
- Collects traces and metrics using OpenTelemetry
- Sends data to CloudWatch, X-Ray, Prometheus
- Vendor-neutral instrumentation
Prometheus and Grafana (EKS)
Amazon Managed Service for Prometheus:
- Fully managed Prometheus-compatible monitoring
- Agentless metric collection for EKS (2023)
- Integrates with Grafana for visualization
Amazon Managed Grafana:
- Fully managed Grafana for dashboards
- Pre-built dashboards for EKS, ECS
Deployment Patterns
Blue/Green Deployments
ECS Native Blue/Green (2025):
- Built-in blue/green without CodeDeploy
- Can change deployment controller after service creation
- Requires ALB
- Validates new revision before routing production traffic
- Instant rollback capability
EKS Blue/Green:
- Use separate Kubernetes deployments or namespaces
- Shift traffic via service selector or ingress controller
- Tools: Flagger, Argo Rollouts
Rolling Updates
ECS Rolling Update:
- Default deployment type
- Gradually replaces tasks with new version
- Configurable: minimumHealthyPercent and maximumPercent
- Example: 50% minimum, 200% maximum = deploy new tasks before stopping old ones
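A sketch of that rolling-update configuration inside the service definition:
"deploymentConfiguration": {
  "minimumHealthyPercent": 50,
  "maximumPercent": 200
}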
EKS Rolling Update:
- Kubernetes-native via Deployment resources
- Configurable: maxUnavailable and maxSurge
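The equivalent knobs on a Kubernetes Deployment, shown as a JSON fragment of spec (values illustrative):
"spec": {
  "strategy": {
    "type": "RollingUpdate",
    "rollingUpdate": {
      "maxUnavailable": 0,
      "maxSurge": "25%"
    }
  }
}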
Canary Deployments
ECS Native Canary (October 2025):
- Route small percentage of traffic to new revision
- Monitor metrics during bake time
- Gradually increase traffic
- Automatic rollback on CloudWatch alarm breach
EKS Canary:
- Use Flagger (progressive delivery tool)
- Argo Rollouts (GitOps-based canary)
Circuit Breaker Deployments (ECS)
Deployment Circuit Breaker:
- Monitors deployment health
- Stops launching new tasks if service cannot reach steady state
- Optionally rolls back to last successful deployment
- Only works with rolling update deployment type
"deploymentConfiguration": {
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
}
}
Cost Optimization
EC2 vs. Fargate Cost Comparison
Scenario: Running 10 tasks, 1 vCPU, 2 GB memory each, 24/7
| Launch Type | Configuration | Monthly Cost | Annual Cost |
|---|---|---|---|
| Fargate On-Demand | 10 vCPU, 20 GB | $356 | $4,272 |
| Fargate Spot | 10 vCPU, 20 GB | $107 (70% savings) | $1,284 |
| Fargate Graviton | 10 vCPU, 20 GB | $285 (20% savings) | $3,420 |
| Fargate Graviton + Spot | 10 vCPU, 20 GB | $86 (76% savings) | $1,032 |
| EC2 Reserved (1-year) | m5.large × 5 instances | $284 (35% savings vs On-Demand) | $3,408 |
| EC2 Reserved (3-year) | m5.large × 5 instances | $189 (57% savings vs On-Demand) | $2,268 |
Key takeaways:
- Fargate Spot + Graviton: Most cost-effective for fault-tolerant workloads ($86/month)
- EC2 Reserved (3-year): Best for steady-state, long-term workloads ($189/month)
- Fargate On-Demand: Most expensive but simplest ($356/month)
Savings Plans
Compute Savings Plans:
- 1-year: Up to 50% savings
- 3-year: Up to 66% savings
- Applies across EC2, Fargate, Lambda
- Flexible across instance families, sizes, regions
Best practice: Use Compute Savings Plans for baseline capacity, Spot for fault-tolerant workloads, On-Demand for unpredictable spikes.
Right-Sizing Containers
AWS Compute Optimizer:
- Uses machine learning to analyze utilization
- Recommends optimal CPU and memory configurations
- Customizable thresholds (CPU headroom, memory headroom)
- Lookback periods: 14, 32, or 93 days
Best practices:
- Monitor for 30 days to establish baseline
- Rightsize if max memory utilization < 40% over 4 weeks
- Use CloudWatch Container Insights for granular metrics
- EKS: Use Vertical Pod Autoscaler (VPA) for automated rightsizing
Graviton Processors
AWS Graviton2/Graviton3:
- ~20% lower cost than x86 (Intel/AMD)
- Better performance per dollar
- Supported by most popular software packages
Migration:
- Rebuild container images for ARM64 architecture
- Test compatibility (most modern software supports ARM)
- Potential effort: Moderate (rebuilding images, testing)
Fargate Graviton + Spot (September 2024):
- Combine 20% Graviton savings with 70% Spot discount
- Total: Up to 76% savings vs. Fargate On-Demand
Service Selection Framework
Decision Matrix
| Use Case | Recommended Service | Rationale |
|---|---|---|
| Simple microservices, AWS-centric | ECS on Fargate | Minimal management, deep AWS integration |
| High-volume production, cost-critical | ECS on EC2 with Reserved Instances | Most cost-effective for steady-state |
| Kubernetes ecosystem required | EKS with Managed Node Groups | Standard Kubernetes, rich tooling |
| Multi-cloud, hybrid deployments | EKS with Hybrid Nodes | Portability, unified management |
| Variable workloads, dev/test | ECS on Fargate | Pay only for usage, no idle costs |
| Batch processing, fault-tolerant | Fargate Spot or EC2 Spot | Up to 70-90% cost savings |
| GPU workloads, custom kernels | ECS on EC2 or EKS on EC2 | Full control over instances |
Specific Scenarios
E-commerce platform (200 tasks running 24/7):
- Recommendation: ECS on EC2 with Reserved Instances
- Rationale: Steady-state workload; EC2 Reserved (3-year) saves ~$20,000/year vs. Fargate
Startup with 10 microservices, unpredictable traffic:
- Recommendation: ECS on Fargate
- Rationale: No capacity planning, automatic scaling, pay only for usage
Financial services (300+ microservices, multi-cloud strategy):
- Recommendation: EKS
- Rationale: Kubernetes provides consistent experience across AWS, Azure, on-premises
Data processing pipeline (batch jobs):
- Recommendation: Fargate Spot
- Rationale: 70% cost savings, fault-tolerant workloads
Common Pitfalls
Over-Provisioning Resources
Problem: Allocating too much CPU/memory wastes money.
Example: A Fargate task configured with 2 vCPU but using only 0.5 vCPU wastes roughly 1.5 vCPU, about $0.06/hour (~$44/month per task) at on-demand rates.
Solution:
- Use AWS Compute Optimizer for rightsizing recommendations
- Monitor actual utilization for 30 days
- Start conservative, scale up as needed
Not Using Fargate Spot
Problem: Running fault-tolerant workloads on Fargate On-Demand pays 3x more than necessary.
Solution:
- Identify workloads that tolerate interruptions (batch jobs, CI/CD, stateless services)
- Use Fargate Spot for up to 70% savings
- Implement retry logic for interrupted tasks
Gotcha: Fargate Spot capacity not guaranteed—have fallback to On-Demand if Spot unavailable.
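A sketch of a capacity provider strategy that keeps a small On-Demand baseline and sends the remaining tasks mostly to Spot (base and weights are illustrative):
"capacityProviderStrategy": [
  { "capacityProvider": "FARGATE", "base": 2, "weight": 1 },
  { "capacityProvider": "FARGATE_SPOT", "weight": 3 }
]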
Improper Health Checks
Problem: Missing or misconfigured health checks cause endless restart loops.
Common issues:
- Health check command not included in container image
- Timeout too short (check executing longer than timeout allows)
- Retry count too low (transient failures mark container unhealthy)
Best practices:
- Test health check commands locally
- Set interval to 30 seconds, timeout to 5 seconds, retries to 3
- Use /health or /healthz endpoints for HTTP-based checks
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
Missing Auto-Scaling Configuration
Problem: Services cannot handle traffic spikes or waste resources during low traffic.
Example: Service configured with 10 tasks constantly, but traffic varies 5x throughout the day. Auto-scaling (min 2, max 20) saves $150/month.
Solution:
- Configure target tracking scaling (70% CPU utilization)
- Set reasonable min/max task counts
- Test scaling behavior under load
Kubernetes Over-Engineering
Problem: Choosing EKS for simple workloads adds unnecessary complexity and cost.
Costs of EKS:
- Control plane: $73/month per cluster
- Operational burden: Managing Kubernetes manifests, namespaces, RBAC, CRDs
- Learning curve: Requires Kubernetes expertise
When to avoid EKS:
- Team has no Kubernetes experience
- Running simple microservices (fewer than 20 services)
- No need for Kubernetes ecosystem
- AWS-only deployment
Example: Team of 3 developers deploying 5 microservices chose EKS because “Kubernetes is the industry standard.” Spent 6 months learning Kubernetes, fighting YAML configuration errors, debugging networking issues. ECS would have taken 1 week to set up.
Not Blocking Instance Metadata Access
Problem: Pods/tasks inherit instance profile permissions, violating least privilege.
Solution:
- Use IRSA (EKS) or Task IAM Roles (ECS) for fine-grained permissions
- Block IMDS access via network policy or firewall rules
- ECS (EC2 launch type): Set ECS_AWSVPC_BLOCK_IMDS=true in the container agent configuration so tasks cannot reach instance metadata
- EKS: Use network policies to block 169.254.169.254
Using Untagged Images
Problem: Untagged images accumulate, wasting ECR storage costs.
Solution:
- Implement ECR lifecycle policies
- Expire untagged images after 30 days
- Keep last 30 tagged images per repository
{
"rules": [
{
"rulePriority": 1,
"description": "Expire untagged images after 30 days",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 30
},
"action": {
"type": "expire"
}
}
]
}
Key Takeaways
1. Choose the right service based on expertise and requirements: ECS for simplicity and AWS integration ($0 control plane cost). EKS for Kubernetes ecosystem and portability ($73/month per cluster). Let team expertise and portability needs guide the decision.
2. Fargate vs EC2 depends on workload patterns: Fargate excels at variable workloads and eliminates infrastructure management (pay-per-second). EC2 with Reserved Instances is 3-9x cheaper for steady-state workloads. Use Fargate Spot + Graviton for up to 76% savings on fault-tolerant workloads.
3. Container networking matters for security and performance: Use awsvpc mode for task-level security groups (required for Fargate). Use ECS Service Connect or VPC Lattice for service-to-service communication. ALB for HTTP/HTTPS, NLB for TCP/UDP ultra-low latency.
4. Storage depends on access patterns and availability needs: Ephemeral storage (20-200 GB) for temporary data. EBS for high-performance single-AZ persistent storage. EFS for shared multi-AZ persistent storage accessible by multiple tasks.
5. Security requires multiple layers: Use IAM roles for tasks/pods (not instance profiles). Store secrets in Secrets Manager or Parameter Store (not environment variables). Enable ECR image scanning with Amazon Inspector. Block instance metadata access. Use GuardDuty Runtime Monitoring for threat detection.
6. Observability is critical for troubleshooting: Enable CloudWatch Container Insights for metrics. Use awslogs or FireLens for centralized logging. Use X-Ray or ADOT for distributed tracing. For EKS, integrate Amazon Managed Prometheus and Grafana.
7. Deployment patterns enable zero-downtime releases: Use ECS native blue/green or canary deployments (2025 features). Enable the deployment circuit breaker for automatic rollback on failures. Configure rolling updates with appropriate minimumHealthyPercent and maximumPercent.
8. Cost optimization requires multiple strategies: Use Compute Savings Plans (up to 66% savings) for baseline capacity. Use Fargate Spot (70% savings) or EC2 Spot (90% savings) for fault-tolerant workloads. Use Graviton processors (20% cheaper). Rightsize containers with Compute Optimizer. Implement ECR lifecycle policies.
9. Auto-scaling prevents both under-provisioning and waste: Use target tracking scaling (70% CPU recommended). Enable Predictive Scaling (November 2024) for machine learning-based forecasting. Test scaling behavior under load. Set reasonable min/max task counts.
10. Avoid common pitfalls: Don’t over-provision resources (use Compute Optimizer). Don’t ignore Fargate Spot for fault-tolerant workloads (70% savings). Don’t use shallow health checks (verify application health, not just instance responsiveness). Don’t choose EKS for simple workloads when ECS suffices. Don’t forget to block instance metadata access when using task/pod IAM roles.
Recent 2024-2025 improvements: ECS Managed Instances (recommended for new workloads). EKS Auto Mode (December 2024). ECS native blue/green and canary deployments (October 2025). Fargate Spot with Graviton support (September 2024). Enhanced ECS Observability (December 2024). EBS volume support for ECS tasks (January 2024). VPC Lattice general availability (cross-service networking).