Advanced Container Patterns on Azure
What Are Advanced Container Patterns
Azure Kubernetes Service (AKS) provides managed Kubernetes, but production-grade container platforms require patterns that address observability, security, cost optimization, and operational complexity at scale. Advanced AKS patterns leverage Azure-native integrations like Azure CNI networking modes, workload identity, KEDA autoscaling, and GitOps tooling to build resilient, observable, and cost-effective container platforms.
These patterns extend beyond getting workloads running in Kubernetes to addressing the challenges that emerge when operating container platforms supporting multiple teams, diverse workload types, and stringent reliability requirements.
What Problems Advanced Container Patterns Solve
Without advanced patterns:
- Basic networking limits security isolation and multi-tenancy
- Manual scaling strategies cannot respond to event-driven workload spikes
- Observability gaps make troubleshooting microservices failures difficult
- Manual deployment processes create drift between environments
- Cost management relies on guesswork and reactive rightsizing
- Securing pod-to-Azure service authentication requires complex credential management
With advanced patterns:
- Sophisticated networking provides workload isolation, policy enforcement, and optimized pod IP allocation
- Event-driven autoscaling responds to queue depth, custom metrics, and external event sources
- Service mesh provides built-in observability, traffic management, and zero-trust security
- GitOps ensures declarative, auditable deployments with drift detection
- Workload identity eliminates credential management by leveraging Azure Entra ID
- Multi-cluster patterns distribute workloads across regions and failure domains
How Azure AKS Differs from AWS EKS
Architects familiar with AWS EKS should understand several key differences in how Azure approaches advanced Kubernetes patterns:
| Concept | AWS EKS | Azure AKS |
|---|---|---|
| Pod networking | AWS VPC CNI assigns VPC IPs to pods (consumes VPC address space) | Azure CNI Overlay uses separate pod CIDR (conserves VNet space); Azure CNI Powered by Cilium adds eBPF capabilities |
| Serverless pods | Fargate profiles for serverless pod execution | Virtual Nodes (Azure Container Instances burst capacity) |
| Node autoscaling | Cluster Autoscaler + Karpenter (originated at AWS) | Cluster Autoscaler + Node Auto-Provisioning (preview, built on Karpenter) |
| Service mesh | AWS App Mesh (managed) + community options | Istio-based Service Mesh add-on (managed), OSM, Linkerd |
| Workload identity | IAM Roles for Service Accounts (IRSA) | Workload Identity (Azure Entra ID federation) |
| GitOps | Flux via EKS add-on | Flux v2 via AKS GitOps extension |
| Event-driven autoscaling | Manual KEDA installation or third-party | KEDA add-on (managed, integrated) |
| Multi-cluster | EKS Connector, third-party tools | Azure Fleet Manager (managed multi-cluster orchestration) |
AKS Advanced Networking
Networking Models Overview
AKS supports four networking models, each with different trade-offs around IP consumption, performance, and operational complexity.
| Networking Model | Pod IPs Source | VNet IP Consumption | Performance | Use Case |
|---|---|---|---|---|
| kubenet | Overlay network (10.244.0.0/16 default) | Only node IPs | Good | Development, small clusters, IP-constrained environments |
| Azure CNI | VNet address space | Node + pod IPs | Best | Production workloads needing direct pod-to-VNet connectivity |
| Azure CNI Overlay | Separate pod CIDR overlay | Only node IPs | Good | Large clusters needing VNet IP conservation |
| Azure CNI Powered by Cilium | Separate pod CIDR overlay | Only node IPs | Best (eBPF-optimized) | Advanced network policy, observability, performance |
Azure CNI Overlay
Azure CNI Overlay separates the pod IP address space from the VNet, using an overlay network for pod-to-pod communication while preserving direct VNet integration for nodes and services.
How it works:
- Nodes get IPs from the VNet subnet as usual
- Pods get IPs from a separate CIDR range (e.g., 10.244.0.0/16) not part of the VNet
- Kubernetes services can still receive VNet IPs via load balancers
- Overlay network routes pod traffic through the node's VNet interface
Advantages over standard Azure CNI:
- Supports very large clusters (100,000+ pods) because pod IPs come from the overlay CIDR; standard Azure CNI is constrained by subnet size and a hard cap of 250 pods per node
- Conserves VNet address space; large clusters do not exhaust subnet IPs
- Faster cluster scaling because pod IPs do not require VNet IP allocation
- Compatible with existing VNet configurations
Trade-offs compared to standard Azure CNI:
- Pods are not directly routable from outside the cluster (requires NodePort, LoadBalancer, or Ingress)
- Some advanced VNet features like Private Endpoints integrated at the pod level may have limitations
- Slight performance overhead from overlay encapsulation
Azure CNI Powered by Cilium
Azure CNI Powered by Cilium combines Azure CNI Overlay with Cilium's eBPF-based data plane for enhanced performance, observability, and security capabilities.
What Cilium adds:
- eBPF-accelerated networking: Packet processing in the kernel, bypassing iptables overhead
- Enhanced network policies: Layer 7 (HTTP, gRPC, Kafka) policy enforcement in addition to Layer 3/4
- Deep observability: Hubble provides flow visualization, service dependency mapping, and DNS observability
- Transparent encryption: WireGuard-based encryption between pods with minimal performance impact
- Advanced load balancing: Maglev consistent hashing for service load balancing
When to use Azure CNI Powered by Cilium:
- High-performance workloads where iptables overhead is a bottleneck
- Security requirements for Layer 7 network policies (e.g., only allow GET requests to specific paths)
- Observability requirements for fine-grained service-to-service traffic analysis
- Multi-cluster service mesh architectures leveraging Cilium ClusterMesh
- Zero-trust networking requirements with transparent pod-to-pod encryption
Trade-offs:
- More complex troubleshooting when issues arise (eBPF debugging requires specialized knowledge)
- Newer technology with less operational maturity than standard Azure CNI
- Some Kubernetes network policy features behave differently with Cilium's extended policy model
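For illustration, a minimal sketch of the kind of Layer 7 policy described above, allowing only GET requests to a specific path (the app labels, port, and path are hypothetical):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-get-orders
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: order-api        # pods this policy protects
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend   # only the frontend may call in
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/orders/.*"   # L7 rule: GET on /orders/* only
```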
Network Policy Engines
Kubernetes network policies define rules for pod-to-pod traffic. AKS supports multiple policy engines.
| Engine | Performance | Features | Complexity | Use Case |
|---|---|---|---|---|
| Azure Network Policies | Good | Basic L3/L4 policies | Low | Simple production workloads |
| Calico | Good | Advanced L3/L4 policies, global network policy, egress controls | Medium | Enterprise security requirements |
| Cilium | Best (eBPF) | L7 policies, observability, encryption | High | High-performance workloads with advanced security |
Recommendation: Start with Azure Network Policies for simplicity. Upgrade to Calico when you need advanced policy features like global policy or egress gateway. Choose Cilium when performance, observability, or Layer 7 policy enforcement is critical.
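Regardless of engine, baseline isolation starts with standard Kubernetes NetworkPolicy objects, which all three engines enforce. A minimal sketch that allows ingress to a workload only from pods with a given label (the labels and port are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: order-api      # pods this policy applies to
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```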
Service Mesh Integration
What a Service Mesh Provides
A service mesh adds observability, traffic management, and security capabilities to microservices communication without changing application code. The mesh intercepts network traffic between services using sidecar proxies deployed alongside each pod.
Core service mesh capabilities:
- Observability: Automatic metrics, logs, and distributed traces for all service-to-service calls
- Traffic management: Canary deployments, traffic splitting, circuit breaking, retries, timeouts
- Security: Mutual TLS (mTLS) between services, certificate management, fine-grained authorization policies
- Resilience: Automatic retries, outlier detection, connection pooling, load balancing
Service Mesh Options on AKS
Azure supports three primary service mesh options:
| Service Mesh | Management | Maturity | Complexity | Use Case |
|---|---|---|---|---|
| Istio-based Service Mesh add-on | Managed by Azure | Stable | High | Production workloads needing comprehensive traffic control |
| Open Service Mesh (OSM) | Community-maintained | Deprecated (EOL 2024) | Medium | Legacy; migrate to Istio or Linkerd |
| Linkerd | Self-managed | Stable | Medium | Lightweight mesh for simpler use cases |
Istio-based Service Mesh Add-on
The Istio-based Service Mesh add-on provides a managed Istio installation where Azure handles control plane upgrades, patching, and lifecycle management.
How it works:
- Azure manages the Istio control plane (istiod) as a system workload
- Sidecar injection is automatic when you label namespaces (e.g., `istio-injection=enabled`)
- Envoy proxies intercept all pod network traffic
- Configuration uses standard Istio APIs (VirtualService, DestinationRule, Gateway, etc.)
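Because configuration uses the standard Istio APIs, traffic management looks the same as upstream Istio. A minimal sketch of a 90/10 canary split (service name, namespace, and version labels are hypothetical):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
  namespace: production
spec:
  host: my-app
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
  namespace: production
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: stable
          weight: 90            # 90% of traffic to the stable subset
        - destination:
            host: my-app
            subset: canary
          weight: 10            # 10% of traffic to the canary subset
```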
Key features:
- Ingress Gateway: Managed ingress gateway for external traffic routing
- Egress Gateway: Controlled egress for outbound traffic to external services
- Certificate management: Automatic certificate rotation for mTLS
- Integration with Azure Monitor: Telemetry flows to Azure Monitor for centralized observability
- Multi-cluster support: Federate service mesh across multiple AKS clusters
When to use the Istio add-on:
- Complex microservices architectures with sophisticated traffic routing needs
- Security requirements for zero-trust mTLS between all services
- Canary deployments, A/B testing, or blue-green deployments at the network layer
- Integration with Azure PaaS services (Application Gateway, Azure Monitor)
Trade-offs:
- Resource overhead: Each pod gets an Envoy sidecar, increasing memory and CPU usage
- Increased latency from sidecar proxy hops (typically 1-5ms per hop)
- Steep learning curve for Istio APIs and concepts
- Debugging complexity when traffic behavior differs from expectations
Linkerd
Linkerd is a lightweight, CNCF-graduated service mesh focused on simplicity and performance.
Advantages over Istio:
- Lower resource overhead (smaller, more efficient proxies)
- Simpler architecture with fewer moving parts
- Faster control plane and data plane
- Easier to learn and operate
Disadvantages compared to Istio:
- Fewer advanced traffic management features
- Smaller ecosystem and community compared to Istio
- Self-managed on AKS; no Azure-managed add-on
When to use Linkerd:
- Simpler microservices architectures where basic mTLS and observability suffice
- Resource-constrained environments where sidecar overhead matters
- Teams prioritizing operational simplicity over feature richness
When You Do NOT Need a Service Mesh
Service meshes add complexity and overhead. Consider whether you need one before adopting.
You may not need a service mesh if:
- Your application is a monolith or has few services (service mesh overhead exceeds benefit)
- You already have observability through APM tools (Datadog, New Relic, Application Insights)
- Your services are stateless and failures are handled by Kubernetes retries and readiness probes
- Network security requirements are met by network policies alone
Alternatives to service mesh:
- Application-level instrumentation: OpenTelemetry SDKs provide observability without a mesh
- API Gateway: Azure API Management or open-source gateways provide traffic management at the edge
- Network policies: Cilium or Calico network policies enforce pod-to-pod security without mTLS overhead
Event-Driven Autoscaling with KEDA
What Is KEDA
KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes Horizontal Pod Autoscaler (HPA) to scale workloads based on external event sources like message queues, databases, and custom metrics.
Azure AKS provides KEDA as a managed add-on, eliminating manual installation and maintenance.
How KEDA Works
Standard Kubernetes HPA scales based on CPU and memory metrics. KEDA extends HPA to scale based on external metrics from dozens of sources.
KEDA components:
- ScaledObject: Custom resource defining autoscaling rules and trigger metrics
- ScaledJob: Scales Kubernetes Jobs based on event sources
- Scaler: Plugin for a specific event source (Azure Service Bus, Azure Storage Queue, Kafka, Prometheus, etc.)
- Metrics Adapter: Exposes external metrics to HPA
Scaling behavior:
- KEDA monitors the event source via the scaler
- When the metric threshold is met (e.g., queue depth > 10), KEDA instructs HPA to scale pods
- When the metric falls to zero, KEDA can scale to zero pods (not possible with standard HPA)
Common KEDA Scalers for Azure
| Scaler | Event Source | Use Case |
|---|---|---|
| Azure Service Bus Queue | Queue message count or age | Background job processing, async tasks |
| Azure Service Bus Topic | Subscription message count | Event-driven microservices |
| Azure Storage Queue | Queue message count | Simple job queues without Service Bus features |
| Azure Blob Storage | Blob count in container | Batch processing of uploaded files |
| Azure Event Hubs | Unprocessed event count (lag) | Stream processing workloads |
| Prometheus | Custom application metrics | Scaling based on business metrics (active users, pending orders) |
| RabbitMQ | Queue length | Open-source message queue workloads |
| Kafka | Consumer group lag | Event streaming platforms |
Example: Scaling Based on Azure Service Bus Queue
Scenario: Pods process messages from an Azure Service Bus queue. When the queue depth exceeds 10 messages, scale up; when empty, scale to zero.
ScaledObject definition:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        namespace: mycompany-servicebus
        messageCount: "10"
      authenticationRef:
        name: azure-servicebus-auth
```
How this works:
- KEDA monitors the `orders` queue in the `mycompany-servicebus` namespace
- When the queue has more than 10 messages, KEDA scales the `order-processor` deployment
- Scaling is proportional: 100 messages scales to approximately 10 pods (100 / 10)
- When the queue empties, KEDA scales to zero, eliminating idle pod costs
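The ScaledObject above references `azure-servicebus-auth` without defining it. A minimal sketch of that TriggerAuthentication using workload identity rather than a stored connection string (assumes the KEDA add-on and workload identity are enabled on the cluster):

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-servicebus-auth
  namespace: production
spec:
  # Authenticate to Service Bus with a federated (workload) identity
  # instead of keeping a connection string in a Secret.
  podIdentity:
    provider: azure-workload
```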
When to Use KEDA
KEDA is valuable for:
- Background job processing with unpredictable workload spikes
- Event-driven architectures where work arrives in bursts
- Cost optimization by scaling to zero during idle periods
- Queue-based decoupling between frontend and backend services
Do not use KEDA for:
- Workloads requiring constant availability (use standard HPA with min replicas > 0)
- Workloads where cold-start latency is unacceptable (scaling from zero takes time)
- Simple CPU-based scaling where standard HPA suffices
GitOps with Flux v2
What Is GitOps
GitOps uses Git as the single source of truth for declarative infrastructure and application definitions. Changes are made via pull requests, and automation reconciles the cluster state with the Git repository state.
GitOps principles:
- Declarative: Infrastructure and application state is expressed declaratively (YAML manifests)
- Versioned and immutable: Git commits provide version history and immutability
- Pulled automatically: Operators running in the cluster pull changes from Git
- Continuously reconciled: Operators detect drift and restore desired state
Flux v2 on AKS
The AKS GitOps extension provides a managed Flux v2 installation for continuous delivery to AKS clusters.
Flux v2 components:
- Source Controller: Fetches manifests from Git, Helm repositories, S3 buckets
- Kustomize Controller: Applies Kustomize overlays to manifests
- Helm Controller: Manages Helm chart installations
- Notification Controller: Sends alerts and integrates with external systems
How Flux works on AKS:
- Configure a `FluxConfiguration` resource pointing to a Git repository
- Azure provisions Flux controllers in the cluster
- Flux pulls manifests from the Git repository at regular intervals
- Flux applies changes to the cluster using Kustomize or Helm
- Flux detects drift and reconciles cluster state with Git
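Under the hood, the GitOps extension installs the upstream Flux controllers, and the configuration is ultimately realized as Flux source and kustomize resources. A minimal sketch using the upstream Flux v2 CRDs (repository URL and paths are placeholders):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app-gitops
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/mycompany/my-app-gitops   # placeholder repository
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app-prod
  namespace: flux-system
spec:
  interval: 5m
  path: ./overlays/prod
  prune: true               # delete cluster objects removed from Git
  sourceRef:
    kind: GitRepository
    name: my-app-gitops
```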
Example GitOps Workflow
Repository structure:
```
my-app-gitops/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── dev/
    │   └── kustomization.yaml
    ├── staging/
    │   └── kustomization.yaml
    └── prod/
        └── kustomization.yaml
```
FluxConfiguration for production:
```yaml
apiVersion: fluxcd.io/v1
kind: FluxConfiguration
metadata:
  name: my-app-prod
spec:
  sourceRef:
    kind: GitRepository
    name: my-app-gitops
    namespace: flux-system
  path: overlays/prod
  prune: true
  interval: 5m
```
Workflow:
- Developers change application manifests in the `my-app-gitops` repository
- Pull request is reviewed and merged to the main branch
- Flux detects the commit within 5 minutes
- Flux applies the new manifests to the production cluster
- If someone manually changes the cluster (e.g., kubectl edit), Flux reverts it to match Git
Benefits of GitOps on AKS
Auditability: Every change is a Git commit with author, timestamp, and review history
Rollback: Revert a Git commit to roll back a deployment
Consistency: All environments are declaratively defined and continuously reconciled across development, staging, and production
Security: Cluster credentials are not required for deployments; Flux pulls from Git, developers never kubectl apply directly
Multi-cluster: Deploy to multiple AKS clusters from a single Git repository with branch-based or directory-based separation
GitOps Anti-Patterns
Do not use GitOps for:
- Secrets management (use Azure Key Vault with External Secrets Operator or Sealed Secrets)
- Storing generated or templated files that change frequently (use Helm or Kustomize to generate manifests dynamically)
- Monolithic repositories that become bottlenecks for multiple teams (separate repositories by team or workload)
Workload Identity
What Is Workload Identity
Workload Identity enables Kubernetes service accounts to authenticate to Azure services using Azure Entra ID (formerly Azure Active Directory), eliminating the need to manage credentials in pods.
Workload Identity replaces the deprecated Pod Identity model with a simpler, more secure federation-based approach.
How Workload Identity Works
- Create an Azure managed identity with permissions to access Azure resources (e.g., Storage, Key Vault)
- Establish a federated identity credential linking the managed identity to a Kubernetes service account
- Annotate the Kubernetes service account with the managed identity's client ID
- Pods using the service account automatically receive an Azure token for authentication
Under the hood:
- AKS workload identity uses OpenID Connect (OIDC) federation
- The AKS cluster issues a Kubernetes service account token to the pod
- Azure Entra ID exchanges the Kubernetes token for an Azure access token
- The pod uses the Azure token to authenticate to Azure services
No credentials are stored in the cluster. Authentication relies on trust between the AKS OIDC issuer and Azure Entra ID.
Example: Pod Accessing Azure Key Vault
Azure setup:
- Create a managed identity: `my-app-identity`
- Grant the identity access to Key Vault: `Key Vault Secrets User` role
- Create a federated credential for the identity:
  - Issuer: the AKS OIDC issuer URL (e.g., `https://eastus.oic.prod-aks.azure.com/tenant-id/issuer-id/`)
  - Subject: `system:serviceaccount:production:my-app`
Kubernetes setup:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
```
Pod using workload identity:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
        azure.workload.identity/use: "true"   # enables token injection for this pod
    spec:
      serviceAccountName: my-app
      containers:
        - name: app
          image: mycompany/my-app:latest
          env:
            - name: AZURE_CLIENT_ID
              value: "12345678-1234-1234-1234-123456789abc"
```
What happens at runtime:
- The pod receives a Kubernetes service account token
- The Azure SDK exchanges the Kubernetes token for an Azure access token
- The pod authenticates to Key Vault using the Azure token
- No secrets, passwords, or connection strings in the pod
Workload Identity vs Pod Identity
| Aspect | Workload Identity | Pod Identity (deprecated) |
|---|---|---|
| Authentication | OIDC federation | Managed Identity assigned to VMSS |
| Credential storage | None | None |
| Setup complexity | Lower (fewer components) | Higher (NMI daemonset, aad-pod-identity) |
| Security | Better (OIDC standard) | Good |
| Status | Recommended, actively maintained | Deprecated |
Migration: If using Pod Identity, migrate to Workload Identity. Microsoft provides a migration guide and tooling.
Node Pool Strategies
System vs User Node Pools
AKS distinguishes between system node pools (for AKS system components) and user node pools (for application workloads).
| Pool Type | Purpose | Characteristics |
|---|---|---|
| System node pool | Runs CoreDNS, metrics-server, kube-proxy, tunnelfront | Must have at least one node, cannot be deleted, and can be tainted (`CriticalAddonsOnly=true:NoSchedule`) to repel user pods |
| User node pool | Runs application workloads | Can scale to zero, can be deleted, and has no special taints |
Best practice: Separate system and user node pools. Use a small system pool with reliable VM SKUs and larger user pools for application workloads. This prevents resource contention from affecting cluster stability.
Spot Node Pools
Spot node pools use Azure Spot VMs, offering significant cost savings in exchange for eviction risk when Azure needs capacity.
Characteristics:
- Spot VMs can be evicted with 30 seconds notice
- Pricing varies based on demand; typically 60-90% cheaper than on-demand VMs
- AKS gracefully drains spot nodes before eviction when possible
When to use spot node pools:
- Batch processing workloads that can tolerate interruptions
- Stateless web applications with multiple replicas
- CI/CD build agents
- Data processing pipelines with checkpointing
Do not use spot pools for:
- Stateful workloads without graceful shutdown handling
- Latency-sensitive workloads requiring consistent availability
- Single-replica deployments where eviction causes downtime
Best practice: Use pod topology spread constraints or pod anti-affinity to distribute replicas across spot and on-demand node pools, ensuring at least some replicas survive spot evictions.
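A sketch of that best practice, assuming the standard AKS spot taint `kubernetes.azure.com/scalesetpriority=spot:NoSchedule` and the per-pool `kubernetes.azure.com/agentpool` node label (verify both on your cluster): the Deployment tolerates spot nodes while spreading replicas across node pools.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Allow (but do not require) scheduling onto AKS spot nodes.
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      # Spread replicas across node pools so a spot eviction never
      # removes every replica at once.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.azure.com/agentpool
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: app
          image: mycompany/my-app:latest
```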
GPU Node Pools
AKS supports GPU-enabled VMs for machine learning inference, training, and high-performance computing workloads.
Common GPU VM series:
| VM Series | GPU | Use Case |
|---|---|---|
| NC-series | NVIDIA Tesla | ML training, HPC |
| ND-series | NVIDIA Volta, Ampere | Large-scale deep learning |
| NV-series | NVIDIA Tesla M60 | Visualization, rendering |
GPU node pool considerations:
- GPU drivers and device plugins are installed automatically by AKS
- Pods must request GPU resources in resource limits to be scheduled on GPU nodes
- Taints and tolerations prevent non-GPU pods from wasting GPU node capacity
- Cost is significantly higher than CPU-only nodes; use node autoscaler to scale to zero when idle
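A minimal sketch of a pod that requests a GPU and tolerates a GPU-pool taint. The `nvidia.com/gpu` resource name is exposed by the NVIDIA device plugin; the `sku=gpu:NoSchedule` taint is a common convention you would apply to the GPU pool yourself, not an AKS default.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
  namespace: production
spec:
  containers:
    - name: inference
      image: mycompany/ml-inference:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1   # requests one GPU, so the pod lands on a GPU node
  tolerations:
    - key: sku
      operator: Equal
      value: gpu
      effect: NoSchedule
```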
Virtual Nodes (Azure Container Instances Burst)
Virtual Nodes enable AKS to burst workloads to Azure Container Instances (ACI) when cluster capacity is exhausted or for ultra-fast scaling.
How virtual nodes work:
- A virtual node appears as a node in the cluster with massive capacity
- Pods scheduled to the virtual node run in ACI, not on VMs
- Pods start in seconds without waiting for VM provisioning
- Billed per-second based on CPU and memory consumption
When to use virtual nodes:
- Burst workloads exceeding cluster capacity
- Event-driven jobs requiring instant scale-out
- Cost optimization for infrequent, short-lived workloads
Limitations:
- Not all Kubernetes features are supported (host networking, DaemonSets, stateful workloads)
- Networking configuration is more complex (requires subnet delegation)
- Some Azure integrations (like managed identities) require additional configuration
Cluster Autoscaling
Cluster Autoscaler
The Cluster Autoscaler automatically adjusts the number of nodes in a node pool based on pod scheduling and resource utilization.
How it works:
- Pods enter a pending state due to insufficient cluster capacity
- Cluster Autoscaler detects unschedulable pods
- Autoscaler adds nodes to the node pool
- When nodes are underutilized for a period (default 10 minutes), Autoscaler removes them
Configuration parameters:
| Parameter | Purpose | Typical Value |
|---|---|---|
| `--min-count` | Minimum nodes in pool | 1 (system pool), 0 (user pool) |
| `--max-count` | Maximum nodes in pool | Based on workload requirements |
| `--scale-down-delay-after-add` | Wait time before scale-down after scale-up | 10 minutes |
| `--scale-down-utilization-threshold` | CPU/memory usage threshold for scale-down | 0.5 (50%) |
Best practices:
- Set min-count to 0 for user node pools to reduce costs during idle periods
- Set max-count based on quota limits and cost constraints
- Use pod disruption budgets (PDBs) to control scale-down behavior and prevent service disruption
- Configure node pool taints and tolerations to ensure correct workload placement
Node Auto-Provisioning (Preview)
Node Auto-Provisioning (NAP) is Azure's next-generation autoscaling feature, built on the open-source Karpenter project that originated at AWS.
How NAP differs from Cluster Autoscaler:
- Cluster Autoscaler scales existing node pools; NAP provisions new node pools dynamically
- NAP selects optimal VM SKUs based on pod requirements (CPU, memory, GPU, architecture)
- NAP can consolidate workloads onto fewer nodes for cost optimization
- NAP responds faster to scaling events because it is not limited to predefined node pools
Current status: Preview feature; not recommended for production yet. Cluster Autoscaler remains the stable choice.
Watch for: NAP reaching GA in 2026, at which point it will become the recommended autoscaling solution for AKS.
Multi-Cluster Patterns
Azure Fleet Manager
Azure Fleet Manager provides centralized management for multiple AKS clusters across regions and subscriptions.
Core capabilities:
- Cluster orchestration: Manage cluster upgrades, configuration, and policy across fleets
- Multi-cluster services: Load balance traffic across clusters with cross-cluster service discovery
- GitOps at scale: Apply Flux configurations to multiple clusters from a single definition
- Resource propagation: Distribute resources (ConfigMaps, Secrets, Custom Resources) to clusters
Use cases for Fleet Manager:
| Scenario | How Fleet Helps |
|---|---|
| Multi-region deployments | Deploy applications to clusters in multiple regions from a central control plane |
| Disaster recovery | Distribute workloads across regions with automated failover |
| Staging and production | Manage cluster configuration across environments |
| Global traffic routing | Route user requests to the nearest regional cluster |
Fleet resource propagation example:
```yaml
apiVersion: fleet.azure.com/v1alpha1
kind: ClusterResourcePlacement
metadata:
  name: deploy-app-to-all-regions
spec:
  resourceSelectors:
    - group: apps
      kind: Deployment
      name: my-app
      namespace: production
  policy:
    placementType: PickAll
```
What this does:
- Fleet Manager replicates the `my-app` deployment to all clusters in the fleet
- Changes to the deployment in the hub cluster propagate to member clusters
- If a cluster is unhealthy, Fleet removes it from the placement
Cross-Cluster Service Discovery
Fleet Manager supports service discovery across clusters, allowing pods in one cluster to call services in another.
How it works:
- Export a service from one cluster using `ServiceExport`
- Fleet Manager creates a `ServiceImport` in other clusters
- DNS entries are created for cross-cluster service calls
- Traffic flows through Azure load balancers spanning clusters
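A minimal sketch of exporting a service from a member cluster. The API group and version shown are assumptions based on the Fleet networking CRDs; verify the exact names against the Fleet Manager documentation for your version.

```yaml
apiVersion: networking.fleet.azure.com/v1alpha1
kind: ServiceExport
metadata:
  name: my-app          # must match the name of the Service being exported
  namespace: production
```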
Trade-offs:
- Latency increases for cross-cluster calls (typically 5-20ms depending on regions)
- Network costs apply for inter-region traffic
- Service mesh complexity increases when spanning multiple clusters
When to use cross-cluster services:
- Global applications requiring low-latency local processing with occasional cross-region calls
- Disaster recovery scenarios where services fail over to a secondary cluster
- Multi-region data processing pipelines where stages run in different clusters
Confidential Containers
What Are Confidential Containers
Confidential containers on AKS run workloads in hardware-based Trusted Execution Environments (TEEs), protecting data in use from the host OS, hypervisor, and even Azure administrators.
How confidential containers work:
- Containers run inside AMD SEV-SNP or Intel SGX enclaves
- Memory is encrypted and isolated from the host
- Attestation verifies that the workload is running in a genuine TEE before processing sensitive data
Use cases:
- Processing highly sensitive data like financial records, healthcare information, or government secrets
- Multi-party computation where parties do not trust each other's infrastructure
- Regulatory compliance requirements for data protection at rest, in transit, and in use
Trade-offs:
- Limited VM SKU availability (DCasv5, DCadsv5 series)
- Performance overhead from encryption (typically 10-20% depending on workload)
- Smaller memory limits compared to general-purpose VMs
- Additional complexity in application deployment and key management
Confidential containers are a specialized use case. Most workloads do not require TEE-level protection and should use standard AKS node pools.
Cost Optimization Strategies
Spot Instances for Non-Critical Workloads
Spot node pools reduce costs by 60-90% compared to on-demand VMs. Combine spot and on-demand pools with topology spread constraints to balance cost and availability.
Cluster Start/Stop
AKS supports stopping and starting clusters to eliminate compute costs during non-business hours.
When cluster stop makes sense:
- Development and test clusters used only during business hours
- Batch processing clusters that run on a schedule
- Demo environments used infrequently
Limitations:
- Start time is 5-15 minutes depending on cluster size
- Not suitable for production workloads requiring continuous availability
Right-Sizing with Vertical Pod Autoscaler
Vertical Pod Autoscaler (VPA) recommends and automatically adjusts pod CPU and memory requests based on observed usage.
How VPA helps with cost optimization:
- Prevents over-provisioning by reducing requests for idle resources
- Prevents under-provisioning that causes OOMKills and throttling
- Continuously tunes requests as workload patterns change
VPA modes:
- Recommendation only: VPA suggests optimal requests but does not apply them
- Auto: VPA updates pod requests and restarts pods to apply changes
- Initial: VPA sets requests at pod creation but does not modify running pods
Trade-off: Auto mode restarts pods, causing brief downtime. Use recommendation mode for production workloads and apply changes during maintenance windows.
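A minimal sketch of a VPA object in recommendation-only mode, matching the guidance above (assumes the VPA components or the AKS VPA add-on are installed; the target Deployment name is hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommend only; read suggestions from the VPA status
```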
Reserved Instances for Stable Workloads
Azure Reserved Instances provide up to 72% savings for workloads running continuously for one or three years.
When to use Reserved Instances:
- Predictable workloads with known capacity requirements
- System node pools that run continuously
- Minimum baseline capacity for user node pools
When NOT to use Reserved Instances:
- Dynamic workloads with unpredictable scaling patterns
- Short-lived projects or proof-of-concepts
- Node pools expected to change VM SKUs frequently
Monitoring Cost with Azure Cost Management
Enable Azure Cost Management to track AKS costs by node pool, workload namespace, and resource type.
Key metrics to monitor:
- Cost per node pool (identify expensive GPU or high-memory pools)
- Cost per namespace (chargeback to teams)
- Idle resource waste (nodes with low utilization)
- Spot instance savings vs eviction frequency
Common Pitfalls
Pitfall 1: Using Standard Azure CNI for Large Clusters Without Sufficient IP Addresses
Problem: Deploying a large AKS cluster with standard Azure CNI in a subnet without enough IP addresses. Azure CNI assigns a VNet IP to every pod, consuming IP space rapidly.
Result: Cluster cannot scale because the subnet runs out of IPs. Pods enter a pending state with IP allocation errors.
Solution: Use Azure CNI Overlay or Azure CNI Powered by Cilium for large clusters. Calculate IP requirements as (max pods per node) x (max nodes) + node IPs before provisioning. Reserve a /22 or larger subnet for standard Azure CNI clusters.
Pitfall 2: Deploying Service Mesh Without Understanding Resource Overhead
Problem: Installing a service mesh like Istio on a cluster without accounting for sidecar resource consumption.
Result: Nodes run out of CPU and memory because every pod now has an Envoy sidecar consuming additional resources. Cluster autoscaler scales out, increasing costs unexpectedly.
Solution: Before enabling a service mesh, measure baseline resource usage and add sidecar overhead (typically 50-100 MB memory and 10-50m CPU per pod). Increase node pool size or adjust pod resource limits accordingly. Use lightweight meshes like Linkerd for resource-constrained environments.
Pitfall 3: KEDA Scaling to Zero Without Handling Cold Start Latency
Problem: Using KEDA to scale workloads to zero without accounting for pod startup time and readiness delays.
Result: First requests after scale-up fail or time out because the pod is not ready yet. Users experience errors during scale-up events.
Solution: Accept cold start latency as a trade-off for cost savings, or set KEDA minReplicaCount to 1 to keep at least one pod warm. Use readiness probes to ensure pods are fully ready before receiving traffic. For latency-sensitive workloads, use standard HPA instead of KEDA.
Pitfall 4: Mixing Workload Identity and Legacy Pod Identity
Problem: Migrating from Pod Identity to Workload Identity incrementally, leaving both systems running simultaneously.
Result: Conflicting authentication configurations cause pods to fail Azure authentication intermittently. Troubleshooting is difficult because errors vary based on which identity system the pod uses.
Solution: Plan a complete migration from Pod Identity to Workload Identity. Disable Pod Identity after migration completes. Use namespace-based phased migration if necessary, but avoid long-term coexistence.
Pitfall 5: Not Setting Pod Disruption Budgets with Cluster Autoscaler
Problem: Enabling Cluster Autoscaler without defining Pod Disruption Budgets (PDBs) for critical services.
Result: Autoscaler drains nodes aggressively during scale-down, terminating too many replicas simultaneously and causing service outages.
Solution: Define PDBs for all services specifying the minimum available replicas during disruptions. Example: minAvailable: 2 for a 3-replica deployment ensures at least 2 replicas remain during scale-down.
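A minimal sketch of that PDB (the label selector is hypothetical and must match the target Deployment's pod labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: production
spec:
  minAvailable: 2        # keep at least 2 of 3 replicas during voluntary disruptions
  selector:
    matchLabels:
      app: my-app
```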
Pitfall 6: GitOps Repository Structure Causing Deployment Bottlenecks
Problem: Using a monolithic Git repository for all applications, causing merge conflicts and bottlenecks when multiple teams deploy simultaneously.
Result: Deployments are delayed while teams resolve merge conflicts. Single repository becomes a coordination burden.
Solution: Separate Git repositories by team or by workload. Use Flux's multi-source capabilities to compose manifests from multiple repositories. Balance between too many repositories (management overhead) and too few (coordination bottleneck).
Key Takeaways
- Azure CNI Overlay and Cilium reduce IP consumption for large clusters. Standard Azure CNI assigns VNet IPs to every pod, which exhausts subnets quickly. Overlay networking decouples pod IPs from VNet address space, enabling clusters with 100,000+ pods without consuming VNet IPs.
- Service meshes add significant value but also complexity and cost. Istio provides powerful traffic management and zero-trust security, but the resource overhead and operational complexity are real. Evaluate whether application-level observability and network policies meet your needs before adopting a mesh.
- KEDA enables cost-effective event-driven workloads by scaling to zero. Integrating KEDA with Azure Service Bus, Event Hubs, or Storage Queues makes background processing workloads respond to demand dynamically while eliminating costs during idle periods. Handle cold start latency appropriately.
- GitOps with Flux provides auditability, consistency, and security. Using Git as the single source of truth for cluster configuration and application manifests eliminates kubectl-based drift, provides rollback capability, and enforces review processes for all changes.
- Workload Identity replaces Pod Identity with simpler, more secure Azure authentication. Federation-based authentication eliminates credential storage entirely and reduces the attack surface. Migrate legacy Pod Identity workloads to Workload Identity.
- Separate system and user node pools for stability and cost optimization. System node pools run AKS system components on reliable, always-on VMs. User node pools scale dynamically and can use spot instances for cost savings.
- Cluster Autoscaler is the stable autoscaling solution; watch Node Auto-Provisioning. Cluster Autoscaler scales predefined node pools based on pod scheduling. Node Auto-Provisioning dynamically creates optimal node pools but is still in preview. Use Cluster Autoscaler for production.
- Azure Fleet Manager simplifies multi-cluster management at scale. Centralized orchestration, GitOps at scale, and cross-cluster service discovery make Fleet Manager essential for multi-region deployments and disaster recovery scenarios.
- Spot node pools reduce costs by 60-90% for fault-tolerant workloads. Batch processing, CI/CD agents, and stateless applications benefit from spot instances. Distribute replicas across spot and on-demand pools to tolerate evictions.
- Cost optimization requires a combination of strategies. Use spot instances for fault-tolerant workloads, Reserved Instances for stable baselines, cluster start/stop for dev/test environments, and Vertical Pod Autoscaler for right-sizing. Monitor costs continuously with Azure Cost Management.