Performance Engineering


What is Performance Engineering?

Performance engineering is the systematic practice of ensuring systems meet performance requirements throughout their lifecycle—from design through operation. It encompasses defining performance requirements, measuring current performance, identifying bottlenecks, optimizing critical paths, and planning for future capacity.

Core principle: Performance is a feature, not an afterthought. Systems that address performance requirements early in design avoid costly rewrites later.

Performance vs optimization: Performance engineering focuses on meeting defined requirements. Premature optimization wastes effort on parts of the system that don’t matter. Performance engineering identifies what matters, measures it, and optimizes intentionally.

Defining Performance Requirements

Performance requirements must be specific, measurable, and tied to business or user needs.

Performance Characteristics

Characteristic             Definition                           Example Requirement
─────────────────────────────────────────────────────────────────────────────────────────────────────────
Latency                    Time to complete a single operation  95% of API requests complete in < 200ms
Throughput                 Operations per unit time             System handles 10,000 requests/second
Scalability                Performance under increasing load    Linear scaling to 100,000 concurrent users
Resource efficiency        Resource consumption per operation   < 100MB memory per request
Startup time               Time to become ready                 Service ready to accept traffic in < 30 seconds
Time to first byte (TTFB)  Time until first response byte       < 100ms TTFB for web pages

The Performance Requirements Framework

Step 1: Identify critical paths

Which user interactions matter most? Focus performance engineering on high-value, high-frequency operations.

Examples:

  • E-commerce: Product search, checkout, payment processing
  • Social media: Feed loading, posting, notifications
  • SaaS: Dashboard rendering, report generation, data exports

Step 2: Define success criteria

What latency is acceptable? What throughput must be supported? Use percentiles, not averages.

Why percentiles matter:

Average latency: 50ms (looks great!)

Actual distribution:
p50: 20ms
p95: 150ms
p99: 800ms    ← Many users experience this
p99.9: 5000ms ← Some users see this nightmare

Percentile selection guidance:

  • p50 (median): Typical user experience
  • p95: Good user experience threshold
  • p99: Acceptable worst case for most users
  • p99.9: Extreme outliers; often allowed to exceed the SLO
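
To make this concrete, percentiles can be computed directly from raw latency samples instead of relying on the mean. A minimal Python sketch (the sample data is illustrative, not taken from a real system):

def percentile(samples, p):
    """Return the p-th percentile (0-100) of a list of latency samples."""
    ordered = sorted(samples)
    index = round((p / 100) * (len(ordered) - 1))
    return ordered[index]

# Illustrative samples: mostly fast requests with a slow tail.
latencies_ms = [20] * 90 + [150] * 5 + [800] * 4 + [5000]

print("mean:", sum(latencies_ms) / len(latencies_ms))   # the mean hides the tail
print("p50: ", percentile(latencies_ms, 50))
print("p95: ", percentile(latencies_ms, 95))
print("p99: ", percentile(latencies_ms, 99))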

Step 3: Set performance budgets

Allocate latency across system components. Each component gets a portion of the total budget.

Example: 200ms API latency budget

Component               Budget
─────────────────────────────────
API gateway             20ms
Authentication          10ms
Business logic          50ms
Database query          80ms
External API call       30ms
Response serialization  10ms
─────────────────────────────────
Total                   200ms

Budget violations trigger optimization: If database queries consume 150ms, you’ve exceeded the budget and must optimize.
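
A budget like this can be checked mechanically against measured timings from tracing. A minimal sketch (component names mirror the example above; the measured values are hypothetical):

BUDGET_MS = {
    "api_gateway": 20,
    "authentication": 10,
    "business_logic": 50,
    "database_query": 80,
    "external_api_call": 30,
    "response_serialization": 10,
}

def check_budget(measured_ms):
    """Return components whose measured latency exceeds their budget."""
    return {
        component: (measured_ms.get(component, 0), budget)
        for component, budget in BUDGET_MS.items()
        if measured_ms.get(component, 0) > budget
    }

# Hypothetical measurements: the database query is over budget.
measured = {"api_gateway": 18, "authentication": 8, "business_logic": 45,
            "database_query": 150, "external_api_call": 25, "response_serialization": 9}
print(check_budget(measured))   # {'database_query': (150, 80)}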

Performance vs Other Characteristics

Performance competes with other architectural characteristics. Tradeoffs are inevitable.

Tradeoff                        Description                            When to Favor Performance
──────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance vs Maintainability  Optimized code is often complex        High-frequency code paths, critical user flows
Performance vs Security         Encryption and validation add latency  Only when security requirements allow
Performance vs Reliability      Retries and redundancy add latency     User-facing operations, real-time systems
Performance vs Cost             Faster hardware costs more             Revenue-generating features, competitive advantage
Performance vs Flexibility      Generic solutions are often slower     Stable, well-defined requirements

Guidance: Optimize intentionally. Don’t sacrifice maintainability for marginal performance gains in low-traffic code paths.

Measuring Performance

You cannot improve what you don’t measure. Establish baseline performance before optimizing.

Performance Profiling

Profiling identifies where time is spent in your system.

Profiling techniques:

CPU profiling: Measure which functions consume CPU time.

  • Identifies compute-bound bottlenecks
  • Shows hot paths (frequently executed code)
  • Reveals inefficient algorithms

Memory profiling: Track memory allocation and usage.

  • Identifies memory leaks
  • Shows allocation hot spots
  • Reveals unnecessary object creation

I/O profiling: Measure disk and network I/O.

  • Identifies blocking I/O operations
  • Shows unnecessary file reads/writes
  • Reveals network chattiness

Database profiling: Analyze query performance.

  • Identifies slow queries
  • Shows missing indexes
  • Reveals N+1 query problems

Profiling best practices:

  • Profile in production-like environments (not development laptops)
  • Use realistic data volumes
  • Profile under realistic load
  • Focus on representative user journeys
  • Profile both hot paths and outliers
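
As one example of CPU profiling, Python's built-in cProfile can show where time goes in a representative workload (the workload function here is a stand-in for your own code):

import cProfile
import pstats

def representative_workload():
    # Stand-in for a realistic user journey or request handler.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
representative_workload()
profiler.disable()

# Print the ten functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)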

Application Performance Monitoring (APM)

APM tools provide continuous performance visibility in production.

Key APM capabilities:

  • Transaction tracing: Track requests across services
  • Error tracking: Correlate errors with performance
  • Resource monitoring: CPU, memory, disk, network usage
  • Database monitoring: Query performance and connection pools
  • External service monitoring: Third-party API latency

APM alerts:

  • Alert when p95 latency exceeds SLO
  • Alert on error rate spikes
  • Alert on resource exhaustion (memory, connections)

Synthetic Monitoring

Simulate user interactions to measure performance continuously.

Synthetic transaction types:

  • Page load tests: Measure web page rendering time
  • API tests: Call endpoints and measure latency
  • User journey tests: Complete multi-step workflows

Benefits:

  • Detect performance regressions before users do
  • Measure performance from multiple geographic regions
  • Establish performance baselines over time
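
A synthetic API check can be as simple as a scheduled script that calls an endpoint and compares latency against a threshold. A minimal sketch using the requests library (the URL and threshold are placeholders):

import time
import requests

def synthetic_check(url, threshold_ms=200):
    start = time.perf_counter()
    response = requests.get(url, timeout=5)
    elapsed_ms = (time.perf_counter() - start) * 1000

    ok = response.status_code == 200 and elapsed_ms <= threshold_ms
    print(f"{url}: {response.status_code} in {elapsed_ms:.0f}ms ({'OK' if ok else 'ALERT'})")
    return ok

# Run on a schedule (cron, CI job, or monitoring agent) from several regions.
synthetic_check("https://example.com/api/health")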

Load Testing

Simulate production traffic to validate performance under stress.

Load test types:

Type         Purpose                                   Traffic Pattern       Duration
──────────────────────────────────────────────────────────────────────────────────────────
Smoke test   Verify system handles minimal load        Low, constant         Minutes
Load test    Validate expected production traffic      Realistic, sustained  Hours
Stress test  Find breaking point                       Gradually increasing  Until failure
Spike test   Handle sudden traffic surges              Instant spike         Minutes
Soak test    Detect memory leaks, resource exhaustion  Sustained high load   Days

Load testing best practices:

  • Test in production-like environment
  • Use realistic user behavior (think time, navigation patterns)
  • Include realistic data volumes
  • Measure latency percentiles, not just averages
  • Monitor resource utilization during tests
  • Identify bottlenecks before reaching capacity
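
As one way to express such a test, here is a minimal Locust scenario; the endpoints, task weights, and think times are assumptions to adapt to your own critical paths:

from locust import HttpUser, task, between

class ShopperUser(HttpUser):
    # Think time between actions, approximating real user behavior.
    wait_time = between(1, 3)

    @task(3)  # searches happen ~3x as often as product views
    def search_products(self):
        self.client.get("/products?q=shoes", name="/products [search]")

    @task(1)
    def view_product(self):
        self.client.get("/products/42", name="/products/:id")

# Run against a production-like environment, e.g.:
#   locust -f locustfile.py --host https://staging.example.com --users 500 --spawn-rate 20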

Performance Optimization Strategies

The Optimization Hierarchy

Optimize in this order for maximum impact:

1. Algorithmic complexity (biggest impact)

  • Replace O(n²) algorithms with O(n log n) or O(n) alternatives (see the sketch below)
  • Use appropriate data structures (hash map vs list)
  • Eliminate unnecessary computation

2. Database optimization (high impact)

  • Add indexes to frequently queried columns
  • Optimize query structure (avoid N+1 queries)
  • Use connection pooling
  • Cache query results

3. Caching (high impact, low effort)

  • Cache computed results
  • Cache database queries
  • Cache external API responses
  • Use CDN for static assets

4. Parallelization (moderate impact)

  • Process independent tasks concurrently
  • Use asynchronous I/O
  • Parallelize CPU-intensive operations

5. Micro-optimizations (low impact, high effort)

  • String concatenation optimizations
  • Reducing object allocations
  • Inline functions

Guidance: Measure before optimizing. Profile to find actual bottlenecks. Optimizing the wrong thing wastes effort.
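
The sketch below illustrates the first rung of the hierarchy: the same task, checking which requested IDs already exist, done with a nested scan versus a hash-based set (the data is illustrative):

# O(n * m): scans the whole list for every requested ID.
def existing_ids_slow(requested_ids, known_ids):
    return [i for i in requested_ids if i in known_ids]   # list membership is O(m)

# O(n + m): one pass to build a hash set, then O(1) lookups.
def existing_ids_fast(requested_ids, known_ids):
    known = set(known_ids)
    return [i for i in requested_ids if i in known]

known_ids = list(range(100_000))
requested = [5, 99_999, 150_000]
assert existing_ids_slow(requested, known_ids) == existing_ids_fast(requested, known_ids)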

Caching Strategies

Caching is the highest ROI optimization for most systems.

What to cache:

  • Database query results
  • Computed values (aggregations, calculations)
  • External API responses
  • Rendered HTML or JSON
  • Static assets (images, CSS, JavaScript)

Caching patterns (see Performance Scalability Patterns for detailed implementations):

  • Cache-aside: Application manages cache explicitly
  • Read-through: Cache automatically loads missing data
  • Write-through: Writes go to cache and database simultaneously
  • Write-behind: Writes batched and flushed asynchronously

Cache invalidation strategies:

  • Time-to-live (TTL): Expire entries after fixed duration
  • Event-based: Invalidate when underlying data changes
  • Manual: Application explicitly invalidates entries
  • Least Recently Used (LRU): Evict oldest unused entries when full
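
A minimal cache-aside sketch with TTL expiry, using an in-process dictionary; a real deployment would typically use a shared cache such as Redis or Memcached, and the fetch function here is a placeholder:

import time

_cache = {}          # key -> (value, expires_at)
TTL_SECONDS = 60

def get_user(user_id, fetch_from_db):
    entry = _cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                                   # cache hit

    value = fetch_from_db(user_id)                        # cache miss: load from source
    _cache[user_id] = (value, time.time() + TTL_SECONDS)  # populate with TTL
    return value

def invalidate_user(user_id):
    _cache.pop(user_id, None)   # event-based invalidation when the user record changes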

Caching anti-patterns:

  • Caching everything (wastes memory on rarely-accessed data)
  • Caching mutable data without invalidation (serves stale data)
  • Setting TTL too long (stale data) or too short (cache thrashing)
  • Ignoring cache stampede (many requests simultaneously fetch same missing data)

Cache stampede mitigation:

  • Use distributed locks to ensure only one request fetches missing data
  • Serve stale data while refreshing in background
  • Use probabilistic early expiration (refresh before TTL expires)
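
A common implementation of the first mitigation is a per-key lock, so only one caller loads a missing entry while the rest wait and reuse the result. A minimal in-process sketch (a distributed system would use a distributed lock or single-flight layer instead):

import threading
import time

_cache = {}           # key -> (value, expires_at)
_locks = {}
_locks_guard = threading.Lock()
TTL_SECONDS = 60

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get(key, loader):
    entry = _cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]

    with _lock_for(key):                      # only one thread loads a given key
        entry = _cache.get(key)               # re-check: another thread may have filled it
        if entry and entry[1] > time.time():
            return entry[0]
        value = loader(key)
        _cache[key] = (value, time.time() + TTL_SECONDS)
        return value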

Database Optimization

Databases are often the bottleneck. Optimize queries before adding hardware.

Query optimization techniques:

Indexing:

  • Add indexes to frequently queried columns
  • Use covering indexes (index includes all selected columns)
  • Avoid over-indexing (indexes slow down writes)
  • Monitor index usage, remove unused indexes

Query structure:

  • Avoid SELECT * (fetch only needed columns)
  • Use appropriate JOIN types
  • Filter early (WHERE before JOIN when possible)
  • Limit result sets (pagination)

N+1 query elimination:

Problem:
SELECT * FROM orders WHERE user_id = 123;
For each order:
  SELECT * FROM items WHERE order_id = ?;  ← N queries

Solution:
SELECT * FROM orders WHERE user_id = 123;
SELECT * FROM items WHERE order_id IN (?, ?, ?);  ← 1 query for all items

Connection pooling:

  • Reuse database connections instead of opening new ones
  • Set appropriate pool size (typically 10-50 connections)
  • Monitor pool utilization, adjust based on load
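
For example, with SQLAlchemy the pool is configured on the engine; the DSN and numbers below are placeholders to tune against your own load tests:

from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://app:secret@db.internal/orders",  # placeholder DSN
    pool_size=20,        # steady-state connections kept open
    max_overflow=10,     # extra connections allowed during bursts
    pool_timeout=5,      # seconds to wait for a free connection before failing
    pool_recycle=1800,   # recycle connections periodically to avoid stale sockets
    pool_pre_ping=True,  # validate a connection before handing it out
)

with engine.connect() as conn:
    row = conn.execute(text("SELECT 1")).scalar_one()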

Read replicas:

  • Route read queries to replicas
  • Reserve primary database for writes
  • Accept eventual consistency for read replicas

Database sharding (for extreme scale):

  • Partition data across multiple databases
  • Shard by tenant, geography, or hash key
  • Increases complexity, only use when necessary

Asynchronous Processing

Move non-critical work out of the request path.

When to use async processing:

  • Email sending
  • Report generation
  • Image processing
  • Data exports
  • Batch operations
  • Non-critical third-party API calls

Async patterns:

Message queues: Decouple request handling from background processing.

User request → API → Queue → Background worker
                  ↓
              Immediate response (202 Accepted)
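
For the message-queue pattern above, a minimal sketch using Celery; the broker URL and task body are placeholders:

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker

@app.task
def send_order_confirmation(order_id):
    # Slow work (templating, SMTP calls) runs in a background worker,
    # not in the request path.
    ...

# In the API handler: enqueue the job and return 202 Accepted immediately.
send_order_confirmation.delay(order_id=123)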

Event-driven architecture: Emit events, process asynchronously.

Order placed → OrderPlacedEvent → [Email worker, Analytics worker, Inventory worker]

Scheduled jobs: Batch process data at off-peak times.

Nightly job aggregates daily metrics
Hourly job syncs data to warehouse

Benefits:

  • Faster response times (no waiting for slow operations)
  • Better resilience (queue buffers traffic spikes)
  • Independent scaling (scale workers separately from API)

Tradeoffs:

  • Eventual consistency (work happens later)
  • Increased complexity (distributed system with queues)
  • Debugging difficulty (async failures are harder to trace)

Compression

Reduce data transfer size.

What to compress:

  • API responses (gzip, brotli)
  • Static assets (CSS, JavaScript)
  • Large payloads (file uploads, exports)
  • Log files

Compression tradeoffs:

  • CPU cost: Compression uses CPU cycles
  • Latency: Adds compression/decompression time
  • Bandwidth savings: Reduces network transfer time

When compression wins:

  • Large payloads (> 1KB)
  • Slow networks (mobile, international)
  • Compressible content (text, JSON, HTML)

When compression loses:

  • Small payloads (overhead exceeds savings)
  • Fast networks (compression time > transfer savings)
  • Already compressed content (images, videos)
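
Most web frameworks and reverse proxies handle compression for you, but the tradeoff is easy to see directly. A small sketch with Python's gzip module (the payload is illustrative):

import gzip
import json

payload = json.dumps([{"id": i, "status": "shipped"} for i in range(500)]).encode()
compressed = gzip.compress(payload)
print(f"original: {len(payload)} bytes, gzipped: {len(compressed)} bytes")

# Skip compression when the payload is too small to benefit.
MIN_COMPRESS_BYTES = 1024
body = gzip.compress(payload) if len(payload) >= MIN_COMPRESS_BYTES else payload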

Content Delivery Networks (CDN)

Serve static content from edge locations near users.

What to serve from CDN:

  • Images, videos, audio
  • CSS, JavaScript
  • Fonts
  • API responses (for cacheable endpoints)

CDN benefits:

  • Reduced latency (geographically closer to users)
  • Reduced origin load (CDN handles static content)
  • DDoS protection (CDN absorbs traffic)

CDN considerations:

  • Cache invalidation strategy
  • Cache hit rate monitoring
  • Cost (bandwidth, requests)

Capacity Planning

Capacity planning ensures systems handle future growth without performance degradation.

Capacity Planning Methodology

Step 1: Establish current capacity

Measure current throughput and resource utilization at normal and peak load.

Metrics to collect:

  • Requests per second
  • CPU utilization
  • Memory utilization
  • Database connections
  • Network bandwidth
  • Disk I/O

Step 2: Project growth

Estimate future traffic based on business projections.

Growth patterns:

  • Linear growth: Steady user acquisition (10% per quarter)
  • Seasonal spikes: Holiday shopping, tax season
  • Event-driven spikes: Product launches, viral content
  • Step changes: New market entry, major feature launch

Step 3: Identify constraints

Which resource will be exhausted first?

Common constraints:

  • CPU capacity
  • Memory limits
  • Database connection pool
  • Network bandwidth
  • Storage capacity

Step 4: Plan capacity increases

Add capacity before constraints are reached.

Capacity increase strategies:

  • Vertical scaling: Larger instances (quick to apply, but bounded by hardware limits)
  • Horizontal scaling: More instances (sustainable, scales much further)
  • Architectural changes: Caching, sharding, async processing

Step 5: Test capacity

Load test with projected future traffic to validate capacity plan.

Autoscaling

Automatically adjust capacity based on demand.

Autoscaling triggers:

  • CPU utilization: Scale when CPU > 70%
  • Request queue depth: Scale when queue > 100 requests
  • Custom metrics: Scale based on business metrics (orders/minute)

Autoscaling considerations:

  • Scale-up delay: Time to provision and start new instances
  • Scale-down caution: Don’t scale down too aggressively
  • Warm-up period: New instances may need time to reach full capacity
  • Cost implications: Autoscaling can increase costs unexpectedly

Autoscaling best practices:

  • Set minimum and maximum instance counts
  • Use predictive scaling for known traffic patterns
  • Test scale-up and scale-down scenarios
  • Monitor scaling events and costs
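
The core scaling decision is a small calculation. The sketch below mirrors the proportional rule used by systems such as the Kubernetes Horizontal Pod Autoscaler, with minimum and maximum bounds (the numbers are assumptions):

import math

def desired_instances(current, cpu_percent, target_cpu=70,
                      min_instances=2, max_instances=20):
    """Scale so that average CPU lands near the target, within fixed bounds."""
    desired = math.ceil(current * cpu_percent / target_cpu)
    return max(min_instances, min(max_instances, desired))

print(desired_instances(current=4, cpu_percent=90))   # 6 -> scale out
print(desired_instances(current=4, cpu_percent=30))   # 2 -> scale in, floored at the minimum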

Performance Testing Methodology

Performance testing validates that systems meet requirements under realistic conditions.

Test Environment Setup

Production parity:

  • Same instance types and configurations
  • Same database size and schema
  • Same network topology
  • Same third-party integrations (or realistic mocks)

Data volume:

  • Use production-scale data
  • Include data distribution (new users, power users)
  • Account for data growth over test duration

Test isolation:

  • Dedicated environment (not shared with other testing)
  • Isolated from production (no accidental traffic)

Test Design

Define test scenarios:

  • Identify critical user journeys
  • Model realistic user behavior (think time, navigation)
  • Include representative distribution of operations (reads, writes)

Set acceptance criteria:

  • Latency percentiles (p50, p95, p99)
  • Error rate thresholds (< 0.1%)
  • Resource utilization limits (CPU < 80%)

Ramp-up strategy:

  • Start with low load
  • Gradually increase to target load
  • Observe system behavior at each step
  • Identify point where performance degrades

Test Execution

Monitor during test:

  • Application latency (all percentiles)
  • Error rates (by type)
  • Resource utilization (CPU, memory, network, disk)
  • Database performance (query times, connection pool)
  • Third-party API latency

Identify bottlenecks:

  • CPU-bound: CPU utilization high, adding instances helps
  • Memory-bound: Memory exhausted, larger instances or caching helps
  • I/O-bound: Disk or network saturated, optimize I/O or scale storage
  • Database-bound: Database queries slow, optimize queries or scale database

Iterate and optimize:

  • Fix identified bottlenecks
  • Re-run tests to validate improvements
  • Repeat until acceptance criteria met

Test Analysis

Performance regression detection:

  • Compare current test results to baseline
  • Alert on latency increase (p95 > 10% slower)
  • Alert on throughput decrease (handles 10% fewer requests)
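
A regression gate can be a few lines in the test pipeline; the 10% tolerance below matches the alert thresholds above and is an assumption to tune:

def check_regression(baseline, current, tolerance=0.10):
    """Report latency metrics that regressed by more than the tolerance."""
    failures = []
    for metric, baseline_value in baseline.items():
        if current[metric] > baseline_value * (1 + tolerance):
            failures.append(f"{metric}: {current[metric]} vs baseline {baseline_value}")
    return failures

baseline = {"p95_ms": 180, "p99_ms": 420}       # from the last accepted run
current = {"p95_ms": 210, "p99_ms": 430}        # from this run
for failure in check_regression(baseline, current):
    print("REGRESSION:", failure)               # in CI, exit non-zero instead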

Trend analysis:

  • Track performance over multiple releases
  • Identify gradual degradation
  • Correlate performance changes with code changes

Performance Patterns and Anti-Patterns

Patterns for High Performance

Request coalescing: Batch multiple requests into single operation.

  • Combine multiple database queries into one query
  • Batch API calls to external services
  • Use GraphQL to fetch multiple resources in one request

Lazy loading: Defer loading data until actually needed.

  • Don’t fetch related entities unless accessed
  • Paginate large result sets
  • Load images on scroll (web applications)

Data denormalization: Trade storage for query performance.

  • Duplicate frequently accessed data
  • Pre-compute aggregations
  • Store derived data alongside source data

Connection pooling: Reuse expensive connections.

  • Database connections
  • HTTP connections to external services
  • Thread pools for async operations

Bloom filters: Quickly check for non-existence.

  • Avoid expensive lookups when data doesn’t exist
  • Cache negative results (item not found)
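
A minimal Bloom filter sketch; real systems usually rely on a library or a cache-side implementation, and the sizing here is illustrative:

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present (false positives happen).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("user:123")
if not seen.might_contain("user:999"):
    pass  # definitely absent: skip the expensive database lookup entirely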

Performance Anti-Patterns

Premature optimization: Optimizing before measuring.

  • Solution: Profile first, optimize actual bottlenecks

Over-fetching: Retrieving more data than needed.

  • Solution: Fetch only required columns, paginate results

Blocking I/O on critical path: Waiting for slow operations.

  • Solution: Use async I/O, move work to background jobs

Inefficient serialization: Slow JSON/XML parsing.

  • Solution: Use binary formats (Protocol Buffers, MessagePack) for internal APIs

Unbounded resource consumption: No limits on memory, connections.

  • Solution: Set connection pool limits, implement backpressure

Ignoring caching opportunities: Repeatedly computing same results.

  • Solution: Cache computed results, use memoization

Death by a thousand cuts: Many small inefficiencies compound.

  • Solution: Profile to identify cumulative impact, fix most significant first

Performance Culture

Performance engineering is most effective when embedded in team culture.

Performance Budgets in Development

Set performance budgets for features:

  • New feature cannot degrade p95 latency by > 10ms
  • Bundle size cannot exceed 200KB (web applications)
  • API response time must stay under 200ms

Enforce budgets in CI/CD:

  • Run performance tests in CI pipeline
  • Fail builds that exceed budgets
  • Require performance review for risky changes

Performance Reviews

Include performance in code reviews:

  • Review database queries for efficiency
  • Check for N+1 queries
  • Validate caching strategy
  • Question synchronous calls to external services

Performance retrospectives:

  • Review performance incidents
  • Identify systemic performance issues
  • Prioritize performance improvements

Continuous Performance Monitoring

Track performance metrics over time:

  • Dashboard showing latency trends
  • Alert on performance regressions
  • Correlate deployments with performance changes

Performance goals in planning:

  • Allocate sprints to performance improvements
  • Balance features with performance work
  • Use error budgets to guide prioritization

Key Takeaways

Define performance requirements early: Specify latency, throughput, and scalability targets before building. Use percentiles, not averages.

Measure before optimizing: Profile to identify actual bottlenecks. Don’t waste effort optimizing code that doesn’t impact performance.

Optimize in priority order: Algorithmic improvements and caching provide the highest ROI. Micro-optimizations rarely matter.

Performance budgets guide decisions: Allocate latency across components. Exceeding budget triggers optimization work.

Caching is the highest-impact optimization: Cache database queries, computed results, and external API responses. Manage cache invalidation carefully.

Database optimization is critical: Add indexes, eliminate N+1 queries, use connection pooling. Optimize queries before scaling hardware.

Async processing improves responsiveness: Move non-critical work out of request path. Use message queues and background workers.

Capacity planning prevents outages: Project growth, identify constraints, add capacity before limits are reached. Test capacity plans with load testing.

Load testing validates performance: Test under realistic load in production-like environments. Monitor all layers of the stack during tests.

Performance engineering is continuous: Monitor performance in production, detect regressions, iterate on improvements. Embed performance in development culture.
