Performance Engineering


What is Performance Engineering?

Performance engineering is the systematic practice of ensuring systems meet performance requirements throughout their lifecycle—from design through operation. It encompasses defining performance requirements, measuring current performance, identifying bottlenecks, optimizing critical paths, and planning for future capacity.

Core principle: Performance is a feature, not an afterthought. Systems that address performance requirements early in design avoid costly rewrites later.

Performance vs optimization: Performance engineering focuses on meeting defined requirements. Premature optimization wastes effort on parts of the system that don’t matter. Performance engineering identifies what matters, measures it, and optimizes intentionally.

Defining Performance Requirements

Performance requirements must be specific, measurable, and tied to business or user needs.

Performance Characteristics

Characteristic             Definition                           Example Requirement
─────────────────────────────────────────────────────────────────────────────────────────────────────────
Latency                    Time to complete a single operation  95% of API requests complete in < 200ms
Throughput                 Operations per unit time             System handles 10,000 requests/second
Scalability                Performance under increasing load    Linear scaling to 100,000 concurrent users
Resource efficiency        Resource consumption per operation   < 100MB memory per request
Startup time               Time to become ready                 Service ready to accept traffic in < 30 seconds
Time to first byte (TTFB)  Time until first response byte       < 100ms TTFB for web pages

The Performance Requirements Framework

Step 1: Identify critical paths

Which user interactions matter most? Focus performance engineering on high-value, high-frequency operations.

Examples:

  • E-commerce: Product search, checkout, payment processing
  • Social media: Feed loading, posting, notifications
  • SaaS: Dashboard rendering, report generation, data exports

Step 2: Define success criteria

What latency is acceptable? What throughput must be supported? Use percentiles, not averages.

Why percentiles matter:

Average latency: 50ms (looks great!)

Actual distribution:
p50: 20ms
p95: 150ms
p99: 800ms    ← Many users experience this
p99.9: 5000ms ← Some users see this nightmare

Percentile selection guidance:

  • p50 (median): Typical user experience
  • p95: Good user experience threshold
  • p99: Acceptable worst case for most users
  • p99.9: Extreme outliers; often allowed to exceed the SLO
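
To make this concrete, percentiles can be computed directly from raw latency samples instead of relying on the mean. A minimal Python sketch (the sample data is illustrative, not taken from a real system):

def percentile(samples, p):
    """Return the p-th percentile (0-100) of a list of latency samples."""
    ordered = sorted(samples)
    index = round((p / 100) * (len(ordered) - 1))
    return ordered[index]

# Illustrative samples: mostly fast requests with a slow tail.
latencies_ms = [20] * 90 + [150] * 5 + [800] * 4 + [5000]

print("mean:", sum(latencies_ms) / len(latencies_ms))   # the mean hides the tail
print("p50: ", percentile(latencies_ms, 50))
print("p95: ", percentile(latencies_ms, 95))
print("p99: ", percentile(latencies_ms, 99))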

Step 3: Set performance budgets

Allocate latency across system components. Each component gets a portion of the total budget.

Example: 200ms API latency budget

Component               Budget
─────────────────────────────────
API gateway             20ms
Authentication          10ms
Business logic          50ms
Database query          80ms
External API call       30ms
Response serialization  10ms
─────────────────────────────────
Total                   200ms

Budget violations trigger optimization: If database queries consume 150ms, you’ve exceeded the budget and must optimize.
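
A budget like this can be checked mechanically against measured timings from tracing. A minimal sketch (component names mirror the example above; the measured values are hypothetical):

BUDGET_MS = {
    "api_gateway": 20,
    "authentication": 10,
    "business_logic": 50,
    "database_query": 80,
    "external_api_call": 30,
    "response_serialization": 10,
}

def check_budget(measured_ms):
    """Return components whose measured latency exceeds their budget."""
    return {
        component: (measured_ms.get(component, 0), budget)
        for component, budget in BUDGET_MS.items()
        if measured_ms.get(component, 0) > budget
    }

# Hypothetical measurements: the database query is over budget.
measured = {"api_gateway": 18, "authentication": 8, "business_logic": 45,
            "database_query": 150, "external_api_call": 25, "response_serialization": 9}
print(check_budget(measured))   # {'database_query': (150, 80)}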

Performance vs Other Characteristics

Performance competes with other architectural characteristics. Tradeoffs are inevitable.

Tradeoff                        Description                            When to Favor Performance
──────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance vs Maintainability  Optimized code is often complex        High-frequency code paths, critical user flows
Performance vs Security         Encryption and validation add latency  Only when security requirements allow
Performance vs Reliability      Retries and redundancy add latency     User-facing operations, real-time systems
Performance vs Cost             Faster hardware costs more             Revenue-generating features, competitive advantage
Performance vs Flexibility      Generic solutions are often slower     Stable, well-defined requirements

Guidance: Optimize intentionally. Don’t sacrifice maintainability for marginal performance gains in low-traffic code paths.

Measuring Performance

You cannot improve what you don’t measure. Establish baseline performance before optimizing.

Performance Profiling

Profiling identifies where time is spent in your system.

Profiling techniques:

CPU profiling: Measure which functions consume CPU time.

  • Identifies compute-bound bottlenecks
  • Shows hot paths (frequently executed code)
  • Reveals inefficient algorithms

Memory profiling: Track memory allocation and usage.

  • Identifies memory leaks
  • Shows allocation hot spots
  • Reveals unnecessary object creation

I/O profiling: Measure disk and network I/O.

  • Identifies blocking I/O operations
  • Shows unnecessary file reads/writes
  • Reveals network chattiness

Database profiling: Analyze query performance.

  • Identifies slow queries
  • Shows missing indexes
  • Reveals N+1 query problems

Profiling best practices:

  • Profile in production-like environments (not development laptops)
  • Use realistic data volumes
  • Profile under realistic load
  • Focus on representative user journeys
  • Profile both hot paths and outliers
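
As one example of CPU profiling, Python's built-in cProfile can show where time goes in a representative workload (the workload function here is a stand-in for your own code):

import cProfile
import pstats

def representative_workload():
    # Stand-in for a realistic user journey or request handler.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
representative_workload()
profiler.disable()

# Print the ten functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)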

Application Performance Monitoring (APM)

APM tools provide continuous performance visibility in production.

Key APM capabilities:

  • Transaction tracing: Track requests across services
  • Error tracking: Correlate errors with performance
  • Resource monitoring: CPU, memory, disk, network usage
  • Database monitoring: Query performance and connection pools
  • External service monitoring: Third-party API latency

APM alerts:

  • Alert when p95 latency exceeds SLO
  • Alert on error rate spikes
  • Alert on resource exhaustion (memory, connections)

Synthetic Monitoring

Simulate user interactions to measure performance continuously.

Synthetic transaction types:

  • Page load tests: Measure web page rendering time
  • API tests: Call endpoints and measure latency
  • User journey tests: Complete multi-step workflows

Benefits:

  • Detect performance regressions before users do
  • Measure performance from multiple geographic regions
  • Establish performance baselines over time
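
A synthetic API check can be as simple as a scheduled script that calls an endpoint and compares latency against a threshold. A minimal sketch using the requests library (the URL and threshold are placeholders):

import time
import requests

def synthetic_check(url, threshold_ms=200):
    start = time.perf_counter()
    response = requests.get(url, timeout=5)
    elapsed_ms = (time.perf_counter() - start) * 1000

    ok = response.status_code == 200 and elapsed_ms <= threshold_ms
    print(f"{url}: {response.status_code} in {elapsed_ms:.0f}ms ({'OK' if ok else 'ALERT'})")
    return ok

# Run on a schedule (cron, CI job, or monitoring agent) from several regions.
synthetic_check("https://example.com/api/health")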

Load Testing

Simulate production traffic to validate performance under stress.

Load test types:

Type         Purpose                                   Traffic Pattern       Duration
──────────────────────────────────────────────────────────────────────────────────────────
Smoke test   Verify system handles minimal load        Low, constant         Minutes
Load test    Validate expected production traffic      Realistic, sustained  Hours
Stress test  Find breaking point                       Gradually increasing  Until failure
Spike test   Handle sudden traffic surges              Instant spike         Minutes
Soak test    Detect memory leaks, resource exhaustion  Sustained high load   Days

Load testing best practices:

  • Test in production-like environment
  • Use realistic user behavior (think time, navigation patterns)
  • Include realistic data volumes
  • Measure latency percentiles, not just averages
  • Monitor resource utilization during tests
  • Identify bottlenecks before reaching capacity
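
As one way to express such a test, here is a minimal Locust scenario; the endpoints, task weights, and think times are assumptions to adapt to your own critical paths:

from locust import HttpUser, task, between

class ShopperUser(HttpUser):
    # Think time between actions, approximating real user behavior.
    wait_time = between(1, 3)

    @task(3)  # searches happen ~3x as often as product views
    def search_products(self):
        self.client.get("/products?q=shoes", name="/products [search]")

    @task(1)
    def view_product(self):
        self.client.get("/products/42", name="/products/:id")

# Run against a production-like environment, e.g.:
#   locust -f locustfile.py --host https://staging.example.com --users 500 --spawn-rate 20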

Performance Optimization Strategies

The Optimization Hierarchy

Optimize in this order for maximum impact:

1. Algorithmic complexity (biggest impact)

  • Replace O(n²) algorithms with O(n log n) or O(n) alternatives (see the sketch below)
  • Use appropriate data structures (hash map vs list)
  • Eliminate unnecessary computation

2. Database optimization (high impact)

  • Add indexes to frequently queried columns
  • Optimize query structure (avoid N+1 queries)
  • Use connection pooling
  • Cache query results

3. Caching (high impact, low effort)

  • Cache computed results
  • Cache database queries
  • Cache external API responses
  • Use CDN for static assets

4. Parallelization (moderate impact)

  • Process independent tasks concurrently
  • Use asynchronous I/O
  • Parallelize CPU-intensive operations

5. Micro-optimizations (low impact, high effort)

  • String concatenation optimizations
  • Reducing object allocations
  • Inline functions

Guidance: Measure before optimizing. Profile to find actual bottlenecks. Optimizing the wrong thing wastes effort.
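
The sketch below illustrates the first rung of the hierarchy: the same task, checking which requested IDs already exist, done with a nested scan versus a hash-based set (the data is illustrative):

# O(n * m): scans the whole list for every requested ID.
def existing_ids_slow(requested_ids, known_ids):
    return [i for i in requested_ids if i in known_ids]   # list membership is O(m)

# O(n + m): one pass to build a hash set, then O(1) lookups.
def existing_ids_fast(requested_ids, known_ids):
    known = set(known_ids)
    return [i for i in requested_ids if i in known]

known_ids = list(range(100_000))
requested = [5, 99_999, 150_000]
assert existing_ids_slow(requested, known_ids) == existing_ids_fast(requested, known_ids)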

Caching Strategies

Caching is the highest ROI optimization for most systems.

What to cache:

  • Database query results
  • Computed values (aggregations, calculations)
  • External API responses
  • Rendered HTML or JSON
  • Static assets (images, CSS, JavaScript)

Caching patterns (see Performance Scalability Patterns for detailed implementations):

  • Cache-aside: Application manages cache explicitly
  • Read-through: Cache automatically loads missing data
  • Write-through: Writes go to cache and database simultaneously
  • Write-behind: Writes batched and flushed asynchronously

Cache invalidation strategies:

  • Time-to-live (TTL): Expire entries after fixed duration
  • Event-based: Invalidate when underlying data changes
  • Manual: Application explicitly invalidates entries
  • Least Recently Used (LRU): Evict oldest unused entries when full
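
A minimal cache-aside sketch with TTL expiry, using an in-process dictionary; a real deployment would typically use a shared cache such as Redis or Memcached, and the fetch function here is a placeholder:

import time

_cache = {}          # key -> (value, expires_at)
TTL_SECONDS = 60

def get_user(user_id, fetch_from_db):
    entry = _cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                                   # cache hit

    value = fetch_from_db(user_id)                        # cache miss: load from source
    _cache[user_id] = (value, time.time() + TTL_SECONDS)  # populate with TTL
    return value

def invalidate_user(user_id):
    _cache.pop(user_id, None)   # event-based invalidation when the user record changes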

Caching anti-patterns:

  • Caching everything (wastes memory on rarely-accessed data)
  • Caching mutable data without invalidation (serves stale data)
  • Setting TTL too long (stale data) or too short (cache thrashing)
  • Ignoring cache stampede (many requests simultaneously fetch same missing data)

Cache stampede mitigation:

  • Use distributed locks to ensure only one request fetches missing data
  • Serve stale data while refreshing in background
  • Use probabilistic early expiration (refresh before TTL expires)
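
A common implementation of the first mitigation is a per-key lock, so only one caller loads a missing entry while the rest wait and reuse the result. A minimal in-process sketch (a distributed system would use a distributed lock or single-flight layer instead):

import threading
import time

_cache = {}           # key -> (value, expires_at)
_locks = {}
_locks_guard = threading.Lock()
TTL_SECONDS = 60

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get(key, loader):
    entry = _cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]

    with _lock_for(key):                      # only one thread loads a given key
        entry = _cache.get(key)               # re-check: another thread may have filled it
        if entry and entry[1] > time.time():
            return entry[0]
        value = loader(key)
        _cache[key] = (value, time.time() + TTL_SECONDS)
        return value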

Database Optimization

Databases are often the bottleneck. Optimize queries before adding hardware.

Query optimization techniques:

Indexing:

  • Add indexes to frequently queried columns
  • Use covering indexes (index includes all selected columns)
  • Avoid over-indexing (indexes slow down writes)
  • Monitor index usage, remove unused indexes

Query structure:

  • Avoid SELECT * (fetch only needed columns)
  • Use appropriate JOIN types
  • Filter early (WHERE before JOIN when possible)
  • Limit result sets (pagination)

N+1 query elimination:

Problem:
SELECT * FROM orders WHERE user_id = 123;
For each order:
  SELECT * FROM items WHERE order_id = ?;  ← N queries

Solution:
SELECT * FROM orders WHERE user_id = 123;
SELECT * FROM items WHERE order_id IN (?, ?, ?);  ← 1 query for all items

Connection pooling:

  • Reuse database connections instead of opening new ones
  • Set appropriate pool size (typically 10-50 connections)
  • Monitor pool utilization, adjust based on load
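
For example, with SQLAlchemy the pool is configured on the engine; the DSN and numbers below are placeholders to tune against your own load tests:

from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://app:secret@db.internal/orders",  # placeholder DSN
    pool_size=20,        # steady-state connections kept open
    max_overflow=10,     # extra connections allowed during bursts
    pool_timeout=5,      # seconds to wait for a free connection before failing
    pool_recycle=1800,   # recycle connections periodically to avoid stale sockets
    pool_pre_ping=True,  # validate a connection before handing it out
)

with engine.connect() as conn:
    row = conn.execute(text("SELECT 1")).scalar_one()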

Read replicas:

  • Route read queries to replicas
  • Reserve primary database for writes
  • Accept eventual consistency for read replicas

Database sharding (for extreme scale):

  • Partition data across multiple databases
  • Shard by tenant, geography, or hash key
  • Increases complexity, only use when necessary

Asynchronous Processing

Move non-critical work out of the request path.

When to use async processing:

  • Email sending
  • Report generation
  • Image processing
  • Data exports
  • Batch operations
  • Non-critical third-party API calls

Async patterns:

Message queues: Decouple request handling from background processing.

User request → API → Queue → Background worker
                  ↓
              Immediate response (202 Accepted)
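
For the message-queue pattern above, a minimal sketch using Celery; the broker URL and task body are placeholders:

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker

@app.task
def send_order_confirmation(order_id):
    # Slow work (templating, SMTP calls) runs in a background worker,
    # not in the request path.
    ...

# In the API handler: enqueue the job and return 202 Accepted immediately.
send_order_confirmation.delay(order_id=123)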

Event-driven architecture: Emit events, process asynchronously.

Order placed → OrderPlacedEvent → [Email worker, Analytics worker, Inventory worker]

Scheduled jobs: Batch process data at off-peak times.

Nightly job aggregates daily metrics
Hourly job syncs data to warehouse

Benefits:

  • Faster response times (no waiting for slow operations)
  • Better resilience (queue buffers traffic spikes)
  • Independent scaling (scale workers separately from API)

Tradeoffs:

  • Eventual consistency (work happens later)
  • Increased complexity (distributed system with queues)
  • Debugging difficulty (async failures are harder to trace)

Compression

Reduce data transfer size.

What to compress:

  • API responses (gzip, brotli)
  • Static assets (CSS, JavaScript)
  • Large payloads (file uploads, exports)
  • Log files

Compression tradeoffs:

  • CPU cost: Compression uses CPU cycles
  • Latency: Adds compression/decompression time
  • Bandwidth savings: Reduces network transfer time

When compression wins:

  • Large payloads (> 1KB)
  • Slow networks (mobile, international)
  • Compressible content (text, JSON, HTML)

When compression loses:

  • Small payloads (overhead exceeds savings)
  • Fast networks (compression time > transfer savings)
  • Already compressed content (images, videos)
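
Most web frameworks and reverse proxies handle compression for you, but the tradeoff is easy to see directly. A small sketch with Python's gzip module (the payload is illustrative):

import gzip
import json

payload = json.dumps([{"id": i, "status": "shipped"} for i in range(500)]).encode()
compressed = gzip.compress(payload)
print(f"original: {len(payload)} bytes, gzipped: {len(compressed)} bytes")

# Skip compression when the payload is too small to benefit.
MIN_COMPRESS_BYTES = 1024
body = gzip.compress(payload) if len(payload) >= MIN_COMPRESS_BYTES else payload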

Content Delivery Networks (CDN)

Serve static content from edge locations near users.

What to serve from CDN:

  • Images, videos, audio
  • CSS, JavaScript
  • Fonts
  • API responses (for cacheable endpoints)

CDN benefits:

  • Reduced latency (geographically closer to users)
  • Reduced origin load (CDN handles static content)
  • DDoS protection (CDN absorbs traffic)

CDN considerations:

  • Cache invalidation strategy
  • Cache hit rate monitoring
  • Cost (bandwidth, requests)

Capacity Planning

Capacity planning ensures systems handle future growth without performance degradation.

Capacity Planning Methodology

Step 1: Establish current capacity

Measure current throughput and resource utilization at normal and peak load.

Metrics to collect:

  • Requests per second
  • CPU utilization
  • Memory utilization
  • Database connections
  • Network bandwidth
  • Disk I/O

Step 2: Project growth

Estimate future traffic based on business projections.

Growth patterns:

  • Linear growth: Steady user acquisition (10% per quarter)
  • Seasonal spikes: Holiday shopping, tax season
  • Event-driven spikes: Product launches, viral content
  • Step changes: New market entry, major feature launch

Step 3: Identify constraints

Which resource will be exhausted first?

Common constraints:

  • CPU capacity
  • Memory limits
  • Database connection pool
  • Network bandwidth
  • Storage capacity

Step 4: Plan capacity increases

Add capacity before constraints are reached.

Capacity increase strategies:

  • Vertical scaling: Larger instances (quick to apply, but bounded by hardware limits)
  • Horizontal scaling: More instances (sustainable, scales much further)
  • Architectural changes: Caching, sharding, async processing

Step 5: Test capacity

Load test with projected future traffic to validate capacity plan.

Autoscaling

Automatically adjust capacity based on demand.

Autoscaling triggers:

  • CPU utilization: Scale when CPU > 70%
  • Request queue depth: Scale when queue > 100 requests
  • Custom metrics: Scale based on business metrics (orders/minute)

Autoscaling considerations:

  • Scale-up delay: Time to provision and start new instances
  • Scale-down caution: Don’t scale down too aggressively
  • Warm-up period: New instances may need time to reach full capacity
  • Cost implications: Autoscaling can increase costs unexpectedly

Autoscaling best practices:

  • Set minimum and maximum instance counts
  • Use predictive scaling for known traffic patterns
  • Test scale-up and scale-down scenarios
  • Monitor scaling events and costs
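
The core scaling decision is a small calculation. The sketch below mirrors the proportional rule used by systems such as the Kubernetes Horizontal Pod Autoscaler, with minimum and maximum bounds (the numbers are assumptions):

import math

def desired_instances(current, cpu_percent, target_cpu=70,
                      min_instances=2, max_instances=20):
    """Scale so that average CPU lands near the target, within fixed bounds."""
    desired = math.ceil(current * cpu_percent / target_cpu)
    return max(min_instances, min(max_instances, desired))

print(desired_instances(current=4, cpu_percent=90))   # 6 -> scale out
print(desired_instances(current=4, cpu_percent=30))   # 2 -> scale in, floored at the minimum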

Performance Testing Methodology

Performance testing validates that systems meet requirements under realistic conditions.

Test Environment Setup

Production parity:

  • Same instance types and configurations
  • Same database size and schema
  • Same network topology
  • Same third-party integrations (or realistic mocks)

Data volume:

  • Use production-scale data
  • Include data distribution (new users, power users)
  • Account for data growth over test duration

Test isolation:

  • Dedicated environment (not shared with other testing)
  • Isolated from production (no accidental traffic)

Test Design

Define test scenarios:

  • Identify critical user journeys
  • Model realistic user behavior (think time, navigation)
  • Include representative distribution of operations (reads, writes)

Set acceptance criteria:

  • Latency percentiles (p50, p95, p99)
  • Error rate thresholds (< 0.1%)
  • Resource utilization limits (CPU < 80%)

Ramp-up strategy:

  • Start with low load
  • Gradually increase to target load
  • Observe system behavior at each step
  • Identify point where performance degrades

Test Execution

Monitor during test:

  • Application latency (all percentiles)
  • Error rates (by type)
  • Resource utilization (CPU, memory, network, disk)
  • Database performance (query times, connection pool)
  • Third-party API latency

Identify bottlenecks:

  • CPU-bound: CPU utilization high, adding instances helps
  • Memory-bound: Memory exhausted, larger instances or caching helps
  • I/O-bound: Disk or network saturated, optimize I/O or scale storage
  • Database-bound: Database queries slow, optimize queries or scale database

Iterate and optimize:

  • Fix identified bottlenecks
  • Re-run tests to validate improvements
  • Repeat until acceptance criteria met

Test Analysis

Performance regression detection:

  • Compare current test results to baseline
  • Alert on latency increase (p95 > 10% slower)
  • Alert on throughput decrease (handles 10% fewer requests)
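
A regression gate can be a few lines in the test pipeline; the 10% tolerance below matches the alert thresholds above and is an assumption to tune:

def check_regression(baseline, current, tolerance=0.10):
    """Report latency metrics that regressed by more than the tolerance."""
    failures = []
    for metric, baseline_value in baseline.items():
        if current[metric] > baseline_value * (1 + tolerance):
            failures.append(f"{metric}: {current[metric]} vs baseline {baseline_value}")
    return failures

baseline = {"p95_ms": 180, "p99_ms": 420}       # from the last accepted run
current = {"p95_ms": 210, "p99_ms": 430}        # from this run
for failure in check_regression(baseline, current):
    print("REGRESSION:", failure)               # in CI, exit non-zero instead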

Trend analysis:

  • Track performance over multiple releases
  • Identify gradual degradation
  • Correlate performance changes with code changes

Performance Patterns and Anti-Patterns

Patterns for High Performance

Request coalescing: Batch multiple requests into single operation.

  • Combine multiple database queries into one query
  • Batch API calls to external services
  • Use GraphQL to fetch multiple resources in one request

Lazy loading: Defer loading data until actually needed.

  • Don’t fetch related entities unless accessed
  • Paginate large result sets
  • Load images on scroll (web applications)

Data denormalization: Trade storage for query performance.

  • Duplicate frequently accessed data
  • Pre-compute aggregations
  • Store derived data alongside source data

Connection pooling: Reuse expensive connections.

  • Database connections
  • HTTP connections to external services
  • Thread pools for async operations

Bloom filters: Quickly check for non-existence.

  • Avoid expensive lookups when data doesn’t exist
  • Cache negative results (item not found)
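
A minimal Bloom filter sketch; real systems usually rely on a library or a cache-side implementation, and the sizing here is illustrative:

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present (false positives happen).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("user:123")
if not seen.might_contain("user:999"):
    pass  # definitely absent: skip the expensive database lookup entirely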

Performance Anti-Patterns

Premature optimization: Optimizing before measuring.

  • Solution: Profile first, optimize actual bottlenecks

Over-fetching: Retrieving more data than needed.

  • Solution: Fetch only required columns, paginate results

Blocking I/O on critical path: Waiting for slow operations.

  • Solution: Use async I/O, move work to background jobs

Inefficient serialization: Slow JSON/XML parsing.

  • Solution: Use binary formats (Protocol Buffers, MessagePack) for internal APIs

Unbounded resource consumption: No limits on memory, connections.

  • Solution: Set connection pool limits, implement backpressure

Ignoring caching opportunities: Repeatedly computing same results.

  • Solution: Cache computed results, use memoization

Death by a thousand cuts: Many small inefficiencies compound.

  • Solution: Profile to identify cumulative impact, fix most significant first

Performance Culture

Performance engineering is most effective when embedded in team culture.

Performance Budgets in Development

Set performance budgets for features:

  • New feature cannot degrade p95 latency by > 10ms
  • Bundle size cannot exceed 200KB (web applications)
  • API response time must stay under 200ms

Enforce budgets in CI/CD:

  • Run performance tests in CI pipeline
  • Fail builds that exceed budgets
  • Require performance review for risky changes

Performance Reviews

Include performance in code reviews:

  • Review database queries for efficiency
  • Check for N+1 queries
  • Validate caching strategy
  • Question synchronous calls to external services

Performance retrospectives:

  • Review performance incidents
  • Identify systemic performance issues
  • Prioritize performance improvements

Continuous Performance Monitoring

Track performance metrics over time:

  • Dashboard showing latency trends
  • Alert on performance regressions
  • Correlate deployments with performance changes

Performance goals in planning:

  • Allocate sprints to performance improvements
  • Balance features with performance work
  • Use error budgets to guide prioritization

Key Takeaways

Define performance requirements early: Specify latency, throughput, and scalability targets before building. Use percentiles, not averages.

Measure before optimizing: Profile to identify actual bottlenecks. Don’t waste effort optimizing code that doesn’t impact performance.

Optimize in priority order: Algorithmic improvements and caching provide the highest ROI. Micro-optimizations rarely matter.

Performance budgets guide decisions: Allocate latency across components. Exceeding budget triggers optimization work.

Caching is the highest-impact optimization: Cache database queries, computed results, and external API responses. Manage cache invalidation carefully.

Database optimization is critical: Add indexes, eliminate N+1 queries, use connection pooling. Optimize queries before scaling hardware.

Async processing improves responsiveness: Move non-critical work out of request path. Use message queues and background workers.

Capacity planning prevents outages: Project growth, identify constraints, add capacity before limits are reached. Test capacity plans with load testing.

Load testing validates performance: Test under realistic load in production-like environments. Monitor all layers of the stack during tests.

Performance engineering is continuous: Monitor performance in production, detect regressions, iterate on improvements. Embed performance in development culture.
