Observability Architecture & Strategy
What is Observability Architecture?
Observability architecture addresses how to design systems to be observable, how to collect and correlate telemetry data efficiently, and how to use observability to drive reliability and decision-making. This goes beyond implementing the three pillars (logs, metrics, traces) to strategic questions about sampling, cardinality, cost management, and using observability data to improve systems.
Core principle: Observability is not something you bolt on after building a system. It must be architected from the start, with intentional decisions about what to measure, how to correlate data, and how to make telemetry actionable.
For observability fundamentals (three pillars, basic concepts), see Observability Fundamentals.
Beyond the Three Pillars: Correlation is Key
The three pillars (logs, metrics, traces) are useful individually, but their real power emerges when correlated.
The correlation problem:
User reports: "Checkout failed 5 minutes ago"
Logs show: Error at 14:47 UTC in payment-service
Metrics show: 95th percentile latency spike at 14:46 UTC
Traces show: Request timeout to payment gateway at 14:47 UTC
Without correlation: Three separate investigations
With correlation: Single root cause analysis
Correlation strategies:
- Trace IDs: Propagate unique identifier across all services in a request
- Resource attributes: Tag all telemetry with service name, version, environment
- Temporal correlation: Align time windows across logs, metrics, and traces
- Exemplars: Link metrics spikes to specific trace samples that caused them
Why correlation matters: Distributed systems fail in complex ways. A metric spike tells you something is wrong. A trace shows which request failed. Logs explain why it failed. Only by correlating all three can you understand the complete picture.
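To make the correlation concrete, here is a minimal Python sketch that stamps every structured log entry with the active trace ID, assuming the OpenTelemetry SDK is already configured; the service name and the `gateway` field are illustrative, not part of any standard.

```python
# Minimal log/trace correlation sketch: every log line carries the active
# trace ID so a metric spike, a trace, and a log entry join on one key.
import json
import logging
from opentelemetry import trace

logger = logging.getLogger("payment-service")

def log_with_trace(level: int, message: str, **fields) -> None:
    ctx = trace.get_current_span().get_span_context()
    record = {
        "message": message,
        "service": "payment-service",             # resource attribute
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        **fields,
    }
    logger.log(level, json.dumps(record))

# The same trace_id is searchable in the trace backend, so this log line,
# the failing trace, and an exemplar-linked metric all point at one request.
log_with_trace(logging.ERROR, "payment gateway timeout", gateway="example-gateway")
```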
Service Level Objectives (SLOs) and Indicators (SLIs)
SLOs are the foundation of reliability engineering. They define what “good enough” means and guide observability strategy.
SLI: Service Level Indicator
A carefully defined quantitative measure of service behavior. The metric you actually measure.
Good SLIs have these properties:
- Measurable: You can actually collect this data
- Meaningful: Reflects user experience or business impact
- Simple: Easy to understand and explain
- Actionable: Can be improved through engineering work
Common SLI patterns:
| SLI Type | What it Measures | Example |
|---|---|---|
| Request-driven | Request success rate | 99.9% of API requests return 2xx/3xx status |
| Latency-based | Response time percentiles | 95% of requests complete in < 500ms |
| Availability | System uptime | Service responds to health checks 99.95% of the time |

| Throughput | Processing capacity | System handles 10,000 requests/second |
| Durability | Data loss prevention | Zero data loss in storage system |
| Correctness | Data accuracy | 99.99% of calculations produce correct results |
SLI selection framework:
- Identify user-facing interactions (API calls, page loads, data queries)
- Define success criteria (status codes, latency thresholds, error types)
- Choose percentiles carefully (p50, p95, p99, p99.9)
- Consider both availability and latency
- Avoid vanity metrics (100% uptime is impossible and wasteful)
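As a worked example of the framework above, here is a minimal sketch that computes a request-driven SLI and a latency SLI from a window of request records; the `Request` fields and sample values are illustrative.

```python
# Compute two common SLIs from a window of request outcomes.
from dataclasses import dataclass

@dataclass
class Request:
    status: int
    latency_ms: float

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that returned a non-5xx status."""
    return sum(1 for r in requests if r.status < 500) / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """Fraction of requests that completed under the latency threshold."""
    return sum(1 for r in requests if r.latency_ms < threshold_ms) / len(requests)

window = [Request(200, 120.0), Request(200, 480.0), Request(503, 1400.0)]
print(f"availability SLI: {availability_sli(window):.3f}")   # 0.667
print(f"latency SLI (<500ms): {latency_sli(window):.3f}")    # 0.667
```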
SLO: Service Level Objective
Target value or range for an SLI. The threshold for “good enough.”
Example SLOs:
- Availability: 99.9% of requests succeed (allows about 43 minutes of downtime per month)
- Latency: 95% of requests complete in < 500ms
- Durability: 99.999999999% (11 nines) of data retained annually
SLO structure:
SLO = SLI + Target + Time Window
Example: 99.9% of API requests return 2xx in the last 30 days
- Target: 99.9%
- SLI: the proportion of API requests that return 2xx (the success definition)
- Time window: the last 30 days
Setting realistic SLOs:
- Measure current performance: Establish baseline before setting targets
- Align with user expectations: Ask “what would users tolerate?”
- Consider dependencies: Your SLO cannot be tighter than those of your upstream dependencies
- Leave error budget: Don’t promise 100%; leave room for planned maintenance
- Iterate: Start conservative, tighten over time as system matures
Error Budgets
The acceptable amount of unreliability. If your SLO is 99.9%, your error budget is 0.1%.
Error budget calculation:
Error budget = 100% - SLO
If SLO = 99.9% availability
Error budget = 0.1% = 43 minutes/month
Spent error budget: 12 minutes so far this month
Remaining budget: 31 minutes
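The arithmetic above is simple enough to encode directly; here is a minimal sketch assuming a 30-day window and an availability SLO expressed as a fraction.

```python
# Error budget arithmetic for an availability SLO over a 30-day window.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total minutes of allowed unavailability in the window."""
    return (1.0 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)       # ~43.2 minutes per month
spent = 12.0                               # minutes of downtime so far
remaining = budget - spent

print(f"budget: {budget:.1f} min, spent: {spent:.1f} min, "
      f"remaining: {remaining:.1f} min")   # remaining ≈ 31.2 min
```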
Error budget policy:
- Budget available: Focus on new features, take calculated risks
- Budget exhausted: Freeze feature releases, focus on reliability improvements
- Trend toward exhaustion: Shift some engineering to reliability work
Why error budgets matter: They balance innovation velocity with reliability. Without them, teams either ship recklessly or become paralyzed by fear of breaking things.
Implementing SLOs
Step 1: Choose SLIs
Pick 2-4 SLIs that matter most to users. Don’t try to measure everything.
Step 2: Instrument measurement
Collect data for chosen SLIs. This often requires:
- Structured logging of request outcomes
- Metrics tracking latency percentiles
- Synthetic transactions measuring user journeys
Step 3: Establish baselines
Measure current performance over 30-90 days. Understand normal behavior and failure modes.
Step 4: Set SLOs
Choose targets slightly better than current baseline. Align with user expectations and business needs.
Step 5: Alert on burn rate
Don’t alert on SLO violations directly. Alert when error budget is being consumed too fast.
Burn rate alerts:
1 hour window, fast burn: Consuming 5% of monthly budget/hour
6 hour window, medium burn: Consuming 1% of monthly budget/hour
24 hour window, slow burn: Consuming 0.5% of monthly budget/hour
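A minimal sketch of a multi-window burn-rate check in the spirit of the rule of thumb above: alert when the rate at which the monthly budget is being consumed exceeds a per-window threshold. The `fetch_counts` function stands in for a metrics-backend query and is hypothetical.

```python
# Multi-window burn-rate alerting sketch (thresholds match the text above).
MONTH_HOURS = 30 * 24  # 720

# (window_hours, max fraction of the monthly budget consumed per hour)
WINDOW_RULES = [(1, 0.05), (6, 0.01), (24, 0.005)]

def budget_burn_per_hour(errors: int, total: int, slo: float) -> float:
    """Fraction of the monthly error budget consumed per hour at this error rate."""
    if total == 0:
        return 0.0
    observed_error_rate = errors / total
    # Example: slo=0.999 and a 3.6% error rate burn 5% of the monthly budget per hour.
    return observed_error_rate / (1.0 - slo) / MONTH_HOURS

def should_page(fetch_counts, slo: float = 0.999) -> bool:
    # fetch_counts(window_hours) -> (errors, total); a hypothetical metrics query.
    for window_hours, max_per_hour in WINDOW_RULES:
        errors, total = fetch_counts(window_hours)
        if budget_burn_per_hour(errors, total, slo) > max_per_hour:
            return True
    return False
```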
Step 6: Review and iterate
SLOs are not set in stone. Review them quarterly and adjust based on user feedback and business evolution.
Distributed Tracing at Scale
Distributed tracing is powerful but expensive. Tracing every request in a high-traffic system is cost-prohibitive and generates overwhelming data.
Sampling Strategies
Head-based sampling: Decide whether to trace a request at the beginning (when it enters the system).
Pros:
- Simple to implement
- Consistent trace coverage
- Predictable resource usage
Cons:
- May miss rare but important failures
- Cannot sample based on outcome (don’t know if request will fail)
Tail-based sampling: Decide whether to keep a trace after the request completes, based on its characteristics.
Pros:
- Keep all errors and slow requests
- Discard successful, fast requests
- Capture rare failures
Cons:
- Complex to implement (requires buffering traces)
- Higher resource overhead
- Inconsistent trace coverage
Adaptive sampling: Adjust sampling rate based on current traffic volume and error rates.
Example adaptive strategy:
Normal traffic: Sample 1% of requests
Traffic spike: Sample 0.1% of requests
Error spike: Sample 100% of errors, 1% of successes
Low traffic: Sample 10% of requests
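One way to express such a policy in code is sketched below: rate-based sampling for routine successes, combined with outcome-based keeps for errors and slow requests (which requires a tail-based component, since the outcome is only known at the end). All thresholds are illustrative.

```python
# Adaptive sampling decision sketch; thresholds are illustrative.
import random

def sample_rate(requests_per_second: float, error_rate: float) -> float:
    """Baseline probability of keeping a successful, fast request."""
    if error_rate > 0.05:           # error spike: keep far more traces
        return 1.0
    if requests_per_second < 100:   # low traffic: afford more coverage
        return 0.10
    if requests_per_second > 5000:  # traffic spike: protect the backend
        return 0.001
    return 0.01                     # normal traffic

def should_keep_trace(is_error: bool, latency_ms: float,
                      requests_per_second: float, error_rate: float,
                      latency_slo_ms: float = 500.0) -> bool:
    # Errors and SLO-violating requests are always kept.
    if is_error or latency_ms > latency_slo_ms:
        return True
    return random.random() < sample_rate(requests_per_second, error_rate)
```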
Sampling best practices:
- Always trace errors and slow requests (above latency SLO)
- Sample successes at lower rate
- Increase sampling temporarily when debugging issues
- Sample different endpoints at different rates (critical paths higher)
- Include sampling decision in trace context (propagate to downstream services)
Trace Propagation
Traces only work if trace context propagates across service boundaries.
W3C Trace Context standard:
- traceparent: Trace ID, parent span ID, and sampling decision
- tracestate: Vendor-specific context (optional)
Propagation challenges:
- Asynchronous processing: Queue messages, background jobs
- Multiple protocols: HTTP, gRPC, messaging systems
- Third-party services: External APIs may not support trace context
- Legacy systems: Cannot instrument old code
Solutions:
- Use automatic instrumentation libraries (OpenTelemetry)
- Propagate trace context in message headers/metadata
- Generate new trace IDs for third-party boundaries, link them to parent trace
- Create synthetic spans to represent uninstrumented systems
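As one illustration of header/metadata propagation, here is a minimal sketch that carries W3C trace context across a message queue using the OpenTelemetry propagation API; it assumes the SDK and its default TraceContext propagator are configured, and the queue and service names are stand-ins.

```python
# Propagating trace context through an asynchronous boundary.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("order-service")

def publish(queue, payload: dict) -> None:
    headers: dict = {}
    inject(headers)  # adds the traceparent (and tracestate) headers
    queue.put({"headers": headers, "payload": payload})

def consume(message: dict) -> None:
    # Continue the producer's trace instead of starting an unrelated one.
    ctx = extract(message["headers"])
    with tracer.start_as_current_span("process-order", context=ctx):
        ...  # handle the payload
```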
Span Design
Spans represent operations within a trace. Good span design is critical for trace usability.
Span naming conventions:
- Use verb + resource pattern: GET /orders, ProcessPayment, InsertDatabase
- Include operation type in the name, not as an attribute
- Keep names stable across versions
- Avoid high cardinality (user IDs, timestamps in names)
Span attributes:
- Tag spans with resource-level attributes (service name, version, environment)
- Include operation-level attributes (HTTP method, status code, database table)
- Add business context (customer type, transaction amount ranges)
- Avoid sensitive data (passwords, tokens, PII)
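A minimal sketch of this guidance using the OpenTelemetry API: the span name stays low-cardinality (route template, not the concrete URL), and identifiers go into attributes. The attribute keys prefixed with `app.` are illustrative conventions, not a standard.

```python
# Span naming and attributes: stable name, details as attributes.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def get_order(order_id: str, customer_tier: str):
    # Good: stable, low-cardinality name (route template).
    with tracer.start_as_current_span("GET /orders/{id}") as span:
        span.set_attribute("http.request.method", "GET")
        span.set_attribute("app.order.id", order_id)             # operation detail
        span.set_attribute("app.customer.tier", customer_tier)   # business context
        # Avoid: a name like f"GET /orders/{order_id}" creates one span
        # name per order ID and explodes cardinality in the backend.
        ...
```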
Span relationships:
- ChildOf: Parent waits for child to complete (synchronous)
- FollowsFrom: Parent does not wait for child (asynchronous)
Common span antipatterns:
- Spans too granular (every function call becomes a span)
- Spans too coarse (entire service is one span)
- Missing critical spans (database queries, external API calls)
- High cardinality attributes (user IDs, unique identifiers)
Cardinality Management
Cardinality is the number of unique values for a metric dimension or label. Because each unique combination of label values creates its own time series, high cardinality multiplies storage cost and degrades query performance.
The Cardinality Explosion
Example:
Metric: http_requests_total
Labels:
- service: 10 services
- endpoint: 50 endpoints/service
- status_code: 6 codes
- user_id: 1,000,000 users ← High cardinality!
Total time series: 10 × 50 × 6 × 1,000,000 = 3 billion time series
Why cardinality matters:
- Storage costs: Each unique label combination creates a new time series
- Query performance: Aggregating billions of time series is slow
- Memory overhead: Metrics systems cache metadata for all time series
Managing Cardinality
Low cardinality labels (safe to use):
- Service name (tens of services)
- Environment (dev, staging, production)
- Region (handful of AWS regions)
- HTTP method (GET, POST, PUT, DELETE, PATCH)
- Status code ranges (2xx, 3xx, 4xx, 5xx)
High cardinality labels (avoid):
- User IDs
- Session IDs
- Request IDs
- IP addresses
- Unique identifiers
- Timestamps
Strategies to reduce cardinality:
Bucketing: Group high cardinality values into ranges.
Bad: response_time{milliseconds="347"}
Good: response_time_bucket{le="500"}
Sampling: Only record a percentage of events.
Record 1% of user interactions, not all 1 million users
Aggregation: Pre-aggregate before storing metrics.
Bad: user_purchases{user_id="12345"}
Good: purchases_total{user_type="premium"}
Exemplars: Link metrics to trace samples instead of exploding cardinality.
http_requests_total{service="api", code="500"} = 127
exemplar: trace_id="abc123" (sample of a 500 error)
Separate hot and cold paths:
- Hot path: Low cardinality, real-time metrics for dashboards
- Cold path: High cardinality, raw logs for investigation
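A minimal sketch of cardinality-safe metric design with the Python prometheus_client library, combining bounded labels with histogram bucketing; the metric names and bucket boundaries are illustrative.

```python
# Low-cardinality labels plus bucketed latency instead of raw values.
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "http_requests_total", "HTTP requests",
    ["service", "method", "status_class"],       # all low-cardinality
)

LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    ["service"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),    # bucketing, not raw values
)

def record(method: str, status: int, duration_s: float) -> None:
    status_class = f"{status // 100}xx"           # 503 -> "5xx"
    REQUESTS.labels("api", method, status_class).inc()
    LATENCY.labels("api").observe(duration_s)
    # No user_id / request_id labels: link individual requests via
    # exemplars or traces rather than one time series per user.
```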
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the CNCF standard for observability instrumentation. It provides vendor-neutral APIs and SDKs for generating logs, metrics, and traces.
Why OpenTelemetry Matters
Before OpenTelemetry:
Use Datadog? → Instrument with Datadog SDK
Switch to New Relic? → Re-instrument everything
Add second vendor? → Duplicate instrumentation
With OpenTelemetry:
Instrument once with OTel SDK
Export to any backend (Datadog, New Relic, Prometheus, Jaeger)
Switch backends without changing code
OpenTelemetry Architecture
Components:
- API: Stable interface for instrumentation (logs, metrics, traces)
- SDK: Implementation of API with configuration
- Instrumentation Libraries: Auto-instrument common frameworks (HTTP, gRPC, databases)
- Collector: Receive, process, and export telemetry data
- Exporters: Send data to backends (Prometheus, Jaeger, CloudWatch, etc.)
Deployment patterns:
Direct export: Application sends telemetry directly to backend.
App → SDK → Exporter → Backend (Prometheus, Jaeger)
Collector sidecar: Collector runs alongside application.
App → SDK → Collector (sidecar) → Backend
Collector gateway: Centralized collector for multiple applications.
App → SDK → Collector (gateway) → Backend
Collector benefits:
- Buffering: Handle backend outages gracefully
- Sampling: Centralized tail-based sampling
- Processing: Enrich, filter, transform telemetry
- Multi-backend: Send same data to multiple backends
- Security: Centralize credentials and backend connections
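For concreteness, here is a minimal sketch of wiring the OpenTelemetry Python SDK to export spans to a Collector over OTLP/gRPC, which fits the sidecar and gateway patterns above. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed; the endpoint, service name, and version are illustrative.

```python
# Application -> SDK -> OTLP exporter -> Collector.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "checkout-service",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("startup-check"):
    ...
```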
OpenTelemetry Adoption Strategy
Phase 1: Auto-instrumentation
Start with zero-code instrumentation for common frameworks (HTTP servers, database clients, message queues).
Phase 2: Custom spans for business logic
Add manual spans for business-critical operations (payment processing, order fulfillment).
Phase 3: Custom metrics
Emit business metrics (orders/minute, revenue, cart abandonment rate).
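A minimal sketch of this phase using the OpenTelemetry metrics API: business counters with a small, bounded set of attributes. It assumes the metrics SDK and an exporter are already configured; the metric and attribute names are illustrative.

```python
# Business metrics with low-cardinality attributes.
from opentelemetry import metrics

meter = metrics.get_meter("order-service")

orders_counter = meter.create_counter(
    "orders_placed", unit="1", description="Orders successfully placed"
)
revenue_counter = meter.create_counter(
    "order_revenue", unit="USD", description="Revenue from placed orders"
)

def record_order(amount_usd: float, customer_tier: str) -> None:
    attrs = {"customer.tier": customer_tier}   # bounded set of values
    orders_counter.add(1, attrs)
    revenue_counter.add(amount_usd, attrs)
```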
Phase 4: Structured logging integration
Correlate logs with traces using trace IDs in log entries.
Phase 5: Collector deployment
Deploy OpenTelemetry Collector for centralized processing and multi-backend support.
Migration from existing instrumentation:
- Run OTel and legacy instrumentation in parallel
- Compare data quality and coverage
- Gradually switch traffic to OTel exporters
- Remove legacy instrumentation after validation
Observability-Driven Development
Observability is not just for production debugging. It guides development, testing, and architecture decisions.
Design for Observability
Questions to ask during design:
- How will we know if this feature is working in production?
- What metrics indicate success or failure?
- What trace spans will show bottlenecks?
- What logs help debug failures?
- What are the SLIs and SLOs for this feature?
Observable API design:
- Return error codes that distinguish client errors from server errors
- Include request IDs in all responses
- Emit metrics for request rate, latency, error rate
- Log structured data (not just error messages)
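A framework-free sketch of this guidance is below: a request ID returned to the caller plus a structured outcome log that can feed request-rate, latency, and error-rate SLIs. The handler shape and field names are illustrative, not any particular framework's API.

```python
# Observable API handler sketch: request IDs and structured outcome logs.
import json
import logging
import time
import uuid

logger = logging.getLogger("api")

def handle_request(path: str, do_work) -> dict:
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    try:
        body, status = do_work()   # 4xx = client error, 5xx = server error
    except Exception:
        body, status = {"error": "internal_error"}, 500
    duration_ms = (time.monotonic() - start) * 1000
    # Structured outcome log: source data for rate, latency, and error SLIs.
    logger.info(json.dumps({"event": "request", "path": path, "status": status,
                            "duration_ms": round(duration_ms, 1),
                            "request_id": request_id}))
    # The request ID goes back to the caller so support reports can be
    # correlated with logs and traces.
    return {"status": status, "headers": {"x-request-id": request_id}, "body": body}
```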
Observable database design:
- Track query performance metrics per query type
- Log slow queries with execution plans
- Trace database calls with query text (parameterized)
- Monitor connection pool utilization
Observable integration design:
- Propagate trace context to external services
- Set appropriate timeouts and retry policies
- Distinguish between retriable and non-retriable errors
- Emit metrics for third-party API health
Testing with Observability
Use observability to validate tests:
- Verify metrics emitted during integration tests
- Assert traces contain expected spans
- Check log output for expected structured fields
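One way to assert on traces in tests is the OpenTelemetry SDK's in-memory exporter, sketched below; the span name being checked is a stand-in, and a real test would invoke the code under test instead of creating the span directly.

```python
# Asserting that expected spans were emitted during a test.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

def test_checkout_emits_payment_span():
    tracer = trace.get_tracer("checkout-tests")
    with tracer.start_as_current_span("ProcessPayment"):  # stand-in for real code
        pass
    names = [s.name for s in exporter.get_finished_spans()]
    assert "ProcessPayment" in names
```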
Chaos testing with observability:
- Inject failures (network latency, service errors)
- Verify SLOs are maintained
- Ensure observability data accurately reflects failures
Load testing with observability:
- Monitor metrics during load tests
- Identify bottlenecks through distributed tracing
- Validate SLOs under expected production load
Production Debugging Workflow
Standard debugging process with observability:
1. Alert fires: Error budget burn rate exceeds threshold
2. Check metrics: Identify which SLI is degraded (latency, error rate, availability)
3. Correlate logs: Filter logs by time window and affected service
4. Find traces: Look for failed or slow traces in the same window
5. Analyze trace: Identify the slow span or failing service
6. Read logs: Examine logs from the failing service, filtered by trace ID
7. Reproduce: Use trace details to reproduce the issue locally
8. Fix and deploy: Implement the fix, then monitor metrics to confirm resolution
Observability accelerates debugging: Without observability, steps 2-6 require manual correlation, SSHing into servers, and guesswork. With observability, the path from alert to root cause is direct.
Cost Management
Observability can become expensive at scale. Manage costs intentionally.
Cost Drivers
Storage:
- Logs: Highest volume, expensive to index and store
- Traces: High volume if not sampled aggressively
- Metrics: Lower volume but high cardinality increases cost
Ingestion:
- Some vendors charge per GB ingested
- High traffic systems generate terabytes/month
Queries:
- Scanning large log volumes is expensive
- Complex queries across billions of time series
Cost Optimization Strategies
Tiered retention:
- Hot tier (7 days): Full detail, fast queries, expensive storage
- Warm tier (30 days): Sampled data, slower queries, cheaper storage
- Cold tier (1 year): Aggregated data, archive storage, rare access
Sampling:
- Aggressively sample successful requests
- Keep all errors and slow requests
- Use tail-based sampling to retain interesting traces
Log filtering at source:
- Don’t send debug logs to centralized logging
- Filter noisy logs (health checks, static assets)
- Use local logging for non-critical information
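A minimal sketch of filtering at the source with the Python standard library: a logging.Filter that drops health-check access logs before they are shipped anywhere. The path and logger names are illustrative.

```python
# Drop noisy health-check logs before they leave the process.
import logging

class DropHealthChecks(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record.
        return "/healthz" not in record.getMessage()

access_logger = logging.getLogger("access")
access_logger.addFilter(DropHealthChecks())

access_logger.info("GET /healthz 200")   # filtered out, never shipped
access_logger.info("GET /orders 500")    # kept
```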
Metric aggregation:
- Pre-aggregate before sending to backend
- Use histograms instead of individual samples
- Drop low-value metrics
Query optimization:
- Use indexed fields for filtering
- Limit time ranges
- Pre-aggregate data for dashboards
- Cache frequently-run queries
Right-size infrastructure:
- Adjust retention policies based on actual usage
- Archive old data to cheaper storage
- Use compression and columnar formats
Observability Maturity Model
Level 1: Reactive Monitoring
- Basic metrics and logs
- Manual correlation
- Alerting on simple thresholds
- Debugging is slow and manual
Level 2: Proactive Observability
- Distributed tracing implemented
- Centralized logging and metrics
- Automated correlation between signals
- SLOs defined but not enforced
Level 3: Reliability Engineering
- SLOs drive development priorities
- Error budgets guide release decisions
- Automated anomaly detection
- Observability integrated into CI/CD
Level 4: Continuous Optimization
- Observability-driven architecture decisions
- Predictive analytics and capacity planning
- Automated remediation based on telemetry
- Business metrics tied to technical metrics
Goal: Progress from Level 1 to Level 3. Level 4 is aspirational for most organizations.
Common Observability Antipatterns
Log Everything Syndrome
Problem: Excessive logging drowns signal in noise and drives up costs.
Solution: Use log levels appropriately. DEBUG stays in development. INFO for significant events only. ERROR for actual failures.
Alert Fatigue
Problem: Too many alerts, most are false positives, team ignores them all.
Solution: Alert on symptoms (SLO violations), not causes (disk space, CPU). Reduce alert noise until every alert is actionable.
Metric Explosion
Problem: Creating metrics for everything, including high cardinality dimensions.
Solution: Focus on metrics that tie to SLOs. Use exemplars and traces for high cardinality debugging.
Siloed Telemetry
Problem: Logs in one system, metrics in another, traces in a third. No correlation.
Solution: Use OpenTelemetry for unified instrumentation. Ensure trace IDs appear in logs and metrics.
Missing Context
Problem: Logs and traces lack essential context (user type, request origin, feature flags).
Solution: Include structured context in all telemetry. Tag resources consistently (service name, version, environment).
Optimizing for Known Failures
Problem: Only monitoring known failure modes. Surprised by unexpected issues.
Solution: Observability is about discovering unknown unknowns. Use tracing and logging to investigate novel failures.
Key Takeaways
SLOs drive observability strategy: Define what “good enough” means, measure it, and alert when error budget burns too fast.
Correlation is more valuable than individual signals: Logs, metrics, and traces are useful alone but powerful when correlated through trace IDs and resource attributes.
Sampling is essential at scale: You cannot trace every request. Sample intelligently—keep errors and slow requests, discard routine successes.
Manage cardinality intentionally: High cardinality dimensions (user IDs, unique identifiers) explode storage costs and query performance. Use bucketing, aggregation, and exemplars.
OpenTelemetry provides vendor neutrality: Instrument once with OTel, export to any backend. Switch vendors without re-instrumenting code.
Design systems to be observable: Observability is not bolted on afterward. Ask during design: “How will we know if this works in production?”
Observability enables reliability: Error budgets balance innovation and stability. SLOs make reliability measurable and actionable.
Cost management is critical: Observability can become expensive. Use tiered retention, aggressive sampling, and query optimization to control costs.
Progress through maturity levels: Start with basic monitoring (Level 1), add distributed tracing and SLOs (Level 2), integrate observability into development and release processes (Level 3).
Avoid common antipatterns: Don’t log everything, don’t create too many alerts, don’t ignore cardinality, don’t silo telemetry.