โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram

[Diagram] Request flow (trace_id: abc123): Client → API Gateway (span 0–200ms) → Order Service (span 20–180ms) → Payment Service (span 40–170ms). The trace waterfall shows API Gateway at 200ms total, with Payment (130ms) as the bottleneck.

Three pillars of observability:

  • 📋 LOGS — discrete events, structured JSON, Elasticsearch / Loki. WHY did it happen?
  • 📈 METRICS — numeric time series, RED (Rate, Errors, Duration), Prometheus / Grafana. WHAT is broken?
  • 🔍 TRACES — request path across services, spans with timing, Jaeger / Zipkin. WHERE is the bottleneck?

🔗 Correlation: put the trace_id in all logs, metrics, and traces to jump from dashboard → trace → log seamlessly. OpenTelemetry: unified collection for all three pillars.

How It Works

Observability is about understanding what's happening inside your distributed system from the outside. Three complementary pillars each answer a different question when something goes wrong.

The Three Pillars

๐Ÿ“‹ Logs โ€” "WHY did it happen?"

Discrete events with timestamp and context. Structured logging (JSON) over unstructured text. Ship to Elasticsearch/Loki/CloudWatch. Use log levels: ERROR (pages you), WARN (needs attention), INFO (business events), DEBUG (development only).
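A structured log line is just JSON with consistent fields. Here is a minimal sketch using Python's stdlib logging module; the field names (`level`, `msg`, `trace_id`) and the logger name are illustrative conventions, not a required schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            # Carrying the trace_id in every log line is what makes
            # log <-> trace correlation possible later.
            "trace_id": getattr(record, "trace_id", None),
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# extra= attaches the trace_id to this specific record.
logger.info("payment authorized", extra={"trace_id": "abc123"})
```

Because every line is machine-parseable JSON, Elasticsearch/Loki can index and filter on `trace_id` directly instead of grepping free text.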

๐Ÿ“ˆ Metrics โ€” "WHAT is broken?"

Numeric measurements over time. Request count, p99 latency, error rate, CPU usage. Time-series DBs: Prometheus, Datadog. Alert on thresholds. RED method: Rate, Errors, Duration. USE method: Utilization, Saturation, Errors.
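The RED method can be sketched as a tiny in-memory tracker. This is a toy stand-in for what a real metrics client (e.g., the Prometheus client library) does, with an approximate p99 over raw samples; real systems use histograms, not stored samples.

```python
from collections import defaultdict

class RedMetrics:
    """Toy RED tracker: Rate, Errors, Duration per endpoint."""
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total request count
        self.errors = defaultdict(int)      # Errors: failed request count
        self.durations = defaultdict(list)  # Duration: latency samples (ms)

    def observe(self, endpoint, duration_ms, ok=True):
        self.requests[endpoint] += 1
        if not ok:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_ms)

    def p99(self, endpoint):
        # Crude percentile: index into the sorted samples.
        samples = sorted(self.durations[endpoint])
        idx = min(len(samples) - 1, int(0.99 * len(samples)))
        return samples[idx]

m = RedMetrics()
for latency in [20, 25, 30, 1800]:  # one pathological request
    m.observe("/ride-request", latency, ok=latency < 1000)
# m.p99("/ride-request") surfaces the outlier that averages would hide.
```

This is why p99 (not the mean) is what you alert on: a single 1.8s request disappears into an average but dominates the tail.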

๐Ÿ” Traces โ€” "WHERE is the bottleneck?"

Follow a request across services. Each request gets a trace_id; each service adds a span with timing. Visualizes the full call chain as a waterfall. Tools: Jaeger, Zipkin, OpenTelemetry. Essential for microservices debugging.
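The span-and-trace_id idea fits in a few lines. This in-process sketch only shows the core mechanics; real tracers (Jaeger, Zipkin, OpenTelemetry SDKs) also propagate the trace_id and parent-span IDs across process boundaries via request headers.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer: one trace_id per request, one timed span per hop."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex[:8]
        self.spans = []  # (name, duration_ms), appended as spans finish

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, (time.perf_counter() - start) * 1000))

tracer = Tracer()
with tracer.span("api-gateway"):
    with tracer.span("order-service"):
        with tracer.span("payment-service"):
            time.sleep(0.01)  # simulated downstream work

# The longest span in the waterfall points at the bottleneck.
bottleneck = max(tracer.spans, key=lambda s: s[1])
```

Nesting the context managers is what produces the waterfall: each outer span's duration includes its children, so walking down the longest branch leads to the slow hop.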

๐Ÿ”— Correlation โ€” Connecting the Pillars

Include trace_id in all logs so you can jump: dashboard (metric spike) โ†’ trace (which service is slow) โ†’ logs (exact error message). Without correlation, you're searching haystacks across three separate systems.

Debugging Flow (Real-World)

  1. Alert fires: "ride-request API p99 latency > 2s" โ€” Metrics tell you WHAT is broken.
  2. Check Grafana dashboard: p99 spiked at 14:03. Overlay dependent services โ€” pricing service also spiked โ†’ likely culprit.
  3. Pull a slow trace (Jaeger): Waterfall shows ride-request โ†’ pricing-service took 1.8s (normally 20ms).
  4. Drill into pricing span: It's waiting 1.7s on a Redis call โ†’ check Redis metrics.
  5. Root cause: Redis at 98% memory, evicting keys, latency p99 = 500ms โ†’ a new feature cached too much data.

Key Design Decisions

๐ŸŽฏ

Sampling rate: 100% tracing = perfect visibility but massive storage and ~5ms overhead per span. 1% sampling = cheap but you miss rare issues. Solution: adaptive sampling โ€” 100% for errors, 100% for slow requests (>1s), 1% for normal requests.
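The adaptive policy is a short decision function. A sketch, assuming the thresholds named above (>1s = slow, 1% baseline); note that deciding on duration means the sampling decision happens after the request completes (tail-based), since latency isn't known up front.

```python
import random

def should_sample(duration_ms, is_error, baseline_rate=0.01, slow_ms=1000):
    """Adaptive sampling: keep all errors and slow requests,
    a small random fraction of everything else."""
    if is_error:
        return True                # 100% of errors
    if duration_ms > slow_ms:
        return True                # 100% of slow requests (>1s)
    return random.random() < baseline_rate  # ~1% of normal traffic
```

The rare, interesting traces (errors, outliers) are always kept, while the storage bill is dominated by the 1% of normal traffic.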

๐Ÿ“Š

Metric cardinality: Don't use high-cardinality labels. user_id as a metric label = millions of time series = Prometheus explodes. Use high-cardinality data in logs/traces, not metrics. Metrics should have bounded label sets.
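The explosion is multiplicative: a metric's time-series count is roughly the product of distinct values per label. A back-of-envelope sketch with illustrative numbers:

```python
def series_count(label_values: dict) -> int:
    """Rough time-series count: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total

# Bounded label set: fine.
bounded = {"method": ["GET", "POST"], "status": ["2xx", "4xx", "5xx"]}
# series_count(bounded) -> 6 time series

# Add user_id and every existing series is multiplied by the user count.
unbounded = dict(bounded, user_id=[f"u{i}" for i in range(1_000_000)])
# series_count(unbounded) -> 6,000,000 time series: the explosion
```

One careless label turns 6 series into 6 million, which is why user_id belongs as a field on logs and traces, never as a metric label.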

๐Ÿ’ฐ

Log volume management: At scale, logging everything is prohibitively expensive (~1PB/day at Uber). Sample high-traffic paths (10%), keep 100% for payment/auth. Structured logging enables efficient querying without storing full text.
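The path-based retention policy can be sketched as a filter at the log shipper. The path prefixes and 10% rate here are illustrative, following the policy above.

```python
import random

# Paths whose logs are never dropped (money and security trails).
CRITICAL_PREFIXES = ("/payment", "/auth")

def keep_log(path, sample_rate=0.10):
    """Drop ~90% of high-traffic logs, keep 100% of critical paths."""
    if path.startswith(CRITICAL_PREFIXES):
        return True                       # full retention
    return random.random() < sample_rate  # sampled retention
```

A common refinement (not shown) is to bypass sampling entirely for ERROR-level records, mirroring the adaptive-tracing policy.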

๐Ÿ”ง

OpenTelemetry vs vendor-specific: OpenTelemetry provides a single SDK for all three pillars with vendor-agnostic export. Avoids lock-in to Datadog/New Relic/Splunk. The industry is converging on OTel as the standard.

When to Use

Mentioning observability in a system design interview shows operational maturity โ€” you're not just designing the happy path.

  • "How do you debug issues in your distributed system?" โ€” Three pillars: metrics (what), traces (where), logs (why)
  • "How do you know when something is wrong?" โ€” Metrics + alerting (p99 latency, error rate, saturation)
  • "A user reports the app is slow โ€” how do you investigate?" โ€” Pull their trace_id โ†’ waterfall shows the slow service โ†’ logs show the error
  • "How do you handle monitoring for 2000 microservices?" โ€” Standardized instrumentation (OpenTelemetry), RED dashboards per service, adaptive sampling

Interview signal: Walk through a concrete debugging flow: "I'd check metrics first for which service, then pull a trace to find the slow span, then check that span's logs for the error." This structured approach is exactly how senior engineers debug.

Real-World Examples

  • Uber (Jaeger) โ€” Built Jaeger (now CNCF project) because debugging latency across 2000+ microservices was impossible without distributed tracing. A single ride request touches ~20 services โ€” Jaeger shows the full call tree with timing.
  • Prometheus + Grafana โ€” The standard observability stack for Kubernetes. Nearly every K8s deployment uses Prometheus for metrics and Grafana for dashboards. The RED method originated from this ecosystem.
  • Google (Dapper) โ€” Published the foundational distributed tracing paper (2010). Traces all requests across Google's infrastructure with minimal overhead. Inspired Zipkin, Jaeger, and OpenTelemetry.
  • Netflix โ€” Uses a combination of Atlas (metrics), Mantis (real-time stream processing on logs), and custom tracing to monitor their 700+ microservices globally.

Back-of-Envelope Numbers

Metric | Value
Uber log volume (compressed) | ~1 PB/day
Metrics per microservice | ~100 time series
2000 services × 100 metrics | ~200K time series
Trace storage (1% sampling) | ~50 GB/day (10M traces × 5 KB)
Tracing overhead per span | ~1–5 ms
MTTD (Mean Time to Detect) | <2 minutes (automated alerting)
MTTR (Mean Time to Resolve) | ~15–30 min (with good observability)
Prometheus scrape interval | 15–30 seconds