โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram

[Diagram] Request flow (trace_id: abc123): Client → API Gateway (span 0–200ms) → Order Service (span 20–180ms) → Payment Service (span 40–170ms). The trace waterfall shows API Gateway at 200ms total, with Payment (130ms) as the bottleneck.

Three pillars of observability:

  • 📋 LOGS — discrete events, structured JSON, Elasticsearch / Loki. WHY did it happen?
  • 📈 METRICS — numeric time series, RED (Rate, Errors, Duration), Prometheus / Grafana. WHAT is broken?
  • 🔍 TRACES — request path across services, spans with timing, Jaeger / Zipkin. WHERE is the bottleneck?

🔗 Correlation: put the trace_id in all logs, metrics, and traces to jump from dashboard → trace → log seamlessly. OpenTelemetry: unified collection for all three pillars.

How It Works

Observability is about understanding what's happening inside your distributed system from the outside. Three complementary pillars each answer a different question when something goes wrong.

The Three Pillars

๐Ÿ“‹ Logs โ€” "WHY did it happen?"

Discrete events with timestamp and context. Structured logging (JSON) over unstructured text. Ship to Elasticsearch/Loki/CloudWatch. Use log levels: ERROR (pages you), WARN (needs attention), INFO (business events), DEBUG (development only).
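A structured log line is just JSON with consistent fields. Here is a minimal sketch using Python's stdlib logging module; the field names (`level`, `msg`, `trace_id`) and the logger name are illustrative conventions, not a required schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            # Carrying the trace_id in every log line is what makes
            # log <-> trace correlation possible later.
            "trace_id": getattr(record, "trace_id", None),
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# extra= attaches the trace_id to this specific record.
logger.info("payment authorized", extra={"trace_id": "abc123"})
```

Because every line is machine-parseable JSON, Elasticsearch/Loki can index and filter on `trace_id` directly instead of grepping free text.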

๐Ÿ“ˆ Metrics โ€” "WHAT is broken?"

Numeric measurements over time. Request count, p99 latency, error rate, CPU usage. Time-series DBs: Prometheus, Datadog. Alert on thresholds. RED method: Rate, Errors, Duration. USE method: Utilization, Saturation, Errors.
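The RED method can be sketched as a tiny in-memory tracker. This is a toy stand-in for what a real metrics client (e.g., the Prometheus client library) does, with an approximate p99 over raw samples; real systems use histograms, not stored samples.

```python
from collections import defaultdict

class RedMetrics:
    """Toy RED tracker: Rate, Errors, Duration per endpoint."""
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total request count
        self.errors = defaultdict(int)      # Errors: failed request count
        self.durations = defaultdict(list)  # Duration: latency samples (ms)

    def observe(self, endpoint, duration_ms, ok=True):
        self.requests[endpoint] += 1
        if not ok:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_ms)

    def p99(self, endpoint):
        # Crude percentile: index into the sorted samples.
        samples = sorted(self.durations[endpoint])
        idx = min(len(samples) - 1, int(0.99 * len(samples)))
        return samples[idx]

m = RedMetrics()
for latency in [20, 25, 30, 1800]:  # one pathological request
    m.observe("/ride-request", latency, ok=latency < 1000)
# m.p99("/ride-request") surfaces the outlier that averages would hide.
```

This is why p99 (not the mean) is what you alert on: a single 1.8s request disappears into an average but dominates the tail.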

๐Ÿ” Traces โ€” "WHERE is the bottleneck?"

Follow a request across services. Each request gets a trace_id; each service adds a span with timing. Visualizes the full call chain as a waterfall. Tools: Jaeger, Zipkin, OpenTelemetry. Essential for microservices debugging.
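The span-and-trace_id idea fits in a few lines. This in-process sketch only shows the core mechanics; real tracers (Jaeger, Zipkin, OpenTelemetry SDKs) also propagate the trace_id and parent-span IDs across process boundaries via request headers.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer: one trace_id per request, one timed span per hop."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex[:8]
        self.spans = []  # (name, duration_ms), appended as spans finish

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, (time.perf_counter() - start) * 1000))

tracer = Tracer()
with tracer.span("api-gateway"):
    with tracer.span("order-service"):
        with tracer.span("payment-service"):
            time.sleep(0.01)  # simulated downstream work

# The longest span in the waterfall points at the bottleneck.
bottleneck = max(tracer.spans, key=lambda s: s[1])
```

Nesting the context managers is what produces the waterfall: each outer span's duration includes its children, so walking down the longest branch leads to the slow hop.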

๐Ÿ”— Correlation โ€” Connecting the Pillars

Include trace_id in all logs so you can jump: dashboard (metric spike) โ†’ trace (which service is slow) โ†’ logs (exact error message). Without correlation, you're searching haystacks across three separate systems.

Debugging Flow (Real-World)

  1. Alert fires: "ride-request API p99 latency > 2s" โ€” Metrics tell you WHAT is broken.
  2. Check Grafana dashboard: p99 spiked at 14:03. Overlay dependent services โ€” pricing service also spiked โ†’ likely culprit.
  3. Pull a slow trace (Jaeger): Waterfall shows ride-request โ†’ pricing-service took 1.8s (normally 20ms).
  4. Drill into pricing span: It's waiting 1.7s on a Redis call โ†’ check Redis metrics.
  5. Root cause: Redis at 98% memory, evicting keys, latency p99 = 500ms โ†’ a new feature cached too much data.

Key Design Decisions

๐ŸŽฏ

Sampling rate: 100% tracing = perfect visibility but massive storage and ~5ms overhead per span. 1% sampling = cheap but you miss rare issues. Solution: adaptive sampling โ€” 100% for errors, 100% for slow requests (>1s), 1% for normal requests.
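The adaptive policy is a short decision function. A sketch, assuming the thresholds named above (>1s = slow, 1% baseline); note that deciding on duration means the sampling decision happens after the request completes (tail-based), since latency isn't known up front.

```python
import random

def should_sample(duration_ms, is_error, baseline_rate=0.01, slow_ms=1000):
    """Adaptive sampling: keep all errors and slow requests,
    a small random fraction of everything else."""
    if is_error:
        return True                # 100% of errors
    if duration_ms > slow_ms:
        return True                # 100% of slow requests (>1s)
    return random.random() < baseline_rate  # ~1% of normal traffic
```

The rare, interesting traces (errors, outliers) are always kept, while the storage bill is dominated by the 1% of normal traffic.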

๐Ÿ“Š

Metric cardinality: Don't use high-cardinality labels. user_id as a metric label = millions of time series = Prometheus explodes. Use high-cardinality data in logs/traces, not metrics. Metrics should have bounded label sets.
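The explosion is multiplicative: a metric's time-series count is roughly the product of distinct values per label. A back-of-envelope sketch with illustrative numbers:

```python
def series_count(label_values: dict) -> int:
    """Rough time-series count: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total

# Bounded label set: fine.
bounded = {"method": ["GET", "POST"], "status": ["2xx", "4xx", "5xx"]}
# series_count(bounded) -> 6 time series

# Add user_id and every existing series is multiplied by the user count.
unbounded = dict(bounded, user_id=[f"u{i}" for i in range(1_000_000)])
# series_count(unbounded) -> 6,000,000 time series: the explosion
```

One careless label turns 6 series into 6 million, which is why user_id belongs as a field on logs and traces, never as a metric label.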

๐Ÿ’ฐ

Log volume management: At scale, logging everything is prohibitively expensive (~1PB/day at Uber). Sample high-traffic paths (10%), keep 100% for payment/auth. Structured logging enables efficient querying without storing full text.
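The path-based retention policy can be sketched as a filter at the log shipper. The path prefixes and 10% rate here are illustrative, following the policy above.

```python
import random

# Paths whose logs are never dropped (money and security trails).
CRITICAL_PREFIXES = ("/payment", "/auth")

def keep_log(path, sample_rate=0.10):
    """Drop ~90% of high-traffic logs, keep 100% of critical paths."""
    if path.startswith(CRITICAL_PREFIXES):
        return True                       # full retention
    return random.random() < sample_rate  # sampled retention
```

A common refinement (not shown) is to bypass sampling entirely for ERROR-level records, mirroring the adaptive-tracing policy.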

๐Ÿ”ง

OpenTelemetry vs vendor-specific: OpenTelemetry provides a single SDK for all three pillars with vendor-agnostic export. Avoids lock-in to Datadog/New Relic/Splunk. The industry is converging on OTel as the standard.

When to Use

Mentioning observability in a system design interview shows operational maturity โ€” you're not just designing the happy path.

  • "How do you debug issues in your distributed system?" โ€” Three pillars: metrics (what), traces (where), logs (why)
  • "How do you know when something is wrong?" โ€” Metrics + alerting (p99 latency, error rate, saturation)
  • "A user reports the app is slow โ€” how do you investigate?" โ€” Pull their trace_id โ†’ waterfall shows the slow service โ†’ logs show the error
  • "How do you handle monitoring for 2000 microservices?" โ€” Standardized instrumentation (OpenTelemetry), RED dashboards per service, adaptive sampling

Interview signal: Walk through a concrete debugging flow: "I'd check metrics first for which service, then pull a trace to find the slow span, then check that span's logs for the error." This structured approach is exactly how senior engineers debug.

Real-World Examples

  • Uber (Jaeger) โ€” Built Jaeger (now CNCF project) because debugging latency across 2000+ microservices was impossible without distributed tracing. A single ride request touches ~20 services โ€” Jaeger shows the full call tree with timing.
  • Prometheus + Grafana โ€” The standard observability stack for Kubernetes. Nearly every K8s deployment uses Prometheus for metrics and Grafana for dashboards. The RED method originated from this ecosystem.
  • Google (Dapper) โ€” Published the foundational distributed tracing paper (2010). Traces all requests across Google's infrastructure with minimal overhead. Inspired Zipkin, Jaeger, and OpenTelemetry.
  • Netflix โ€” Uses a combination of Atlas (metrics), Mantis (real-time stream processing on logs), and custom tracing to monitor their 700+ microservices globally.

Back-of-Envelope Numbers

Metric | Value
Uber log volume (compressed) | ~1 PB/day
Metrics per microservice | ~100 time series
2000 services × 100 metrics | ~200K time series
Trace storage (1% sampling) | ~50 GB/day (10M traces × 5 KB)
Tracing overhead per span | ~1–5 ms
MTTD (Mean Time to Detect) | <2 minutes (automated alerting)
MTTR (Mean Time to Resolve) | ~15–30 min (with good observability)
Prometheus scrape interval | 15–30 seconds