The three pillars of understanding production systems: Logs, Metrics, and Distributed Traces. How to detect, locate, and diagnose problems across thousands of microservices.
Observability is about understanding what's happening inside your distributed system from the outside. Three complementary pillars each answer a different question when something goes wrong.
Discrete events with timestamp and context. Structured logging (JSON) over unstructured text. Ship to Elasticsearch/Loki/CloudWatch. Use log levels: ERROR (pages you), WARN (needs attention), INFO (business events), DEBUG (development only).
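A minimal sketch of structured JSON logging using Python's stdlib `logging` (the `JsonFormatter` class and the `fields` convention here are illustrative, not a standard API):

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (illustrative sketch)."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # structured context, queryable in Loki/ES
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# INFO-level business event with structured context attached as key/value fields
logger.info("order placed", extra={"fields": {"order_id": "ord_42", "amount_cents": 1999}})
```

Because each line is valid JSON, a log backend can index `order_id` directly instead of regex-matching free text.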
Numeric measurements over time. Request count, p99 latency, error rate, CPU usage. Time-series DBs: Prometheus, Datadog. Alert on thresholds. RED method: Rate, Error rate, Duration. USE method: Utilization, Saturation, Errors.
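The RED numbers fall out of a window of request samples; a sketch (the `red_metrics` helper is hypothetical, using the nearest-rank method for p99):

```python
import math

def red_metrics(requests, window_seconds):
    """Compute RED (Rate, Error rate, Duration) over a window of samples.
    Each sample is (latency_seconds, is_error)."""
    n = len(requests)
    rate = n / window_seconds                              # requests per second
    error_rate = sum(1 for _, err in requests if err) / n  # fraction of errors
    latencies = sorted(lat for lat, _ in requests)
    p99 = latencies[min(n - 1, math.ceil(0.99 * n) - 1)]   # nearest-rank p99
    return rate, error_rate, p99

# 98 fast requests, one slow, one slow error, over a 10-second window
samples = [(0.05, False)] * 98 + [(0.90, False), (1.20, True)]
rate, err, p99 = red_metrics(samples, window_seconds=10)
# rate = 10.0 req/s, err = 0.01, p99 = 0.90
```

In practice a time-series DB like Prometheus computes these from histograms; the point is that p99 surfaces the slow tail that an average would hide.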
Follow a request across services. Each request gets a trace_id; each service adds a span with timing. Visualizes the full call chain as a waterfall. Tools: Jaeger, Zipkin, OpenTelemetry. Essential for microservices debugging.
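A toy illustration of trace/span propagation (real systems use the OpenTelemetry SDK; this `Span` class is illustrative only):

```python
import time
import uuid

class Span:
    """Minimal span: shared trace_id, unique span_id, parent link, timing."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # one id for the whole request
        self.span_id = uuid.uuid4().hex[:16]
        self.parent_id = parent_id
        self.start = time.monotonic()
        self.duration_ms = None

    def child(self, name):
        # Propagate trace_id so every hop in the call chain shares one trace
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000
        return self

root = Span("GET /checkout")   # entry service starts the trace
db = root.child("db.query")    # downstream call inherits trace_id, links to parent
db.finish()
root.finish()
```

The parent/child links are what let Jaeger or Zipkin render the waterfall view.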
Include trace_id in all logs so you can jump: dashboard (metric spike) → trace (which service is slow) → logs (exact error message). Without correlation, you're searching haystacks across three separate systems.
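One common way to stamp every log line with the active trace_id is a context variable set once at the request entry point; a hedged sketch (the `log` helper and variable names are illustrative):

```python
import contextvars
import json

# Holds the current request's trace_id for the duration of that request
current_trace_id = contextvars.ContextVar("trace_id", default=None)

def log(level, message, **fields):
    """Emit one JSON log line, always stamped with the active trace_id."""
    return json.dumps({"level": level, "message": message,
                       "trace_id": current_trace_id.get(), **fields})

# Set at the service entry point (e.g. from the incoming trace header)
current_trace_id.set("4bf92f3577b34da6")
line = log("ERROR", "payment declined", order_id="ord_42")
# A slow span found in the trace view can now be matched to its exact
# error logs with a single trace_id query.
```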
Sampling rate: 100% tracing = perfect visibility but massive storage and ~5ms overhead per span. 1% sampling = cheap but you miss rare issues. Solution: adaptive sampling, which keeps 100% of errors, 100% of slow requests (>1s), and 1% of normal requests.
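The adaptive policy reduces to a small decision function; a sketch (`should_sample` is a hypothetical name, and real collectors implement this as head/tail sampling in the pipeline):

```python
import random

def should_sample(is_error, duration_ms, base_rate=0.01, rng=random.random):
    """Adaptive sampling: keep every error and every slow request,
    sample the rest at base_rate (1% here). rng is injectable for testing."""
    if is_error:
        return True             # 100% of errors
    if duration_ms > 1000:
        return True             # 100% of slow requests (>1s)
    return rng() < base_rate    # 1% of normal traffic

assert should_sample(is_error=True, duration_ms=20)      # error: always kept
assert should_sample(is_error=False, duration_ms=1500)   # slow: always kept
```

The rare, interesting traces survive while storage stays near the 1% baseline, since errors and slow requests are a tiny fraction of traffic.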
Metric cardinality: Don't use high-cardinality labels. user_id as a metric label = millions of time series = Prometheus explodes. Use high-cardinality data in logs/traces, not metrics. Metrics should have bounded label sets.
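Why user_id explodes: the worst-case series count for one metric is the product of each label's cardinality. A quick illustration (label sets are made up for the example):

```python
from math import prod

def series_count(label_values):
    """Worst-case time series for one metric = product of label cardinalities."""
    return prod(len(values) for values in label_values.values())

bounded = {"method": ["GET", "POST", "PUT", "DELETE"],   # 4 values
           "status": ["2xx", "4xx", "5xx"],              # 3 values
           "service": [f"svc-{i}" for i in range(50)]}   # 50 values
print(series_count(bounded))    # 4 * 3 * 50 = 600 series: fine

# Add a user_id label with 100K users and the same metric blows up
unbounded = dict(bounded, user_id=[str(i) for i in range(100_000)])
print(series_count(unbounded))  # 60,000,000 series: Prometheus explodes
```

This is why per-user detail belongs in logs and traces, which are indexed per event rather than per series.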
Log volume management: At scale, logging everything is prohibitively expensive (~1PB/day at Uber). Sample high-traffic paths (10%), keep 100% for payment/auth. Structured logging enables efficient querying without storing full text.
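A sketch of per-path log sampling (the path names, rates, and `should_log` helper are assumptions for illustration):

```python
import random

# Paths whose logs are always kept, regardless of volume (assumed set)
CRITICAL_PATHS = {"/payment", "/auth"}

def should_log(path, sample_rate=0.10, rng=random.random):
    """Keep 100% of critical-path logs; sample high-traffic paths at 10%."""
    if path in CRITICAL_PATHS:
        return True
    return rng() < sample_rate
```

A sampled log line should record its sample rate so counts can be scaled back up during analysis.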
OpenTelemetry vs vendor-specific: OpenTelemetry provides a single SDK for all three pillars with vendor-agnostic export. Avoids lock-in to Datadog/New Relic/Splunk. The industry is converging on OTel as the standard.
Mentioning observability in a system design interview shows operational maturity: you're not just designing the happy path.
Interview signal: Walk through a concrete debugging flow: "I'd check metrics first for which service, then pull a trace to find the slow span, then check that span's logs for the error." This structured approach is exactly how senior engineers debug.
| Metric | Value |
|---|---|
| Uber log volume (compressed) | ~1 PB/day |
| Metrics per microservice | ~100 time series |
| 2,000 services × 100 metrics | ~200K time series |
| Trace storage (1% sampling) | ~50 GB/day (10M traces × 5 KB) |
| Tracing overhead per span | ~1–5 ms |
| MTTD (Mean Time to Detect) | <2 minutes (automated alerting) |
| MTTR (Mean Time to Resolve) | ~15–30 min (with good observability) |
| Prometheus scrape interval | 15–30 seconds |
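The derived rows in the table can be sanity-checked with back-of-envelope arithmetic:

```python
# Total time series: per-service metric count times service count
services, metrics_each = 2000, 100
total_series = services * metrics_each          # 200,000 (~200K time series)

# Trace storage at 1% sampling: traces kept per day times bytes per trace
traces_per_day, bytes_per_trace = 10_000_000, 5_000
storage_gb = traces_per_day * bytes_per_trace / 1e9  # 50.0 GB/day

print(total_series, storage_gb)
```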