⚠️ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram

Components shown in the diagram:

  • API Gateway: auth, routing, rate limit
  • Order Service: creates orders
  • Payment Service: charges card; called over REST / gRPC, sync (~20ms), behind a circuit breaker (CLOSED / OPEN)
  • Message Queue (Kafka / RabbitMQ): receives the published event
  • Notification Svc: consumes the event and sends email/push

Synchronous path: User waits. Must be fast. Circuit breaker protects.

Asynchronous path: Fire-and-forget. Retry on failure.

When the circuit breaker opens (50% errors):

  → Return fallback immediately
  → Don't call Payment (60s cooldown)
  → Half-open: try 1 request to test

How It Works

Microservices communicate via two fundamental styles. The key insight: split your request flow into a synchronous path (user-facing, must be fast) and an asynchronous path (background, can be slow). Protect sync calls with circuit breakers.

Synchronous vs Asynchronous

Synchronous (REST / gRPC)

Service A calls Service B and waits for a response. Simple to understand and debug. But: creates temporal coupling — if B is slow or down, A is stuck. Chains (A→B→C→D) compound latency. Use for user-facing calls that must be fast.

Asynchronous (Message Queue / Events)

Service A publishes an event and continues without waiting. Decoupled: B processes at its own pace. Handles failures via retry + dead letter queue. But: harder to debug, eventual consistency. Use for background work.
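
The two styles can be contrasted in a few lines. This is a minimal sketch, assuming an in-process `queue.Queue` as a stand-in for a real broker such as Kafka or RabbitMQ; `charge_card`, `place_order`, `drain`, and `MAX_ATTEMPTS` are hypothetical names chosen for illustration.

```python
import queue

MAX_ATTEMPTS = 3  # hypothetical limit before an event is dead-lettered


def charge_card(order_id):
    """Synchronous step: the caller blocks until this returns."""
    return {"order_id": order_id, "status": "charged"}


def place_order(order_id, events):
    receipt = charge_card(order_id)  # sync: the user waits for this result
    # async: publish and move on; nobody waits for the notification
    events.put(({"type": "order_paid", "order_id": order_id}, 0))
    return receipt


def drain(events, dead_letters, handler):
    """Consumer loop: retry failed events, then park them in a dead letter queue."""
    while not events.empty():
        event, attempts = events.get()
        try:
            handler(event)
        except Exception:
            if attempts + 1 < MAX_ATTEMPTS:
                events.put((event, attempts + 1))  # requeue for another try
            else:
                dead_letters.put(event)            # give up, keep for inspection
```

A real consumer would run in a separate process and pace its retries; the sketch only shows that the publisher returns before the consumer does any work.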

Circuit Breaker Pattern

If a downstream service fails repeatedly, stop calling it. The circuit breaker has three states:

  1. Closed (normal): Requests flow through. Track error rate.
  2. Open (tripped): Entered when the error rate exceeds a threshold (e.g., 50% in 30s). Stop calling the downstream service and return a fallback/error immediately. A cooldown timer starts (e.g., 60s).
  3. Half-Open (testing): After cooldown, allow one test request. If it succeeds → Closed. If it fails → Open again.
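
The three states above can be sketched as a small class. This is an illustrative, single-threaded sketch and not any particular library's API: the class name, the `min_calls` guard against tripping on tiny samples, and the injectable `clock` are assumptions added for the example. The defaults mirror the 50% / 30s / 60s figures used in this guide.

```python
import time


class CircuitBreaker:
    """Closed -> Open -> Half-Open -> Closed or Open again."""

    def __init__(self, threshold=0.5, window=30.0, cooldown=60.0,
                 min_calls=10, clock=time.monotonic):
        self.threshold = threshold  # error rate that trips the breaker
        self.window = window        # seconds of history to consider
        self.cooldown = cooldown    # seconds to stay open before probing
        self.min_calls = min_calls  # assumption: ignore very small samples
        self.clock = clock          # injectable so tests can control time
        self.state = "closed"
        self.opened_at = 0.0
        self.results = []           # (timestamp, succeeded) pairs

    def call(self, fn, fallback):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.cooldown:
                return fallback()     # fail fast; downstream is not called
            self.state = "half_open"  # cooldown over: let one probe through
        try:
            value = fn()
        except Exception:
            self._record(now, False)
            return fallback()
        self._record(now, True)
        return value

    def _record(self, now, ok):
        if self.state == "half_open":
            # The single probe decides the next state.
            if ok:
                self.state = "closed"
                self.results = []
            else:
                self._trip(now)
            return
        # Keep only results inside the sliding window, then add the new one.
        self.results = [(t, r) for t, r in self.results if now - t <= self.window]
        self.results.append((now, ok))
        failures = sum(1 for _, r in self.results if not r)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) > self.threshold):
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now
        self.results = []
```

A caller would wrap each downstream request, for example `breaker.call(lambda: charge(order), lambda: {"status": "pending"})`, where `charge` is a hypothetical client function. A production version would also need locking so that only one probe runs in the half-open state.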

Other Key Patterns

  • API Gateway: Single entry point for clients. Handles auth, routing, rate limiting, protocol translation. Kong, AWS API Gateway.
  • Service Discovery: Services register themselves; others look them up. Client-side (Eureka) or server-side (K8s DNS, Consul).
  • Saga: Distributed transaction as a sequence of local transactions. Each step has a compensating action for rollback. Order→Payment→Inventory, with cancel-order as compensation.
  • Sidecar / Service Mesh: Proxy alongside each service handles networking (Envoy/Istio). Circuit breaking, retries, mTLS without code changes.
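
Of these patterns, the saga is the easiest to show in code. The sketch below is a minimal in-process orchestrator, assuming each step is a pair of plain callables (action, compensation); `run_saga` is a hypothetical helper, and real sagas run their steps in separate services and persist progress between them.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, newest first, and report failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

For an Order→Payment→Inventory flow, a failure in the inventory step would trigger the payment refund first and the order cancellation second, which is the reverse of the order in which the steps ran.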

Key Design Decisions

Sync vs async for payments: Synchronous = user waits for payment confirmation (slow but certain). Async = instant "order placed!" but payment might fail later (fast but requires compensation). DoorDash chose async + saga: show "order confirmed" immediately, handle payment failure via cancellation.

Retry strategy: Simple retries can cause thundering herd (all clients retry at once). Use exponential backoff with jitter: delay = min(base * 2^attempt + random_jitter, max_delay). Set a retry budget (e.g., max 3 retries, max 20% of requests are retries).
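
The delay formula can be written out directly. This is a sketch under stated assumptions: the jitter range of [0, base) and the default values for `base` and `max_delay` are illustrative choices, and the retry budget (the 20% cap) is not implemented here.

```python
import random
import time


def backoff_delay(attempt, base=0.1, max_delay=10.0, rng=random.random):
    """delay = min(base * 2^attempt + random_jitter, max_delay), in seconds."""
    jitter = rng() * base  # assumption: jitter drawn from [0, base)
    return min(base * 2 ** attempt + jitter, max_delay)


def call_with_retries(fn, max_retries=3, sleep=time.sleep):
    """Call fn; after a failure, wait backoff_delay(attempt) and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(backoff_delay(attempt))
```

One caveat with adding jitter before the cap: once `base * 2^attempt` exceeds `max_delay`, every client waits exactly `max_delay` and the jitter is lost. Drawing the whole delay at random from the range 0 to the capped value is a common variant that avoids this.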

Distributed tracing: When a request touches 6+ services, debugging without tracing is impractical. Assign a correlation ID at the gateway and propagate it through every downstream call, sync and async. Tools: Jaeger, Zipkin, OpenTelemetry. Treat it as non-negotiable for microservices.
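
Propagation itself is simple, as this sketch suggests. It assumes headers are plain dictionaries and uses `X-Correlation-ID` as a hypothetical header name; the function names are also illustrative. Tracing libraries such as OpenTelemetry handle this for you and typically use the W3C `traceparent` header instead.

```python
import uuid

HEADER = "X-Correlation-ID"  # hypothetical header name, not a standard


def ensure_correlation_id(headers):
    """At the gateway: reuse an incoming ID or mint a new one."""
    out = dict(headers)
    out.setdefault(HEADER, str(uuid.uuid4()))
    return out


def outgoing_headers(incoming, extra=None):
    """In each service: copy the ID from the incoming request onto downstream calls."""
    out = dict(extra or {})
    out[HEADER] = incoming[HEADER]
    return out
```

Each service would also include the ID in its log lines, so that one search returns the full path of a request.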

Monolith first: Don't start with microservices. Start monolithic, extract services when you have clear domain boundaries and team scaling needs. A well-structured monolith beats a poorly designed microservices architecture every time.

When to Use

  • "How do services communicate?": Discuss sync vs async tradeoffs with concrete examples.
  • "What happens when Service B is down?": Circuit breaker + retry with backoff + fallback response.
  • "How do you maintain consistency across services?": Saga pattern + outbox for reliable event publishing.
  • "How does the client know which service to call?": API Gateway for external clients, service discovery for internal.

Interview signal: Draw the sequence diagram split into sync and async paths. For every async step, explain what happens on failure (saga compensation). Mention circuit breakers for sync calls without being asked β€” interviewers love hearing about it.

Real-World Examples

  • Netflix Hystrix / Resilience4j: Netflix built Hystrix, its circuit breaker library, because cascading failures across its ~700 microservices kept causing outages. Hystrix is now in maintenance mode, and Resilience4j is a common replacement in Java services.
  • DoorDash: Order flow touches 6 services. Menu + Pricing (sync gRPC, <100ms). Payment + Dispatch + Notification (async via Kafka). Circuit breaker on Menu Service with cached fallback.
  • Envoy Proxy (Lyft/Istio): Built at Lyft and used as the data-plane proxy in Istio. Implements circuit breaking, retries, and load balancing as a sidecar. Services don't need library-level changes. All inter-service traffic routes through Envoy.
  • Uber: 2000+ microservices. Strict timeout budgets per call. Request hedging for latency-critical paths. Full distributed tracing with Jaeger (which Uber built).

Back-of-Envelope Numbers

Metric                                   Value
DoorDash orders/sec (dinner peak)        ~200/sec (each touching 6 services)
Sync call latency budget (user-facing)   p99 < 100ms
Async processing time (background)       p99 < 5 seconds
Circuit breaker error threshold          50% error rate in 30s window
Circuit breaker cooldown                 60 seconds, then half-open
Retry budget                             Max 3 retries, ≤20% of total requests
Netflix microservices count              ~700+ services
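
A short calculation shows how the figures above combine. It rests on two assumptions that are not stated in the table: each of the 6 services is called once per order, and "≤20% of total requests" means retries make up at most 20% of all calls, including the retries themselves. `peak_call_rates` is a hypothetical helper.

```python
def peak_call_rates(orders_per_sec=200, services_per_order=6, retry_share_pct=20):
    """Return (first-attempt calls/sec, ceiling with retries, retry headroom)."""
    base = orders_per_sec * services_per_order
    # If retries are at most retry_share_pct of the total, then
    # total <= base / (1 - share). Integer math keeps the result exact.
    ceiling = base * 100 // (100 - retry_share_pct)
    return base, ceiling, ceiling - base


print(peak_call_rates())  # (1200, 1500, 300)
```

Under those assumptions, 200 orders/sec becomes 1,200 service calls/sec, and the retry budget allows up to 300 more, so downstream capacity should be sized for about 1,500 calls/sec at peak.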