Message Queues — System Design Pattern

Architecture Diagram

How It Works

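Producers send messages to a broker (Kafka, RabbitMQ, SQS). Consumers read from the broker asynchronously, at their own pace. The broker provides buffering, persistence, and delivery guarantees, so the two sides are decoupled in time and load: neither needs the other to be online or equally fast. They still share one contract, the message format.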

Pub/Sub vs Queue

Pub/Sub (Topic)

Each consumer group gets a copy of every message. Used for fan-out: one "order placed" event triggers email, analytics, inventory, and billing — independently. Kafka's primary model.

Queue (Competing Consumers)

Each message is handed to one consumer in the pool; under at-least-once delivery it can be redelivered if that consumer fails to ACK. Used for work distribution: 1000 image resize jobs → 10 workers each process ~100. RabbitMQ and SQS's primary model.
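The two delivery models can be contrasted with a small in-memory sketch. No real broker is involved; the function names are hypothetical, and round-robin dispatch is only an assumption standing in for whatever policy a real broker applies.

```python
from collections import defaultdict
from itertools import cycle


def fan_out(messages, groups):
    """Pub/sub: every consumer group receives its own copy of every message."""
    delivered = defaultdict(list)
    for msg in messages:
        for group in groups:
            delivered[group].append(msg)
    return dict(delivered)


def competing_consumers(messages, workers):
    """Queue: each message goes to one worker (round-robin is an assumed policy)."""
    delivered = {worker: [] for worker in workers}
    for msg, worker in zip(messages, cycle(workers)):
        delivered[worker].append(msg)
    return delivered


# Work distribution: 1000 resize jobs spread over 10 workers.
jobs = [f"resize-{i}" for i in range(1000)]
by_worker = competing_consumers(jobs, [f"worker-{i}" for i in range(10)])
print({worker: len(msgs) for worker, msgs in by_worker.items()})

# Fan-out: each downstream service sees every order event.
events = ["order-placed-1", "order-placed-2"]
by_group = fan_out(events, ["email", "analytics", "inventory", "billing"])
print(by_group["billing"])
```

In the queue case the total work is split; in the pub/sub case it is multiplied by the number of groups.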

Delivery Guarantees

At-most-once: Fire and forget. Message may be lost. Fastest. Good for metrics/logs where drops are OK.
At-least-once: Broker retries until ACK. Message may be delivered multiple times. Consumer must be idempotent. Most common in practice (Kafka default, SQS).
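Exactly-once: Extremely hard in distributed systems. Kafka achieves it within its own ecosystem (consume from Kafka, process, produce back to Kafka) using idempotent producers plus transactions. Across system boundaries the broker cannot guarantee it; the usual substitute is at-least-once delivery combined with idempotent or deduplicated writes on the receiving side.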
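Since at-least-once delivery requires an idempotent consumer, a minimal sketch of one may help. It assumes every message carries a unique "id" field (an assumption, not something the broker adds for you), and it keeps the seen IDs in memory, where a real system would need durable storage updated atomically with the side effect.

```python
class IdempotentConsumer:
    """Applies each message at most once by remembering processed message IDs."""

    def __init__(self):
        self.seen_ids = set()  # assumption: in-memory; real systems need a durable store
        self.balance = 0

    def handle(self, message):
        """Apply the message; return False if it was a duplicate delivery."""
        if message["id"] in self.seen_ids:
            return False
        self.balance += message["amount"]
        self.seen_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "m1", "amount": 50},
    {"id": "m2", "amount": 25},
    {"id": "m1", "amount": 50},  # redelivery, e.g. after a lost ACK
]
results = [consumer.handle(m) for m in deliveries]
print(results, consumer.balance)  # [True, True, False] 75
```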

Ordering

Global ordering across all messages is expensive (single partition bottleneck). Instead, use partition-level ordering: hash a key (e.g., user_id) to a partition. All messages for that key arrive in order. Different keys may interleave.
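Key-based partitioning can be sketched as follows. The MD5-based hash is only a stand-in chosen because it is stable across runs; it is not the hash function any particular broker uses, and partition_for is a hypothetical helper.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (MD5 here is an illustrative choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


NUM_PARTITIONS = 4
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add-to-cart"),
    ("user-2", "logout"),
    ("user-1", "checkout"),
]
for key, action in events:
    # Each partition is an append-only list, so per-key order is preserved.
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, action))

user1_actions = [a for log in partitions.values() for k, a in log if k == "user-1"]
print(user1_actions)  # ['login', 'add-to-cart', 'checkout']
```

Note that changing the partition count changes the key-to-partition mapping, so ordering for a key is only guaranteed while the count stays fixed.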

Consumer Lag

The difference between the latest produced offset and the consumer's current offset. High lag means consumers can't keep up. Monitor it — it's the #1 indicator of queue health. Solutions: add more consumers (up to partition count), optimize processing, or increase partition count.
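As a rough illustration of the lag calculation, the sketch below works on plain dictionaries of per-partition offsets. The offsets and the 5,000-message alert threshold are made-up values; in practice they would come from the broker's admin tooling and from your own service-level targets.

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag = latest produced offset minus the consumer's committed offset."""
    return {
        partition: latest - committed_offsets.get(partition, 0)
        for partition, latest in latest_offsets.items()
    }


def lagging_partitions(lag, threshold):
    """Partitions whose lag exceeds the (project-specific) alert threshold."""
    return [partition for partition, n in lag.items() if n > threshold]


latest = {0: 1_500, 1: 1_480, 2: 9_000}      # hypothetical sample offsets
committed = {0: 1_495, 1: 1_480, 2: 2_000}

lag = consumer_lag(latest, committed)
total = sum(lag.values())
lagging = lagging_partitions(lag, threshold=5_000)
print(lag, total, lagging)  # {0: 5, 1: 0, 2: 7000} 7005 [2]
```

Looking at lag per partition rather than only the total helps spot a single hot partition, which adding consumers will not fix.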

Key Design Decisions

Kafka vs RabbitMQ: Kafka is log-based, append-only, replay-friendly, partitioned, and pull-based (consumers fetch at their own pace). Best for event streaming, high throughput, and reprocessing. RabbitMQ is a traditional queue: push-based, with flexible routing through exchanges. Best for task queues, RPC, and complex routing logic.

Partition count: More partitions = more parallelism (max consumers = partition count) but more broker overhead and longer leader election. Start with #partitions = expected peak consumer count × 2.
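The rule of thumb above can be worked through in a few lines. Deriving the peak consumer count from a target throughput and a per-consumer rate is an added assumption for illustration, and the numbers are invented.

```python
import math


def suggested_partitions(target_msgs_per_sec, per_consumer_msgs_per_sec, headroom=2):
    """Partitions = expected peak consumer count x headroom (2 per the rule of thumb)."""
    # Assumption: peak consumers needed = target throughput / what one consumer sustains.
    peak_consumers = math.ceil(target_msgs_per_sec / per_consumer_msgs_per_sec)
    return peak_consumers * headroom


# 50,000 msgs/sec at 4,000 msgs/sec per consumer -> 13 consumers -> 26 partitions.
print(suggested_partitions(50_000, 4_000))  # 26
```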

Retention policy: Kafka can retain messages for days or weeks using time-based or size-based retention, or keep the latest record per key indefinitely using log compaction. SQS retains up to 14 days. Longer retention = ability to replay events and rebuild state, but more storage cost.
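The storage side of that trade-off is simple arithmetic. The sketch below assumes uncompressed messages and a replication factor of 3; both are assumptions to adjust for your own cluster, and the workload figures are illustrative.

```python
def retention_storage_gb(msgs_per_sec, avg_msg_bytes, retention_days, replication_factor=3):
    """Approximate disk needed to hold `retention_days` of traffic, in GB (10^9 bytes)."""
    seconds = retention_days * 24 * 60 * 60
    total_bytes = msgs_per_sec * avg_msg_bytes * seconds * replication_factor
    return total_bytes / 1e9


# 10,000 msgs/sec of ~1 KB messages kept for 7 days, replicated 3x.
week_gb = retention_storage_gb(10_000, 1_000, 7)
print(f"{week_gb:,.0f} GB")  # 18,144 GB, i.e. roughly 18 TB
```

Doubling retention doubles the figure, which is why long replay windows are usually paired with compression or tiered storage.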

Dead Letter Queue (DLQ): Messages that fail processing N times go to a DLQ for inspection. Essential for debugging. Without it, poison messages block the queue or are silently dropped.
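A minimal in-memory sketch of the retry-then-DLQ flow is shown below. It is not tied to any broker; process_with_dlq, parse_amount, and the limit of 3 attempts are hypothetical choices for illustration. Real brokers typically track the attempt count for you (for example through a redelivery or receive counter).

```python
from collections import deque


def process_with_dlq(messages, handler, max_attempts=3):
    """Retry failing messages up to max_attempts, then park them in a dead letter list."""
    queue = deque((msg, 0) for msg in messages)
    processed, dead_letters = [], []
    while queue:
        msg, attempts = queue.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the error alongside the message so it can be inspected later.
                dead_letters.append({"message": msg, "attempts": attempts, "error": str(exc)})
            else:
                queue.append((msg, attempts))  # requeue behind the other messages
    return processed, dead_letters


def parse_amount(msg):
    """Example handler: fails on anything that is not an integer string."""
    return int(msg)


processed, dead = process_with_dlq(["10", "oops", "30"], parse_amount)
print(processed)                                # ['10', '30']
print(dead[0]["message"], dead[0]["attempts"])  # oops 3
```

The poison message "oops" is retried without blocking "30", and ends up in the dead letter list with its error attached instead of being dropped.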

When to Use

Message queues appear in almost every system design that needs async processing or service decoupling.

"Design a notification system" — Queue notification events, fan-out to push/email/SMS services
"Design an e-commerce platform" — Order events → inventory, payment, shipping (pub/sub)
"Handle a traffic spike" — Queue absorbs burst, workers process at steady rate (load leveling)
"Design a log aggregation pipeline" — Kafka as the central bus (producers: apps → Kafka → consumers: Elasticsearch, S3)
"Build an event sourcing system" — Kafka as the immutable event log

Interview signal: Show you understand when sync (HTTP) vs async (queue) is appropriate, and discuss idempotency, ordering, and failure handling.

Real-World Examples

LinkedIn — Created Kafka. Processes 7+ trillion messages/day across 100+ clusters. Powers activity tracking, metrics, and data pipelines.
Uber — Uses Kafka for trip events, driver location updates, surge pricing calculations. Billions of events/day with strict ordering per trip.
Netflix — Kafka for real-time analytics and event processing. Also uses SQS for encoding pipeline job distribution.
Shopify — Kafka for order events processed by 100+ microservices. Pub/sub ensures each service (inventory, billing, shipping) gets every order event.

Back-of-Envelope Numbers

Metric	Value
Kafka throughput (single broker)	~200K–800K msgs/sec
Kafka throughput (cluster, sustained)	~millions msgs/sec
Kafka latency (producer → consumer)	~2–10 ms (p99)
RabbitMQ throughput	~20K–50K msgs/sec
SQS throughput (standard)	Unlimited (soft)
SQS throughput (FIFO)	~300 msgs/sec/queue unbatched, ~3000 with batching (higher in high-throughput mode)
Kafka message size (typical)	~1 KB (max 1 MB default)
LinkedIn Kafka (2023)	7+ trillion msgs/day
Consumer lag threshold (alert)	Project-specific, often > 10K
