Decouple producers from consumers with asynchronous messaging. Enable load leveling, fault isolation, and event-driven architecture at any scale.
Medium, Very High Frequency
Producers send messages to a broker (Kafka, RabbitMQ, SQS). Consumers read from the broker asynchronously. The broker provides buffering, persistence, and delivery guarantees, decoupling the two sides completely.
Pub/Sub: Each consumer group gets a copy of every message. Used for fan-out: one "order placed" event triggers email, analytics, inventory, and billing independently. Kafka's primary model.
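A minimal sketch of fan-out, assuming a toy in-memory log rather than a real broker. The `Log` class and the group names are hypothetical. The point it illustrates is that each consumer group keeps its own read offset, so every group sees every message.

```python
from collections import defaultdict


class Log:
    """Toy append-only log with one read offset per consumer group."""

    def __init__(self):
        self.messages = []
        self.offsets = defaultdict(int)  # group name -> next index to read

    def publish(self, message):
        self.messages.append(message)

    def poll(self, group):
        """Return every message this group has not read yet."""
        start = self.offsets[group]
        batch = self.messages[start:]
        self.offsets[group] = len(self.messages)
        return batch


log = Log()
log.publish("order placed: #1")
log.publish("order placed: #2")

# Two independent groups each receive both events.
email_batch = log.poll("email")
billing_batch = log.poll("billing")
```

Reading from one group does not advance the offset of any other group, which is why a slow analytics consumer does not hold up billing.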
Point-to-point: Each message is delivered to exactly one consumer. Used for work distribution: 1000 image resize jobs → 10 workers each process ~100. RabbitMQ and SQS's primary model.
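The work-distribution case can be sketched with competing consumers pulling from one shared queue. This uses Python's standard `queue.Queue` as a stand-in for a broker. `run_workers` is a hypothetical helper, and the "job" is a no-op in place of an image resize.

```python
import queue
import threading
from collections import Counter


def run_workers(num_jobs, num_workers):
    jobs = queue.Queue()
    for job_id in range(num_jobs):
        jobs.put(job_id)

    processed = Counter()  # worker id -> number of jobs handled
    seen = []              # every job id that some worker handled
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                job_id = jobs.get_nowait()  # each job goes to one worker only
            except queue.Empty:
                return
            with lock:
                processed[worker_id] += 1
                seen.append(job_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed, seen


processed, seen = run_workers(num_jobs=1000, num_workers=10)
```

Every job is handled exactly once, but the split across workers is not guaranteed to be even. With trivially fast jobs, one thread may take most of them. Real workloads with slower jobs tend to spread out closer to the ~100 per worker mentioned above.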
Ordering: Global ordering across all messages is expensive (single partition bottleneck). Instead, use partition-level ordering: hash a key (e.g., user_id) to a partition. All messages for that key arrive in order. Different keys may interleave.
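A small sketch of key-based partitioning. It assumes CRC32 as a stand-in hash (real brokers and client libraries choose their own hash function). `partition_for` and `events_for` are hypothetical helper names.

```python
import zlib

NUM_PARTITIONS = 4


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (same key, same partition)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("user_1", "login"),
    ("user_2", "login"),
    ("user_1", "add_to_cart"),
    ("user_2", "logout"),
    ("user_1", "checkout"),
]

# Append each event to the partition chosen by its key.
for key, event in events:
    partitions[partition_for(key)].append((key, event))


def events_for(key):
    """Read one key's events back from its partition, in arrival order."""
    return [e for k, e in partitions[partition_for(key)] if k == key]
```

Because all events for `user_1` land in the same partition, their relative order is preserved even though events for `user_2` are interleaved in the overall stream. Note that changing the partition count changes the key-to-partition mapping.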
Consumer lag: The difference between the latest produced offset and the consumer's current offset. High lag means consumers can't keep up. Monitor it: it's the #1 indicator of queue health. Solutions: add more consumers (up to partition count), optimize processing, or increase partition count.
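The lag calculation is simple arithmetic per partition. The sketch below assumes the offsets have already been fetched from the broker. The offset values are made up for illustration, and the 10,000 alert threshold is only an example since the right value is project-specific.

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: latest produced offset minus the consumer's offset."""
    lag = {}
    for partition, latest in latest_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(latest - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
latest = {0: 15_000, 1: 9_200, 2: 30_500}
committed = {0: 14_800, 1: 9_200, 2: 12_000}

lag = consumer_lag(latest, committed)
total_lag = sum(lag.values())

ALERT_THRESHOLD = 10_000  # example value only
should_alert = total_lag > ALERT_THRESHOLD
```

Here partition 2 accounts for almost all of the lag, which hints at a hot key or a stuck consumer rather than a general capacity problem. Looking at lag per partition, not only the total, helps tell those cases apart.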
Kafka vs RabbitMQ: Kafka is log-based, append-only, replay-friendly, partitioned, and pull-based. Best for event streaming, high throughput, and reprocessing. RabbitMQ is a traditional queue: push-based, with flexible routing (exchanges). Best for task queues, RPC, and complex routing logic.
Partition count: More partitions = more parallelism (max consumers = partition count) but more broker overhead and longer leader election. Start with #partitions = expected peak consumer count × 2.
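The sizing rule of thumb and the parallelism cap can be written out as two tiny functions. Both names are hypothetical, and the factor of 2 is the starting heuristic from the text, not a hard rule.

```python
def initial_partition_count(expected_peak_consumers, headroom=2):
    """Starting point: expected peak consumer count times a headroom factor."""
    return expected_peak_consumers * headroom


def active_consumers(num_consumers, num_partitions):
    """Within one consumer group, consumers beyond the partition count sit idle."""
    return min(num_consumers, num_partitions)


partitions = initial_partition_count(expected_peak_consumers=12)
busy_now = active_consumers(num_consumers=10, num_partitions=partitions)
busy_if_overscaled = active_consumers(num_consumers=30, num_partitions=partitions)
```

With 24 partitions, scaling the group from 10 to 24 consumers adds parallelism, but going to 30 leaves 6 consumers with nothing to read.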
Retention policy: Kafka can retain messages for days or weeks with time-based (or size-based) retention, or keep the latest value per key indefinitely with log compaction. SQS retains messages for up to 14 days. Longer retention = ability to replay events and rebuild state, but more storage cost.
Dead Letter Queue (DLQ): Messages that fail processing N times go to a DLQ for inspection. Essential for debugging. Without it, poison messages block the queue or are silently dropped.
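The retry-then-dead-letter flow can be sketched in a few lines. This is an in-memory illustration, not any broker's API. `drain`, `handler`, and the limit of 3 attempts are assumptions chosen for the example.

```python
from collections import deque

MAX_ATTEMPTS = 3  # the "N" from the text; example value


def drain(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Process messages, retrying failures and dead-lettering after N attempts."""
    main_queue = deque((m, 0) for m in messages)  # (message, failed attempts)
    processed, dead_letters = [], []

    while main_queue:
        message, attempts = main_queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Park the message with its error for later inspection.
                dead_letters.append((message, repr(exc)))
            else:
                # Requeue at the back so other messages are not blocked.
                main_queue.append((message, attempts))

    return processed, dead_letters


def handler(message):
    if message == "poison":
        raise ValueError("cannot parse message")


processed, dead_letters = drain(["a", "poison", "b"], handler)
```

The poison message fails three times and ends up in `dead_letters` along with the error, while "a" and "b" are processed normally. Without the attempt limit, the loop would retry the poison message forever.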
Message queues appear in almost every system design that needs async processing or service decoupling.
Interview signal: Show you understand when sync (HTTP) vs async (queue) is appropriate, and discuss idempotency, ordering, and failure handling.
| Metric | Value |
|---|---|
| Kafka throughput (single broker) | ~200K–800K msgs/sec |
| Kafka throughput (cluster, sustained) | ~millions msgs/sec |
| Kafka latency (producer → consumer) | ~2–10 ms (p99) |
| RabbitMQ throughput | ~20Kโ50K msgs/sec |
| SQS throughput (standard) | Unlimited (soft) |
| SQS throughput (FIFO) | ~3000 msgs/sec/queue with batching (~300 without) |
| Kafka message size (typical) | ~1 KB (max 1 MB default) |
| LinkedIn Kafka (2023) | 7+ trillion msgs/day |
| Consumer lag threshold (alert) | Project-specific, often > 10K |