โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Kafka Architecture

[Diagram: a 3-broker Kafka cluster hosting topic "orders" with 3 partitions. Each partition has a leader on one broker (P0 on Broker 1, P1 on Broker 2, P2 on Broker 3) and replicas on the other brokers. A producer (Order Service) writes to the partition leaders; a consumer group of three consumers reads P0, P1, and P2 respectively.]

Partition Internals

[Diagram: Partition 0 as an append-only log. The oldest offsets at the head are subject to retention-based deletion; the producer appends at the tail (offset N+1); a consumer with committed offset 5 reads forward, giving a consumer lag of N − 5 messages.]

How It Works

Kafka is a distributed commit log, not a traditional message queue. Messages are appended to an immutable, ordered log (a partition) and are not deleted after consumption โ€” they're retained based on time or size policy. This fundamental difference means multiple consumers can independently read the same data at different speeds.
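This difference can be made concrete with a toy model (an assumption for illustration, not the real Kafka API): a partition is just an append-only list, and each consumer tracks its own position, so nothing is deleted when a message is read.

```python
# Toy model of a partition as an append-only log. Messages are never
# mutated or removed on read; each consumer keeps its own offset.

class Partition:
    def __init__(self):
        self.log = []

    def append(self, msg):
        self.log.append(msg)
        return len(self.log) - 1   # offset of the new record

    def read(self, offset):
        return self.log[offset]

p = Partition()
for evt in ["order-created", "order-paid", "order-shipped"]:
    p.append(evt)

# Two independent consumers at different positions in the same log.
fast_consumer_offset = 3   # caught up
slow_consumer_offset = 1   # still has offsets 1 and 2 to read

print(p.read(slow_consumer_offset))   # "order-paid" — nothing was deleted
```

Because the log is immutable, the slow consumer loses nothing by falling behind; it simply reads forward from its own offset whenever it catches up.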

Topics, Partitions, Offsets

A topic is a logical channel (e.g., orders). Each topic is split into partitions โ€” the unit of parallelism. Within a partition, every message gets a monotonically increasing offset. Ordering is guaranteed within a partition only. Across partitions, there's no global order.

Why Is Kafka So Fast?

  • Sequential I/O — Writes append to the end of a file. A spinning disk sustains roughly 100 MB/s of sequential throughput but only on the order of 100 random IOPS, so Kafka treats disks like tape.
  • OS Page Cache โ€” Kafka doesn't maintain its own buffer pool. It relies on the OS page cache, meaning data is often served from RAM without JVM GC overhead.
  • Zero-Copy โ€” Uses sendfile() syscall to transfer data directly from page cache to the network socket, bypassing user-space entirely. Saves 2 copies + 2 context switches per read.
  • Batching โ€” Producers batch messages before sending. Consumers fetch in batches. Fewer network round trips, better compression.

Consumer Groups & Partition Assignment

Each consumer in a consumer group is assigned a subset of partitions. Within a group, each partition is consumed by exactly one consumer. If you have 3 partitions and 3 consumers, each gets one partition. If you add a 4th consumer, it sits idle. If a consumer dies, its partitions are rebalanced to the remaining consumers.
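The assignment rule described above can be sketched with a simple round-robin distribution (an assumption for illustration; real Kafka uses pluggable assignors such as range or cooperative-sticky, with the same invariant: one consumer per partition within a group).

```python
# Round-robin sketch of partition assignment within one consumer group.
# Invariant: each partition goes to exactly one consumer; extras sit idle.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
print(a)   # c4 gets nothing: more consumers than partitions leaves one idle
```

Removing a consumer and re-running `assign` models a rebalance: the same partitions are simply redistributed over the remaining members.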

Multiple consumer groups can independently read the same topic โ€” this is how Kafka supports both "queue" and "pub/sub" semantics.

Producer Guarantees

Acknowledgment Modes (acks)

acks=0

Fire and forget. Producer doesn't wait for any acknowledgment. Maximum throughput, but messages can be lost if the broker crashes before writing to disk. Use for metrics/logs where loss is tolerable.

acks=1

Leader acknowledges after writing to its local log. If the leader crashes before replicas catch up, data is lost. Good balance of throughput and durability for most use cases.

acks=all (-1)

Leader waits until all in-sync replicas (ISR) have written the message. Strongest durability guarantee. Combined with min.insync.replicas=2, ensures at least 2 copies before ack.

Idempotent Producer

Set enable.idempotence=true. Each producer gets a PID + sequence number. Broker deduplicates retries. Prevents duplicates from network retries. Required for exactly-once.
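A durability-focused configuration might look like the sketch below (an assumption: the key names follow the standard Java client; the retries value is illustrative, and min.insync.replicas is a topic/broker setting shown only for context).

```python
# Sketch of producer settings for strong durability guarantees.

producer_config = {
    "acks": "all",               # wait for all in-sync replicas to ack
    "enable.idempotence": True,  # PID + sequence numbers deduplicate retries
    "retries": 2147483647,       # retry transient errors; idempotence prevents dupes
}

# Set on the topic (or broker default), not the producer:
topic_config = {
    "min.insync.replicas": 2,    # at least 2 copies before a write is acked
}
```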

Batching & Performance Tuning

linger.ms controls how long the producer waits to fill a batch before sending. batch.size sets the max batch size in bytes. Higher linger = better throughput but higher latency. Typical: linger.ms=5, batch.size=64KB.
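The throughput effect of batching can be estimated with simple arithmetic (a simplified model, assuming every batch fills completely before linger.ms expires):

```python
# How many produce requests a fully-batched producer makes, versus
# one request per message without batching.

def batches_needed(num_messages, msg_size_bytes, batch_size_bytes):
    per_batch = batch_size_bytes // msg_size_bytes
    return -(-num_messages // per_batch)   # ceiling division

# 100K messages of 1 KB with a 64 KB batch: 64 messages per request.
print(batches_needed(100_000, 1_024, 64 * 1024))   # 1563 requests, not 100000
```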

Key-Based Partitioning

If you set a message key (e.g., user_id), Kafka hashes it to determine the partition: partition = hash(key) % num_partitions. All messages with the same key go to the same partition, guaranteeing order for that key. This is how you ensure all events for a given user/order are processed in sequence.
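The routing rule is easy to demonstrate (one assumption: Python's zlib.crc32 stands in for the Java client's murmur2 hash; the principle, hash(key) % num_partitions, is the same).

```python
import zlib

# Key-based partitioning: same key always hashes to the same partition.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

NUM_PARTITIONS = 3

# Every event for user-42 lands on one partition, preserving its order.
p1 = partition_for("user-42", NUM_PARTITIONS)
p2 = partition_for("user-42", NUM_PARTITIONS)
print(p1 == p2)   # True
```

Note the corollary from the formula: changing num_partitions changes the result of the modulo, which is why growing a topic reshuffles keyed data.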

Consumer Guarantees

At-Most-Once

Commit offset before processing. If processing fails, the message is skipped (lost). Simple but risky. Acceptable for non-critical analytics.

At-Least-Once

Process then commit offset. If consumer crashes after processing but before commit, the message is re-processed on restart. Most common pattern โ€” requires idempotent consumers.
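The difference commit ordering makes can be shown with a toy simulation (assumptions: a crash is modeled as an early return, and this is not the real consumer API). Committing after processing yields at-least-once: a crash between the two steps causes a replay, never a loss.

```python
# Process-then-commit loop with an injectable crash point.

def run_consumer(messages, committed, processed, crash_before_commit_at=None):
    offset = committed["offset"]
    while offset < len(messages):
        processed.append(messages[offset])    # 1. process
        if offset == crash_before_commit_at:
            return                            # crash before committing
        committed["offset"] = offset + 1      # 2. then commit
        offset += 1

msgs = ["a", "b", "c"]
committed, seen = {"offset": 0}, []
run_consumer(msgs, committed, seen, crash_before_commit_at=1)  # crash on "b"
run_consumer(msgs, committed, seen)                            # restart
print(seen)   # ['a', 'b', 'b', 'c'] — "b" processed twice, nothing lost
```

Swapping steps 1 and 2 gives at-most-once: the crash would skip "b" instead of replaying it.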

Exactly-Once (EOS)

Uses transactional producer + consumer in a read-process-write loop. isolation.level=read_committed. Works for Kafka-to-Kafka pipelines. For external sinks, you still need idempotent writes.

Auto vs Manual Commit

enable.auto.commit=true commits every 5s by default (auto.commit.interval.ms) — easy, but on a crash you can both lose messages (offsets committed before processing finished) and re-process others. Manual commit (commitSync() / commitAsync()) gives you control over when exactly to mark progress.

Consumer Rebalancing

When consumers join/leave a group (or crash), a rebalance redistributes partitions. During rebalance, the group stops consuming โ€” a "stop-the-world" event. Kafka 2.4+ introduced incremental cooperative rebalancing to minimize this pause: only affected partitions are revoked, not all of them.

Consumer Lag

Lag = latest offset โˆ’ consumer's committed offset. High lag means the consumer can't keep up. Monitor via kafka-consumer-groups.sh --describe or tools like Burrow. If lag grows unboundedly, you need more consumers (up to partition count) or need to optimize processing.
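The formula as code (the offsets here are hard-coded; in practice they come from a monitoring query such as kafka-consumer-groups.sh):

```python
# Consumer lag: how far behind the consumer is on one partition.

def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    return latest_offset - committed_offset

lag = consumer_lag(latest_offset=10_500, committed_offset=10_000)
print(lag)   # 500 — under a ~1,000-message threshold, likely healthy
```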

Replication & Fault Tolerance

๐Ÿ“‹

Replication Factor: Each partition has N replicas spread across different brokers. Standard production setup: replication.factor=3. One replica is the leader (handles all reads/writes), the rest are followers that replicate from the leader.

โœ…

In-Sync Replicas (ISR): Replicas that are caught up to the leader within replica.lag.time.max.ms (default 30s). Only ISR members can become the new leader. If a follower falls behind, it's removed from ISR. The leader only considers a write "committed" when all ISR members have it.

๐Ÿ›ก๏ธ

min.insync.replicas: Minimum ISR count required for a write to succeed. With replication.factor=3 and min.insync.replicas=2, you can tolerate 1 broker failure without data loss and without halting writes. If ISR drops below this, producers get NotEnoughReplicasException.
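The accept/reject rule reduces to a single comparison (a simplified sketch; a real broker tracks ISR per partition and surfaces the failure as NotEnoughReplicasException to acks=all producers).

```python
# Write admission rule for acks=all under min.insync.replicas.

def write_allowed(isr_count: int, min_insync_replicas: int) -> bool:
    return isr_count >= min_insync_replicas

# RF=3, min.insync.replicas=2: one broker down, writes still succeed.
print(write_allowed(isr_count=2, min_insync_replicas=2))   # True
# Two brokers down: producers with acks=all are rejected.
print(write_allowed(isr_count=1, min_insync_replicas=2))   # False
```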

๐Ÿ‘‘

Leader Election: When the leader broker goes down, the controller (a designated broker) picks a new leader from the ISR. This happens within seconds.

โš ๏ธ

Unclean Leader Election: If all ISR members are down, Kafka can either wait (availability loss) or elect a non-ISR replica (unclean.leader.election.enable=true). The tradeoff: availability vs potential data loss. Default is false โ€” correctness over availability.

Key Design Decisions

Kafka vs RabbitMQ vs SQS

Choose Kafka When

You need high throughput (100K+ msg/s), log replay, event sourcing, stream processing, multiple consumers reading the same data, or long retention. Kafka excels at "event backbone" use cases.

Choose RabbitMQ / SQS When

You need traditional task queues, complex routing (fanout/topic exchanges), message-level acknowledgment, low latency for small volumes, or simple request-reply patterns. SQS for zero-ops on AWS.

Partition Count

Partitions determine max consumer parallelism. You can increase the partition count but never decrease it, and note that increasing it also reshuffles the hash(key) % num_partitions mapping, breaking per-key ordering for existing keyed data. Start with num.partitions = expected_throughput / partition_throughput. Rule of thumb: 10–30 partitions per topic for most workloads. Don't go over ~10K partitions per broker, since metadata overhead grows.
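The sizing rule above as arithmetic (the throughput numbers are illustrative assumptions, not benchmarks):

```python
import math

# Partition count estimate: target topic throughput divided by what a
# single partition can sustain, rounded up.
def suggested_partitions(target_mb_s: float, per_partition_mb_s: float) -> int:
    return math.ceil(target_mb_s / per_partition_mb_s)

# e.g. 200 MB/s target at ~10 MB/s per partition
print(suggested_partitions(200, 10))   # 20
```

Many teams round this up further to leave headroom for growth, since adding partitions later disturbs keyed ordering.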

Retention vs Compaction

Time/Size Retention

retention.ms (default 7 days) or retention.bytes. Old segments are deleted. Good for event streams where you only care about recent data.

Log Compaction

Keeps the latest value for each key. Background thread removes older records with same key. Perfect for changelog/CDC topics where you want current state forever (e.g., user profiles).
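A toy version of compaction (simplified by assumption: real compaction runs per segment in the background and also honors tombstones, i.e. null values that delete a key entirely).

```python
# Keep only the latest record per key, preserving offset order of survivors.

def compact(log):
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)        # later records overwrite earlier
    return sorted(latest.values())           # ordered by surviving offset

log = [
    ("user-1", "alice@old.com"),
    ("user-2", "bob@x.com"),
    ("user-1", "alice@new.com"),   # supersedes offset 0
]
print(compact(log))   # [(1, 'bob@x.com'), (2, 'alice@new.com')]
```

A consumer reading the compacted topic from the beginning still recovers the full current state — exactly the changelog/CDC use case described above.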

Schema Evolution

Use Apache Avro + Schema Registry for typed, evolvable messages. Schemas are registered centrally and referenced by ID in messages (saves bandwidth vs embedding schema). Supports backward/forward/full compatibility modes. Prevents producers from breaking consumers with incompatible changes.

When to Use

  • Event Sourcing โ€” Store every state change as an immutable event. Replay to rebuild state. Kafka's retention + immutability is a natural fit.
  • Change Data Capture (CDC) โ€” Stream database changes (inserts/updates/deletes) to downstream systems via Debezium + Kafka Connect. Compacted topics keep the latest state.
  • Stream Processing โ€” Real-time transformations with Kafka Streams or Flink. Join streams, window aggregations, exactly-once processing.
  • Log Aggregation โ€” Collect logs from hundreds of services into Kafka, then route to Elasticsearch, S3, or data warehouses. Replaces fragile syslog pipelines.
  • Microservices Decoupling โ€” Services publish events instead of calling each other. Reduces temporal coupling. A service can be down and catch up later.

Interview recognition signals: If the interviewer mentions "event-driven," "stream processing," "real-time pipeline," "decoupling services," or "replay events" โ€” Kafka is almost certainly the expected answer. Show you understand why it's a commit log and not just another queue.

Real-World Examples

  • LinkedIn โ€” Created Kafka. Processes 7+ trillion messages/day across 100+ clusters. Powers activity feeds, metrics, CDC, and ML pipelines.
  • Uber โ€” Runs trillions of messages/day across multiple Kafka deployments. Powers real-time pricing, rider-driver matching events, and trip analytics.
  • Netflix โ€” Uses Kafka for real-time event processing, data pipeline ingestion, and stream processing. Handles billions of events/day for recommendations and A/B testing.
  • Spotify โ€” Kafka powers event delivery for user interactions, ad serving, and internal analytics. Core infrastructure for their data platform.
  • Stripe โ€” Uses Kafka for exactly-once payment event processing, audit logs, and cross-service event propagation. Financial-grade reliability requirements.

Back-of-Envelope Numbers

Single broker throughput (write): ~200 MB/s (sequential disk)
Single broker throughput (messages): ~200K–800K msg/s (1 KB msgs)
Max partitions per broker (recommended): ~4,000–10,000
Max message size (default): 1 MB (configurable)
Replication overhead (RF=3): ~2× additional storage & network
Leader election time: ~5–30 seconds
End-to-end latency (99th %ile): ~2–10 ms (acks=1)
Consumer lag (healthy threshold): < 1,000 messages
Default retention: 7 days (168 hours)
LinkedIn daily volume: 7+ trillion messages

๐Ÿงช Try It Yourself

Learn by doing โ€” download the Kafka Lab, a ready-to-run playground with a 3-broker cluster, Kotlin client, and 7 guided experiments covering everything on this page.

๐Ÿงช Open Kafka Lab