โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram โ€” Word Count Example

INPUT → MAP → SHUFFLE → REDUCE → OUTPUT

  INPUT:    Split 1 "hello world" · Split 2 "hello foo" · Split 3 "world foo bar"
  MAP:      Mapper 1 → (hello,1) (world,1)
            Mapper 2 → (hello,1) (foo,1)
            Mapper 3 → (world,1) (foo,1) (bar,1)
  SHUFFLE:  hello [1, 1] · world [1, 1] · foo [1, 1] · bar [1]
  REDUCE:   Reducer 1 sum([1,1]) · Reducer 2 sum([1,1]) · Reducer 3 sum([1,1]) · Reducer 4 sum([1])
  OUTPUT:   hello 2 · world 2 · foo 2 · bar 1

Three Phases of MapReduce:
  MAP: Transform each record → (key, value) pairs
  SHUFFLE: Group by key across all mappers
  REDUCE: Aggregate values for each key

How It Works

MapReduce processes large datasets by splitting work into three phases across a cluster of machines. The framework handles distribution, fault tolerance, and data movement โ€” you just write the Map and Reduce functions.

The Three Phases

  1. Map โ€” Input is split into chunks, each processed by a mapper in parallel. Each mapper transforms records into (key, value) pairs. Example: for word count, each word becomes ("word", 1).
  2. Shuffle & Sort โ€” The framework groups all values by key across all mappers. All pairs with the same key are sent to the same reducer. This is the most network-intensive phase.
  3. Reduce โ€” Each reducer receives a key and all its values, then aggregates them. Example: sum all the 1s for each word to get final counts.

Batch vs Stream Processing

Batch Processing (MapReduce / Spark)

Process bounded datasets in bulk. Latency: minutes to hours. Good for: analytics, ETL, building search indexes, ML training. Spark does this in-memory (10-100ร— faster than Hadoop MapReduce).

Stream Processing (Flink / Kafka Streams)

Process unbounded data in real-time as it arrives. Latency: ms to seconds. Good for: fraud detection, real-time dashboards, event-driven workflows. Windowing: tumbling, sliding, session.
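A tumbling window, the simplest of the three window types, assigns each event to exactly one fixed-size, non-overlapping bucket. A minimal sketch with hypothetical timestamped events:

```python
from collections import defaultdict

# Hypothetical (timestamp_seconds, key) events from an unbounded stream
events = [(1, "a"), (3, "b"), (7, "a"), (11, "a"), (14, "b")]

WINDOW = 5  # tumbling window size in seconds: [0,5), [5,10), [10,15), ...

counts = defaultdict(lambda: defaultdict(int))
for ts, key in events:
    window_start = (ts // WINDOW) * WINDOW  # each event falls in exactly one window
    counts[window_start][key] += 1

for start in sorted(counts):
    print(start, dict(counts[start]))
# 0 {'a': 1, 'b': 1}
# 5 {'a': 1}
# 10 {'a': 1, 'b': 1}
```

Sliding windows overlap (an event can belong to several), and session windows close after a gap of inactivity; both need more bookkeeping than this bucket assignment.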

Lambda vs Kappa Architecture

  • Lambda โ€” Batch layer (accurate, slow) + speed layer (approximate, fast) + serving layer. Complex: two codebases for the same logic.
  • Kappa โ€” Everything is a stream. Batch = replaying the stream from the beginning. Simpler. Kafka enables this with long retention. Preferred in modern systems.

Key Design Decisions

โšก

Batch vs Stream: Streaming gives real-time results but is complex (exactly-once, late data, state management). Batch is simpler, cheaper, and accurate โ€” but delayed. Use both: streaming for real-time metrics, batch for accuracy-critical analytics. Batch is still right for 80% of analytics workloads.

๐Ÿ’พ

Hadoop vs Spark: Hadoop MapReduce writes intermediate results to disk between stages โ€” reliable but slow. Spark keeps data in memory โ€” 10-100ร— faster. Use Spark for iterative workloads (ML, graph). Hadoop for extremely large datasets that don't fit in cluster memory.

๐Ÿ”„

Exactly-once in streaming: Possible with Flink/Kafka Streams but requires careful design (transactions, idempotent sinks). Once you call an external API, true exactly-once is impossible โ€” use idempotent consumers instead of chasing the perfect guarantee.
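The idempotent-consumer pattern is simple to sketch: record which event IDs have already been applied, and skip duplicates, so at-least-once delivery behaves like exactly-once. The account/amount fields here are illustrative:

```python
processed_ids = set()  # in production this would be durable, e.g. a DB table
balance = {"acct": 0}

def handle(event):
    if event["id"] in processed_ids:  # duplicate delivery: skip the side effect
        return
    balance[event["acct"]] += event["amount"]
    processed_ids.add(event["id"])

# At-least-once delivery may redeliver event 1
for ev in [{"id": 1, "acct": "acct", "amount": 50},
           {"id": 1, "acct": "acct", "amount": 50},  # duplicate
           {"id": 2, "acct": "acct", "amount": 25}]:
    handle(ev)

print(balance)  # {'acct': 75}
```

The crucial detail in a real system is that the dedupe check and the side effect must commit atomically (same transaction), or a crash between them reintroduces the duplicate.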

โฐ

Late data handling: Real events arrive out of order. Watermarks define "how late is too late." Events before the watermark are processed normally; events after are dropped or sent to a side output. Choosing the watermark delay is a tradeoff between completeness and latency.
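A watermark can be sketched as "max event time seen so far, minus an allowed lag." Events whose timestamp falls behind the watermark are routed to a side output. The timestamps and lag below are illustrative:

```python
DELAY = 5  # watermark lag in seconds: completeness vs latency tradeoff

max_event_time = 0
on_time, late = [], []

for ts in [10, 12, 11, 20, 9, 18]:  # out-of-order event timestamps
    max_event_time = max(max_event_time, ts)
    watermark = max_event_time - DELAY
    if ts >= watermark:
        on_time.append(ts)
    else:
        late.append(ts)  # dropped, or routed to a side output

print(on_time, late)  # [10, 12, 11, 20, 18] [9]
```

Note that the event at t=9 is only "late" because t=20 had already advanced the watermark to 15; with a larger DELAY it would have been processed normally, at the cost of holding windows open longer.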

When to Use

MapReduce / stream processing appears in interviews about data-intensive systems. Know when to pick batch vs stream.

  • "How do you process terabytes of log data?" โ€” Batch MapReduce (Spark) for daily aggregation
  • "Design a real-time trending topics system" โ€” Stream processing with sliding windows on hashtag counts
  • "Build a recommendation engine" โ€” Batch for training (Spark ML), stream for real-time signals
  • "Count word frequencies across petabytes" โ€” Classic MapReduce: Map emits (word, 1), Reduce sums
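The trending-topics scenario above can be sketched as a sliding window over recent hashtag events: keep the last WINDOW seconds in a deque, evict what slides out, and emit the top K. All numbers and tags here are illustrative:

```python
from collections import Counter, deque

WINDOW = 60  # sliding window length in seconds
K = 2        # how many trending tags to emit

events = deque()   # (timestamp, hashtag), oldest first
counts = Counter()

def observe(ts, tag):
    events.append((ts, tag))
    counts[tag] += 1
    # evict events that have slid out of the window
    while events and events[0][0] <= ts - WINDOW:
        _, old_tag = events.popleft()
        counts[old_tag] -= 1
        if counts[old_tag] == 0:
            del counts[old_tag]

for ts, tag in [(0, "#a"), (10, "#b"), (20, "#a"), (70, "#c"), (75, "#a")]:
    observe(ts, tag)

print(counts.most_common(K))  # at t=75: [('#a', 2), ('#c', 1)]
```

A real Kafka Streams job would partition this by hashtag and merge per-partition top-K lists, but the window/evict/count loop is the same idea.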

Interview signal: Explain MapReduce as three concrete phases with your scenario's data. Then explain why you'd use Spark over Hadoop. Mention Kappa architecture as the modern alternative.

Real-World Examples

  • Spotify Wrapped โ€” Full-year replay job processes 365 days of listening data (~430TB) on 2,000 Spark executors in ~4 hours. Daily batch jobs process 6B events/day for per-user analytics.
  • Google PageRank โ€” Original MapReduce use case. Iterative computation over the entire web graph. Modern equivalent: Spark GraphX or Pregel.
  • Twitter Trending Topics โ€” Kafka Streams processes ~500K tweets/minute with sliding windows. Counts hashtags, emits top-K every 5 seconds.
  • Uber Real-Time Surge Pricing โ€” Apache Flink processes ride request events in real-time. Computes supply/demand ratio per geographic cell per minute to set dynamic pricing.

Back-of-Envelope Numbers

Metric                             Value
Spotify daily events               6B events/day (~1.2 TB)
Spotify Wrapped yearly job         ~430 TB, ~4 hours on 2K executors
Spark executor throughput          ~50-200 MB/s per core (in-memory)
Hadoop MapReduce throughput        ~10-50 MB/s per core (disk I/O bound)
Kafka Streams latency              ~1-10 ms per event
Flink exactly-once overhead        ~5-10% throughput reduction
Shuffle network bandwidth          Bottleneck: often 10 Gbps NIC limit
Spark vs Hadoop (iterative jobs)   10-100× faster (in-memory)
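The Spotify Wrapped row can be cross-checked with quick arithmetic, using only the figures in the table:

```python
# Inputs from the table above
total_bytes = 430e12   # ~430 TB
duration_s = 4 * 3600  # ~4 hours
executors = 2000

cluster_throughput = total_bytes / duration_s  # bytes/s across the cluster
per_executor = cluster_throughput / executors  # bytes/s per executor

print(f"cluster: {cluster_throughput / 1e9:.1f} GB/s")    # ~29.9 GB/s
print(f"per executor: {per_executor / 1e6:.1f} MB/s")     # ~14.9 MB/s
```

That ~15 MB/s is per executor, not per core, so it sits comfortably under the in-memory per-core range in the table once shuffle and I/O overheads are accounted for. This is the kind of sanity check worth doing out loud in an interview.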