โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram โ€” Word Count Example

INPUT → MAP → SHUFFLE → REDUCE → OUTPUT

  INPUT:    Split 1 "hello world" · Split 2 "hello foo" · Split 3 "world foo bar"
  MAP:      Mapper 1 → (hello,1) (world,1)
            Mapper 2 → (hello,1) (foo,1)
            Mapper 3 → (world,1) (foo,1) (bar,1)
  SHUFFLE:  hello [1, 1] · world [1, 1] · foo [1, 1] · bar [1]
  REDUCE:   Reducer 1 sum([1,1]) · Reducer 2 sum([1,1]) · Reducer 3 sum([1,1]) · Reducer 4 sum([1])
  OUTPUT:   hello 2 · world 2 · foo 2 · bar 1

Three Phases of MapReduce:
  MAP: Transform each record → (key, value) pairs
  SHUFFLE: Group by key across all mappers
  REDUCE: Aggregate values for each key

How It Works

MapReduce processes large datasets by splitting work into three phases across a cluster of machines. The framework handles distribution, fault tolerance, and data movement โ€” you just write the Map and Reduce functions.

The Three Phases

  1. Map โ€” Input is split into chunks, each processed by a mapper in parallel. Each mapper transforms records into (key, value) pairs. Example: for word count, each word becomes ("word", 1).
  2. Shuffle & Sort โ€” The framework groups all values by key across all mappers. All pairs with the same key are sent to the same reducer. This is the most network-intensive phase.
  3. Reduce โ€” Each reducer receives a key and all its values, then aggregates them. Example: sum all the 1s for each word to get final counts.

Batch vs Stream Processing

Batch Processing (MapReduce / Spark)

Process bounded datasets in bulk. Latency: minutes to hours. Good for: analytics, ETL, building search indexes, ML training. Spark does this in-memory (10-100ร— faster than Hadoop MapReduce).

Stream Processing (Flink / Kafka Streams)

Process unbounded data in real-time as it arrives. Latency: ms to seconds. Good for: fraud detection, real-time dashboards, event-driven workflows. Windowing: tumbling, sliding, session.
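A tumbling window, the simplest of the three window types, assigns each event to exactly one fixed-size, non-overlapping bucket. A minimal sketch with hypothetical timestamped events:

```python
from collections import defaultdict

# Hypothetical (timestamp_seconds, key) events from an unbounded stream
events = [(1, "a"), (3, "b"), (7, "a"), (11, "a"), (14, "b")]

WINDOW = 5  # tumbling window size in seconds: [0,5), [5,10), [10,15), ...

counts = defaultdict(lambda: defaultdict(int))
for ts, key in events:
    window_start = (ts // WINDOW) * WINDOW  # each event falls in exactly one window
    counts[window_start][key] += 1

for start in sorted(counts):
    print(start, dict(counts[start]))
# 0 {'a': 1, 'b': 1}
# 5 {'a': 1}
# 10 {'a': 1, 'b': 1}
```

Sliding windows overlap (an event can belong to several), and session windows close after a gap of inactivity; both need more bookkeeping than this bucket assignment.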

Lambda vs Kappa Architecture

  • Lambda โ€” Batch layer (accurate, slow) + speed layer (approximate, fast) + serving layer. Complex: two codebases for the same logic.
  • Kappa โ€” Everything is a stream. Batch = replaying the stream from the beginning. Simpler. Kafka enables this with long retention. Preferred in modern systems.

Key Design Decisions

โšก

Batch vs Stream: Streaming gives real-time results but is complex (exactly-once, late data, state management). Batch is simpler, cheaper, and accurate โ€” but delayed. Use both: streaming for real-time metrics, batch for accuracy-critical analytics. Batch is still right for 80% of analytics workloads.

๐Ÿ’พ

Hadoop vs Spark: Hadoop MapReduce writes intermediate results to disk between stages โ€” reliable but slow. Spark keeps data in memory โ€” 10-100ร— faster. Use Spark for iterative workloads (ML, graph). Hadoop for extremely large datasets that don't fit in cluster memory.

๐Ÿ”„

Exactly-once in streaming: Possible with Flink/Kafka Streams but requires careful design (transactions, idempotent sinks). Once you call an external API, true exactly-once is impossible โ€” use idempotent consumers instead of chasing the perfect guarantee.
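The idempotent-consumer pattern is simple to sketch: record which event IDs have already been applied, and skip duplicates, so at-least-once delivery behaves like exactly-once. The account/amount fields here are illustrative:

```python
processed_ids = set()  # in production this would be durable, e.g. a DB table
balance = {"acct": 0}

def handle(event):
    if event["id"] in processed_ids:  # duplicate delivery: skip the side effect
        return
    balance[event["acct"]] += event["amount"]
    processed_ids.add(event["id"])

# At-least-once delivery may redeliver event 1
for ev in [{"id": 1, "acct": "acct", "amount": 50},
           {"id": 1, "acct": "acct", "amount": 50},  # duplicate
           {"id": 2, "acct": "acct", "amount": 25}]:
    handle(ev)

print(balance)  # {'acct': 75}
```

The crucial detail in a real system is that the dedupe check and the side effect must commit atomically (same transaction), or a crash between them reintroduces the duplicate.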

โฐ

Late data handling: Real events arrive out of order. Watermarks define "how late is too late." Events before the watermark are processed normally; events after are dropped or sent to a side output. Choosing the watermark delay is a tradeoff between completeness and latency.
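A watermark can be sketched as "max event time seen so far, minus an allowed lag." Events whose timestamp falls behind the watermark are routed to a side output. The timestamps and lag below are illustrative:

```python
DELAY = 5  # watermark lag in seconds: completeness vs latency tradeoff

max_event_time = 0
on_time, late = [], []

for ts in [10, 12, 11, 20, 9, 18]:  # out-of-order event timestamps
    max_event_time = max(max_event_time, ts)
    watermark = max_event_time - DELAY
    if ts >= watermark:
        on_time.append(ts)
    else:
        late.append(ts)  # dropped, or routed to a side output

print(on_time, late)  # [10, 12, 11, 20, 18] [9]
```

Note that the event at t=9 is only "late" because t=20 had already advanced the watermark to 15; with a larger DELAY it would have been processed normally, at the cost of holding windows open longer.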

When to Use

MapReduce / stream processing appears in interviews about data-intensive systems. Know when to pick batch vs stream.

  • "How do you process terabytes of log data?" โ€” Batch MapReduce (Spark) for daily aggregation
  • "Design a real-time trending topics system" โ€” Stream processing with sliding windows on hashtag counts
  • "Build a recommendation engine" โ€” Batch for training (Spark ML), stream for real-time signals
  • "Count word frequencies across petabytes" โ€” Classic MapReduce: Map emits (word, 1), Reduce sums
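The trending-topics scenario above can be sketched as a sliding window over recent hashtag events: keep the last WINDOW seconds in a deque, evict what slides out, and emit the top K. All numbers and tags here are illustrative:

```python
from collections import Counter, deque

WINDOW = 60  # sliding window length in seconds
K = 2        # how many trending tags to emit

events = deque()   # (timestamp, hashtag), oldest first
counts = Counter()

def observe(ts, tag):
    events.append((ts, tag))
    counts[tag] += 1
    # evict events that have slid out of the window
    while events and events[0][0] <= ts - WINDOW:
        _, old_tag = events.popleft()
        counts[old_tag] -= 1
        if counts[old_tag] == 0:
            del counts[old_tag]

for ts, tag in [(0, "#a"), (10, "#b"), (20, "#a"), (70, "#c"), (75, "#a")]:
    observe(ts, tag)

print(counts.most_common(K))  # at t=75: [('#a', 2), ('#c', 1)]
```

A real Kafka Streams job would partition this by hashtag and merge per-partition top-K lists, but the window/evict/count loop is the same idea.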

Interview signal: Explain MapReduce as three concrete phases with your scenario's data. Then explain why you'd use Spark over Hadoop. Mention Kappa architecture as the modern alternative.

Real-World Examples

  • Spotify Wrapped โ€” Full-year replay job processes 365 days of listening data (~430TB) on 2,000 Spark executors in ~4 hours. Daily batch jobs process 6B events/day for per-user analytics.
  • Google PageRank โ€” Original MapReduce use case. Iterative computation over the entire web graph. Modern equivalent: Spark GraphX or Pregel.
  • Twitter Trending Topics โ€” Kafka Streams processes ~500K tweets/minute with sliding windows. Counts hashtags, emits top-K every 5 seconds.
  • Uber Real-Time Surge Pricing โ€” Apache Flink processes ride request events in real-time. Computes supply/demand ratio per geographic cell per minute to set dynamic pricing.

Back-of-Envelope Numbers

Metric                             Value
Spotify daily events               6B events/day (~1.2 TB)
Spotify Wrapped yearly job         ~430 TB, ~4 hours on 2K executors
Spark executor throughput          ~50-200 MB/s per core (in-memory)
Hadoop MapReduce throughput        ~10-50 MB/s per core (disk I/O bound)
Kafka Streams latency              ~1-10 ms per event
Flink exactly-once overhead        ~5-10% throughput reduction
Shuffle network bandwidth          Bottleneck: often 10 Gbps NIC limit
Spark vs Hadoop (iterative jobs)   10-100× faster (in-memory)
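The Spotify Wrapped row can be cross-checked with quick arithmetic, using only the figures in the table:

```python
# Inputs from the table above
total_bytes = 430e12   # ~430 TB
duration_s = 4 * 3600  # ~4 hours
executors = 2000

cluster_throughput = total_bytes / duration_s  # bytes/s across the cluster
per_executor = cluster_throughput / executors  # bytes/s per executor

print(f"cluster: {cluster_throughput / 1e9:.1f} GB/s")    # ~29.9 GB/s
print(f"per executor: {per_executor / 1e6:.1f} MB/s")     # ~14.9 MB/s
```

That ~15 MB/s is per executor, not per core, so it sits comfortably under the in-memory per-core range in the table once shuffle and I/O overheads are accounted for. This is the kind of sanity check worth doing out loud in an interview.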