Processing massive datasets across distributed systems: batch processing (MapReduce) handles bounded datasets in bulk; stream processing handles unbounded data in real time.
MapReduce processes large datasets by splitting work into three phases (Map, Shuffle, Reduce) across a cluster of machines. The framework handles distribution, fault tolerance, and data movement; you just write the Map and Reduce functions.
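A minimal single-machine sketch of the three phases, using the classic word-count example (in a real cluster the framework runs Map tasks on input splits and Shuffle moves data over the network; here everything is in-process):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a ("word", 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle: group values by key so each reducer sees all values for one key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's grouped values (here, sum the counts).
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts["the"] == 2, counts["fox"] == 1
```

The Shuffle step is the expensive one in practice: it is the phase that moves data across the network between Map and Reduce workers.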
("word", 1).Process bounded datasets in bulk. Latency: minutes to hours. Good for: analytics, ETL, building search indexes, ML training. Spark does this in-memory (10-100ร faster than Hadoop MapReduce).
Stream processing: Process unbounded data in real-time as it arrives. Latency: ms to seconds. Good for: fraud detection, real-time dashboards, event-driven workflows. Windowing: tumbling (fixed, non-overlapping), sliding (overlapping), session (gap-based).
Batch vs Stream: Streaming gives real-time results but is complex (exactly-once, late data, state management). Batch is simpler, cheaper, and accurate, but delayed. Use both: streaming for real-time metrics, batch for accuracy-critical analytics. Batch is still right for 80% of analytics workloads.
Hadoop vs Spark: Hadoop MapReduce writes intermediate results to disk between stages, which is reliable but slow. Spark keeps data in memory, making it 10-100× faster. Use Spark for iterative workloads (ML, graph). Hadoop for extremely large datasets that don't fit in cluster memory.
Exactly-once in streaming: Possible with Flink/Kafka Streams but requires careful design (transactions, idempotent sinks). Once you call an external API, true exactly-once is impossible โ use idempotent consumers instead of chasing the perfect guarantee.
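A minimal sketch of the idempotent-consumer pattern: deduplicate on a stable event ID so that at-least-once redelivery has the same net effect as exactly-once. The `charge_card` side effect and the in-memory `processed_ids` set are illustrative assumptions (in production the dedup set would live in a durable store such as a database table, updated in the same transaction as the side effect where possible):

```python
processed_ids = set()  # illustrative: production would use a durable store
charges = []           # stands in for an external side effect's result

def charge_card(event):
    # Hypothetical external side effect (e.g. a payment API call).
    charges.append(event["id"])

def handle(event):
    # Idempotent consumer: skip events we have already processed, so a
    # redelivered message does not repeat the side effect.
    if event["id"] in processed_ids:
        return False  # duplicate delivery, no side effect
    charge_card(event)
    processed_ids.add(event["id"])
    return True

handle({"id": "evt-1"})   # first delivery: side effect runs
handle({"id": "evt-1"})   # redelivery: deduplicated, side effect skipped
```

This is why idempotency is usually the pragmatic answer: it makes duplicates harmless rather than trying to prevent them end-to-end.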
Late data handling: Real events arrive out of order. Watermarks define "how late is too late." Events before the watermark are processed normally; events after are dropped or sent to a side output. Choosing the watermark delay is a tradeoff between completeness and latency.
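A minimal sketch of watermark-based lateness handling under a simple heuristic (watermark = max event time seen so far minus a fixed allowed lateness; real systems like Flink generate watermarks per source and propagate them through the dataflow). Events behind the watermark go to a side output instead of being processed normally:

```python
def process_with_watermark(events, allowed_lateness_ms):
    # events: iterable of (event_time_ms, value) in ARRIVAL order.
    # Watermark heuristic: max event time observed minus allowed lateness.
    watermark = float("-inf")
    on_time, side_output = [], []
    for ts, value in events:
        watermark = max(watermark, ts - allowed_lateness_ms)
        if ts >= watermark:
            on_time.append((ts, value))   # processed normally
        else:
            side_output.append((ts, value))  # too late: dropped or diverted
    return on_time, side_output

# "c" carries an old timestamp but arrives after much newer events.
events = [(1000, "a"), (5000, "b"), (1500, "c")]
on_time, late = process_with_watermark(events, 2000)
# "c" (event time 1500) is behind the watermark (3000) and goes to the side output.
```

Raising `allowed_lateness_ms` catches more stragglers but delays window results; lowering it emits results sooner at the cost of dropping more late data, which is exactly the completeness-vs-latency tradeoff described above.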
MapReduce / stream processing appears in interviews about data-intensive systems. Know when to pick batch vs stream.
Interview signal: Explain MapReduce as three concrete phases with your scenario's data. Then explain why you'd use Spark over Hadoop. Mention Kappa architecture as the modern alternative.
| Metric | Value |
|---|---|
| Spotify daily events | 6B events/day (~1.2 TB) |
| Spotify Wrapped yearly job | ~430 TB, ~4 hours on 2K executors |
| Spark executor throughput | ~50-200 MB/s per core (in-memory) |
| Hadoop MapReduce throughput | ~10-50 MB/s per core (disk I/O bound) |
| Kafka Streams latency | ~1-10 ms per event |
| Flink exactly-once overhead | ~5-10% throughput reduction |
| Shuffle network bandwidth | Bottleneck: often 10 Gbps NIC limit |
| Spark vs Hadoop for iterative jobs | 10-100× faster (in-memory) |