Reduce storage costs and network bandwidth at scale. Lossless vs lossy, columnar formats, data tiering: the techniques that make petabyte-scale systems affordable.
Data compression reduces the size of data for storage and transfer. The key distinction is whether the original data can be perfectly reconstructed.
Lossless: the exact original is recoverable. Algorithms: gzip (general purpose), Snappy/LZ4 (fast, lower ratio), Zstandard (strong balance of ratio and speed). Used for: logs, database pages, API responses, source code. Typical ratio: 2:1–10:1.
Lossy: some data is permanently lost. Algorithms: JPEG (images), H.264/H.265/AV1 (video), Opus/AAC (audio). Used for: media content where perceptual quality matters more than bit-perfect accuracy. Typical ratio: 10:1–50:1.
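The difference between the two families can be seen in a few lines. The sketch below is illustrative only: it uses Python's standard `zlib` module for the lossless case and a hypothetical `lossy_quantize` helper (simple rounding, not a real media codec) to stand in for the lossy case.

```python
import zlib


def lossless_roundtrip(data: bytes) -> bytes:
    """Compress with DEFLATE (zlib), then decompress again."""
    return zlib.decompress(zlib.compress(data))


def lossy_quantize(samples: list[float], step: float) -> list[float]:
    """Snap each sample to the nearest multiple of `step`, discarding detail.

    This is a toy stand-in for lossy coding, not how JPEG or Opus work.
    """
    return [round(s / step) * step for s in samples]


# Lossless: a repetitive log payload shrinks and comes back byte for byte.
log_data = b"GET /api/users 200 12ms\n" * 1000
restored = lossless_roundtrip(log_data)
print(restored == log_data)  # True
print(len(log_data), "->", len(zlib.compress(log_data)), "bytes")

# Lossy: 0.49 and 0.51 collapse to the same value, so the originals are gone.
samples = [0.12, 0.49, 0.51, 0.88]
quantized = lossy_quantize(samples, step=0.25)
print(quantized)  # [0.0, 0.5, 0.5, 1.0]
```

Once two different inputs map to the same stored value, no decoder can tell them apart, which is why lossy formats are reserved for content judged by perception rather than by exact bytes.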
The right algorithm depends on your bottleneck: CPU-bound systems need fast compression (LZ4, Snappy), storage-bound systems need high ratios (Zstandard, gzip), and media systems need perceptual quality optimization (AV1, JPEG XL).
Compression ratio vs speed: LZ4 compresses at roughly 780 MB/s but reaches only about a 2:1 ratio. Zstandard achieves 3:1–5:1 at around 500 MB/s. gzip gets 4:1–6:1 but at only about 50 MB/s. Pick based on whether you're CPU-bound or storage-bound.
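One way to check numbers like these against your own data is a small benchmark. The sketch below is a rough illustration, assuming only Python's standard library: LZ4 and Zstandard are not used here, so `zlib` at levels 1 and 9 and `lzma` stand in for the fast and high-ratio ends of the spectrum. The `measure` helper and the log-like payload are hypothetical.

```python
import lzma
import time
import zlib


def measure(name, compress, data):
    """Return (name, compression ratio, throughput in MB/s) for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(packed)
    throughput = len(data) / 1e6 / elapsed if elapsed > 0 else float("inf")
    return name, ratio, throughput


# Hypothetical log-like payload; real results depend heavily on the data.
data = b"".join(
    b"2024-01-01T00:00:%02d INFO request id=%d status=200\n" % (i % 60, i)
    for i in range(20_000)
)

codecs = [
    ("zlib level 1", lambda d: zlib.compress(d, 1)),  # fast, lower ratio
    ("zlib level 9", lambda d: zlib.compress(d, 9)),  # slower, higher ratio
    ("lzma", lzma.compress),                          # slowest of the three
]

results = [measure(name, fn, data) for name, fn in codecs]
for name, ratio, mbps in results:
    print(f"{name:14s} ratio {ratio:5.1f}:1  {mbps:8.1f} MB/s")
```

Absolute throughput varies with hardware and input, so treat the output as a relative comparison rather than a match for the figures quoted above.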
Encode once vs encode fast: For video, H.264 encodes quickly but produces larger files. AV1 produces files about 40% smaller but takes roughly 10× longer to encode. Solution: encode with the fast codec first for immediate availability, then re-encode with the better codec in the background for storage savings.
Row vs columnar storage: Row storage (MySQL, PostgreSQL) is optimal for OLTP: reading and writing complete rows. Columnar storage (Parquet, BigQuery) is optimal for analytics: scanning specific columns across millions of rows. Avoid columnar storage for OLTP, where every single-row write has to touch each column separately.
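The scanning advantage can be illustrated with plain byte buffers. The sketch below is a simplified model, not how Parquet or any database lays out data on disk: the three-field schema, the `orders` data, and both helper functions are assumptions made up for the example.

```python
import struct

# Hypothetical schema: (order_id: int64, amount: float64, country: 2-byte code)
ROW = struct.Struct("<qd2s")  # 18 bytes per record, no padding

orders = [(i, 10.0 + i % 7, (b"US", b"DE", b"JP")[i % 3]) for i in range(1000)]
n = len(orders)

# Row layout: all fields of one record sit next to each other.
row_store = b"".join(ROW.pack(*o) for o in orders)

# Columnar layout: all values of one field sit next to each other.
col_store = {
    "order_id": struct.pack(f"<{n}q", *(o[0] for o in orders)),
    "amount": struct.pack(f"<{n}d", *(o[1] for o in orders)),
    "country": b"".join(o[2] for o in orders),
}


def sum_amount_rows(buf):
    """Sum one field from the row layout; every record must be read."""
    total = sum(amount for _, amount, _ in ROW.iter_unpack(buf))
    return total, len(buf)


def sum_amount_columns(cols):
    """Sum one field from the columnar layout; only that column is read."""
    buf = cols["amount"]
    total = sum(struct.unpack(f"<{len(buf) // 8}d", buf))
    return total, len(buf)


row_total, row_bytes = sum_amount_rows(row_store)
col_total, col_bytes = sum_amount_columns(col_store)
print(row_total == col_total)  # True: same answer either way
print(row_bytes, col_bytes)    # 18000 8000: bytes scanned per layout
```

The gap widens as tables get wider: with 100 columns, an aggregate over one of them still reads a single column in the columnar layout. The reverse holds for fetching or updating one complete record, which is a single contiguous slice in `row_store` but one access per column in `col_store`.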
Compute cost vs storage cost: Compression trades CPU cycles for storage savings. At cloud prices, whether that trade pays off depends on how long the data lives and how often it is rewritten or read. At petabyte scale, even small compression improvements can save millions.
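The trade can be estimated with back-of-the-envelope arithmetic. The sketch below is an assumption-heavy illustration: the storage price and gzip throughput are the figures quoted in this guide, while the CPU price of $0.05 per core-hour and both function names are hypothetical placeholders to replace with your own numbers.

```python
def monthly_storage_savings(raw_gb, ratio, price_per_gb_month):
    """Dollars saved per month by storing raw_gb compressed at ratio:1."""
    stored_gb = raw_gb / ratio
    return (raw_gb - stored_gb) * price_per_gb_month


def one_time_cpu_cost(raw_gb, throughput_mb_s, cpu_price_per_hour):
    """Dollars of single-core CPU time needed to compress raw_gb once."""
    seconds = raw_gb * 1000 / throughput_mb_s  # 1 GB treated as 1000 MB
    return seconds / 3600 * cpu_price_per_hour


raw_gb = 1_000_000  # 1 PB of raw data

savings = monthly_storage_savings(raw_gb, ratio=4, price_per_gb_month=0.023)
cpu = one_time_cpu_cost(raw_gb, throughput_mb_s=50, cpu_price_per_hour=0.05)

print(f"storage saved per month: ${savings:,.2f}")  # $17,250.00
print(f"one-time compression CPU: ${cpu:,.2f}")     # $277.78
```

This counts a single compression pass only. Data that is rewritten often, or decompressed on every read, pays the CPU cost repeatedly, which is where the balance can shift.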
Compression is relevant in almost every system design. Mention it at the right layer to show systems thinking.
Interview signal: Showing you think about compression at every layer (media, wire protocol, storage format) demonstrates mature systems thinking.
| Metric | Value |
|---|---|
| LZ4 compression speed | ~780 MB/s (ratio 2:1) |
| Zstandard compression speed | ~500 MB/s (ratio 3:1–5:1) |
| gzip compression speed | ~50 MB/s (ratio 4:1–6:1) |
| Parquet vs CSV (typical) | 5:1–10:1 size reduction |
| Protobuf vs JSON | ~10:1 size reduction |
| AV1 vs H.264 (video) | 30–50% smaller at same quality |
| S3 Standard vs Glacier | $0.023 vs $0.004/GB/month (~6×) |
| HTTP gzip on JSON APIs | ~5:1 compression ratio |
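
The storage prices in the table make the case for data tiering easy to estimate. The sketch below is a rough model under stated assumptions: it uses the two per-GB prices from the table as defaults, ignores retrieval and request fees, and the `tiered_monthly_cost` function and the 10% hot fraction are hypothetical.

```python
def tiered_monthly_cost(total_gb, hot_fraction, hot_price=0.023, cold_price=0.004):
    """Monthly storage cost when hot_fraction of the data stays in the hot tier.

    Prices are dollars per GB per month; retrieval fees are not modeled.
    """
    hot_gb = total_gb * hot_fraction
    cold_gb = total_gb - hot_gb
    return hot_gb * hot_price + cold_gb * cold_price


total_gb = 1_000_000  # 1 PB

all_hot = tiered_monthly_cost(total_gb, hot_fraction=1.0)
tiered = tiered_monthly_cost(total_gb, hot_fraction=0.1)

print(f"everything in the hot tier: ${all_hot:,.2f}/month")  # $23,000.00/month
print(f"10% hot, 90% cold:          ${tiered:,.2f}/month")   # $5,900.00/month
```

Cold tiers usually charge for retrieval and impose access delays, so the saving only holds for data that really is read rarely.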