โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram

[Diagram] Compression pipeline and storage tiering:

  • Raw data (10 GB) → compression (gzip/zstd, H.264/AV1, Parquet/ORC)
  • Lossless path: 3 GB (3:1), 100% recoverable
  • Lossy path: 0.5 GB (20:1), some data lost
  • Storage (S3/HDFS) + network transfer → decompression
  • Data tiering: 🔥 Hot (SSD), 0-7 days, $0.08/GB · 🌡️ Warm (HDD), 7-90 days, $0.02/GB · 🧊 Cold (Glacier), 90+ days, $0.004/GB

How It Works

Data compression reduces the size of data for storage and transfer. The key distinction is whether the original data can be perfectly reconstructed.

Lossless vs Lossy Compression

Lossless Compression

Exact original recoverable. Algorithms: gzip (general purpose), Snappy/LZ4 (fast, lower ratio), Zstandard (best balance of ratio and speed). Used for: logs, database pages, API responses, source code. Typical ratio: 2:1–10:1.
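The "exact original recoverable" property is easy to demonstrate with Python's standard-library gzip module (the exact ratio below depends on the input, so no specific number is claimed):

```python
import gzip

# Repetitive log-like data: lossless codecs exploit exactly this redundancy.
logs = b"2024-01-01 INFO request handled in 12ms\n" * 1000

compressed = gzip.compress(logs)
restored = gzip.decompress(compressed)

# Lossless: the original is recovered bit-for-bit.
assert restored == logs
print(f"ratio: {len(logs) / len(compressed):.1f}:1")
```

The same round-trip guarantee is what makes lossless codecs safe for database pages and source code, where a single flipped bit is unacceptable.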

Lossy Compression

Some data permanently lost. Algorithms: JPEG (images), H.264/H.265/AV1 (video), Opus/AAC (audio). Used for: media content where perceptual quality matters more than bit-perfect accuracy. Ratio: 10:1–50:1.

Storage Optimization Patterns

  1. Columnar storage (Parquet, ORC): Store each column separately. Similar values compress better together. A 1TB CSV becomes ~100-200GB Parquet. Used in: BigQuery, Redshift, Snowflake.
  2. Data tiering: Hot data on SSD, warm on HDD, cold on object storage (S3 Glacier). Automatic lifecycle policies move data as it ages.
  3. Deduplication: Detect and eliminate duplicate chunks of data. Content-addressable storage — hash each chunk, store unique chunks only. Used in backup systems.
  4. Protocol optimization: Protobuf over JSON (10× smaller). gzip on API responses (5:1 ratio). Binary formats over text.
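The deduplication pattern above can be sketched with content-addressable storage in a few lines. This is a simplified illustration using fixed-size chunks (real backup systems typically use content-defined chunking so inserts don't shift every boundary); the function names are made up for the example:

```python
import hashlib

def store_chunks(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks; store each unique chunk once,
    keyed by its SHA-256 digest. Returns the list of digests (a 'recipe')."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # duplicate chunks stored only once
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Reassemble the original bytes from the recipe."""
    return b"".join(store[d] for d in recipe)

store = {}
backup = b"A" * 8192 + b"B" * 4096          # three chunks, only two unique
recipe = store_chunks(backup, store)
assert restore(recipe, store) == backup
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

Here three chunks are referenced but only two are physically stored; across many backups of mostly-identical data, this is where the large savings come from.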

Compression Algorithm Selection

The right algorithm depends on your bottleneck: CPU-bound systems need fast compression (LZ4, Snappy), storage-bound systems need high ratios (Zstandard, gzip), and media systems need perceptual quality optimization (AV1, JPEG XL).
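The speed-versus-ratio trade-off can be observed directly with the stdlib zlib module by varying the compression level (level 1 approximates the "fast" end, level 9 the "high ratio" end; absolute numbers vary by machine and input, so treat this as a sketch):

```python
import random
import time
import zlib

random.seed(0)
# Semi-compressible payload: a small vocabulary of repeated tokens.
data = b" ".join(
    random.choice([b"GET", b"POST", b"/api/v1/users", b"200", b"404"])
    for _ in range(200_000)
)

for level in (1, 9):   # 1 = fastest, 9 = best ratio
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: ratio {len(data)/len(out):.2f}:1 in {dt*1000:.0f} ms")
```

Level 9 yields a smaller output but takes longer, which is the same dial you turn when choosing between LZ4/Snappy and Zstandard/gzip.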

Key Design Decisions

⚡ Compression ratio vs speed: LZ4 compresses at ~780 MB/s but only about 2:1; Zstandard achieves 3:1-5:1 at ~500 MB/s; gzip gets 4:1-6:1 but at ~50 MB/s. Pick based on whether you're CPU-bound or storage-bound.

🎬 Encode once vs encode fast: For video, H.264 encodes quickly but produces larger files; AV1 produces ~40% smaller files but takes ~10× longer to encode. Solution: run a fast codec first for immediate availability, then a better codec in the background for storage savings.

📊 Row vs columnar storage: Row storage (MySQL, PostgreSQL) is optimal for OLTP — reading and writing complete rows. Columnar storage (Parquet, BigQuery) is optimal for analytics — scanning specific columns across millions of rows. Never use columnar for OLTP.
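Why similar values compress better together can be shown with a toy experiment: serialize the same records row-wise (CSV-style) and column-wise (Parquet-style) and compress both with zlib. This is an illustration of the layout effect only, not how Parquet actually encodes data:

```python
import random
import zlib

random.seed(42)
countries = [b"US", b"DE", b"JP"]
rows = [
    (str(i).encode(), random.choice(countries), f"{random.random():.4f}".encode())
    for i in range(50_000)
]

# Row-oriented: fields of each record interleaved, as in a CSV.
row_major = b"\n".join(b",".join(r) for r in rows)
# Column-oriented: each column stored contiguously, as in Parquet.
col_major = b"\n".join(b",".join(col) for col in zip(*rows))

print("row-major  :", len(zlib.compress(row_major)), "bytes compressed")
print("col-major  :", len(zlib.compress(col_major)), "bytes compressed")
```

The column-oriented layout compresses better because the low-cardinality country column and the regularly structured id column sit in contiguous runs instead of being interrupted by random float digits on every record.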

💰 Compute cost vs storage cost: Compression trades CPU cycles for storage savings. At cloud prices, saving storage is usually cheaper than the CPU spent compressing. And at petabyte scale, even small compression improvements save millions.
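A quick back-of-envelope makes this concrete. The storage price and throughput below come from the figures in this guide; the per-core-hour price is a hypothetical round number for illustration:

```python
# Is compressing 1 PB of warm data worth the one-time CPU cost?
data_gb = 1_000_000              # 1 PB
ratio = 3                        # Zstandard-class lossless ratio
hdd_price = 0.02                 # $/GB/month (warm tier, from the tiering figure)

# Recurring saving: two thirds of the data no longer needs to be stored.
monthly_saving = data_gb * (1 - 1 / ratio) * hdd_price

# One-time compression cost at ~500 MB/s per core, assuming $0.04/core-hour.
core_hours = data_gb * 1000 / 500 / 3600   # GB -> MB, / throughput, / s per hour
compress_cost = core_hours * 0.04

print(f"save ${monthly_saving:,.0f}/month for a one-time ${compress_cost:,.0f}")
```

The recurring saving dwarfs the one-time CPU spend, which is why compression is the default for anything stored longer than a few days.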

When to Use

Compression is relevant in almost every system design — mention it at the right layer to show systems thinking.

  • "How do you store petabytes of log data cost-effectively?" โ€” Columnar format + Zstandard + tiering (hot/warm/cold)
  • "How do you optimize storage for analytics?" โ€” Parquet with dictionary encoding, 4:1โ€“10:1 compression
  • "Design a video streaming platform" โ€” Transcoding pipeline: H.264 first, AV1 in background for 40% savings
  • "How do you reduce API payload size?" โ€” gzip/Brotli on HTTP responses, Protobuf over JSON

Interview signal: Showing you think about compression at every layer (media, wire protocol, storage format) demonstrates mature systems thinking.

Real-World Examples

  • YouTube — transcodes a 10 GB raw 4K upload into 6 resolutions × 3 codecs (H.264, VP9, AV1). AV1 saves 30-50% over H.264 at the same quality. Total per video: ~4 GB across all variants.
  • Facebook (Zstandard) — created zstd to compress warehouse data: a better ratio than gzip at 3-5× the speed. Now used in the Linux kernel, FreeBSD, and most modern systems.
  • Apache Parquet โ€” Used by Spark, BigQuery, Snowflake. Columnar format compresses similar values together. Every modern data warehouse uses Parquet or ORC under the hood.
  • Netflix โ€” Predictive push: ML predicts popular content per region, pre-encodes to AV1 overnight. Top 10% of titles serve 90% of views.

Back-of-Envelope Numbers

| Metric | Value |
|---|---|
| LZ4 compression speed | ~780 MB/s (ratio 2:1) |
| Zstandard compression speed | ~500 MB/s (ratio 3:1–5:1) |
| gzip compression speed | ~50 MB/s (ratio 4:1–6:1) |
| Parquet vs CSV (typical) | 5:1–10:1 size reduction |
| Protobuf vs JSON | ~10:1 size reduction |
| AV1 vs H.264 (video) | 30–50% smaller at same quality |
| S3 Standard vs Glacier | $0.023 vs $0.004/GB/month (~6×) |