โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram

[Diagram: a 16-bit Bloom filter bit array with bits 1, 3, 4, 6, 8, 10, and 13 set. Checking "apple" with 3 hash functions (h₁→1, h₂→4, h₃→8): all three bits are set, so the answer is "probably yes" ✓. Checking "grape" (h₁→3, h₂→5, h₃→11): bit 5 is not set, so the answer is "definitely no" ✕. Key properties: false negatives are impossible (if an element was added, all its bits are definitely set); false positives are possible (other elements may have set the same bits by coincidence).]

How It Works

A Bloom filter is a bit array of m bits (all initially 0) and k independent hash functions. To add an element, hash it with all k functions and set the corresponding bits to 1. To check membership, hash the element and verify all k bits are set. If any bit is 0, the element is definitely not in the set. If all bits are 1, it's probably in the set (false positive possible).
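The add/check cycle above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class and method names are my own, and SHA-256 is used only because it ships with the standard library (real filters use fast non-cryptographic hashes and a packed bit array).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k derived hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m              # number of bits
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _positions(self, item: str):
        # Derive k bit positions by salting a single base hash with the index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any unset bit means "definitely not"; all set means "probably yes".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, k=3)
bf.add("apple")
print(bf.might_contain("apple"))   # True: no false negatives, ever
```

Note that `might_contain` short-circuits on the first unset bit, which is why misses (the common case in most deployments) are especially cheap.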

Why False Positives Happen

Different elements may hash to the same bit positions. As the filter fills up, more bits are set to 1, increasing the probability that a non-member's bits are all coincidentally set. The false positive rate is: p โ‰ˆ (1 - e^(-kn/m))^k where n = number of elements, m = bits, k = hash functions.
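The formula is easy to evaluate directly; a quick sketch (the specific n, m, and k below are illustrative):

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """p ≈ (1 - e^(-k*n/m))^k for n elements, m bits, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# 1M elements in a 9.6M-bit filter with 7 hashes lands at the classic ~1% point
print(f"{false_positive_rate(n=1_000_000, m=9_600_000, k=7):.4f}")   # → 0.0100
```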

Other Probabilistic Structures

HyperLogLog

Approximate count of distinct elements (cardinality estimation). Uses ~12KB memory to count billions of unique items with ~0.8% standard error. Used by Redis for unique visitor counting.

Count-Min Sketch

Approximate frequency counting in a data stream. Like a 2D Bloom filter: hash each item to one counter per row and increment. A query returns the minimum across rows. Overcounts slightly, never undercounts.
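A minimal sketch of the row/column mechanics (class name, sizes, and the SHA-256-based hashing are illustrative choices, not a production implementation):

```python
import hashlib

class CountMinSketch:
    """depth rows × width counters; the estimate is the minimum across rows."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Each row uses its own hash, so collisions differ from row to row.
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the min over rows never undercounts.
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch(width=1000, depth=4)
for word in ["a", "b", "a", "a"]:
    cms.add(word)
print(cms.estimate("a"))   # at least 3: may overcount, never undercounts
```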

Optimal Parameters

  • Optimal k (hash functions) = (m/n) × ln(2), which works out to k ≈ log₂(1/p): typically 7-13 for common FP rates
  • Bits per element for 1% FP rate: ~9.6 bits/element
  • Bits per element for 0.1% FP rate: ~14.4 bits/element
  • Bits per element for 0.01% FP rate: ~19.2 bits/element
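These figures follow from the standard sizing formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2. A small helper to reproduce them (the 5M-element example is illustrative):

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Optimal (bits m, hash count k) for n elements at target FP rate p:
    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln(2)."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_parameters(n=5_000_000, p=0.01)
print(f"{m / 5_000_000:.2f} bits/element, k = {k}")   # → 9.59 bits/element, k = 7
```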

Key Design Decisions

📏 False positive rate vs memory: 1% FP needs ~10 bits/element (~6MB for 5M items); 0.01% FP needs ~20 bits/element (~12MB). Tune based on the cost of a false positive: if it triggers an expensive DB lookup, lower the FP rate.

🗑️ Deletion support: Standard Bloom filters can't delete (unsetting a bit may affect other elements). Use a Counting Bloom Filter (counters instead of bits) or a Cuckoo Filter (supports deletion plus better space efficiency at low FP rates).
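A rough sketch of the counting variant (illustrative only; real implementations typically pack small 4-bit counters to save space):

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter variant with per-slot counters so removal is possible."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only remove items that were actually added, otherwise counters
        # underflow and the filter starts returning false negatives.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("apple")
cbf.add("banana")
cbf.remove("apple")
print(cbf.might_contain("banana"))   # True
```

The price of deletion: each slot is now a counter rather than a bit, which multiplies the memory footprint (4x with 4-bit counters).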

📈 Scalable Bloom Filters: If you don't know the set size upfront, use a Scalable Bloom Filter: a chain of increasingly larger filters. New elements go into the latest filter; membership checks query every filter in the chain, which adds slight overhead.
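One way the chain might look. The stage sizing and doubling factor below are arbitrary choices for illustration; real scalable filters also tighten each successive stage's FP rate so the overall rate stays bounded.

```python
import hashlib

class _Stage:
    """One fixed-size Bloom filter in the chain."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m
        self.count = 0          # elements added to this stage

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1
        self.count += 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

class ScalableBloomFilter:
    """Chain of stages; when the newest stage fills up, open a bigger one."""

    def __init__(self, stage_capacity: int = 100, k: int = 7):
        self.stage_capacity = stage_capacity
        self.k = k
        self.stages = [_Stage(m=10 * stage_capacity, k=k)]

    def add(self, item: str) -> None:
        if self.stages[-1].count >= self.stage_capacity:
            self.stage_capacity *= 2        # each new stage doubles capacity
            self.stages.append(_Stage(m=10 * self.stage_capacity, k=self.k))
        self.stages[-1].add(item)

    def might_contain(self, item: str) -> bool:
        # An element may live in any stage, so every stage must be checked.
        return any(stage.might_contain(item) for stage in self.stages)
```

Adding 500 items with `stage_capacity=100` grows the chain to three stages, and lookups remain free of false negatives because every stage is consulted.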

⚡ Hash function choice: Use fast, uniform hash functions such as MurmurHash3, xxHash, or FNV; you don't need cryptographic hashes (SHA/MD5). The "double hashing" trick computes only h1(x) and h2(x), then derives all k hashes as hi(x) = h1(x) + i × h2(x).
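A sketch of the trick, often attributed to Kirsch and Mitzenmacher. SHA-256 stands in for a fast non-cryptographic hash purely for convenience; splitting one digest into h1 and h2 is my own shortcut for the demo.

```python
import hashlib

def double_hash_positions(item: str, k: int, m: int) -> list[int]:
    """Derive k bit positions from just two base hashes:
    h_i(x) = (h1(x) + i * h2(x)) mod m."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1   # odd stride avoids degenerate cycles for power-of-two m
    return [(h1 + i * h2) % m for i in range(k)]

print(double_hash_positions("apple", k=3, m=16))   # three indices in [0, 16)
```

This turns k hash computations per operation into two, which matters when k is 7-13 and lookups are on the hot path.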

When to Use

Bloom filters shine when most lookups are expected to be misses and you need to avoid expensive operations for those misses.

  • "Design a web crawler": check whether a URL was already visited. 1B URLs × 10 bits = ~1.2GB (vs ~64GB for a hash set)
  • "How do you speed up database reads?": Bloom filter on disk blocks; skip blocks that definitely don't contain the key
  • "Design a spam filter": check whether a sender is in the known-spammer set without loading millions of records
  • "How do you avoid expensive joins?": pre-filter with a Bloom filter and only join rows that probably match

Interview signal: Explain the asymmetry: false negatives would be catastrophic (missing a malicious URL) but false positives are cheap (just one extra API check). This shows you understand why the tradeoff works.

Real-World Examples

  • Google Chrome Safe Browsing: ships a ~30MB Bloom filter of known malicious URLs (~5M entries). Checks locally before loading any URL; a false positive triggers an API call to confirm, which happens ~0.01% of the time.
  • Google Bigtable / Apache HBase: Bloom filter on SSTable/HFile blocks. Before reading a block from disk, check the Bloom filter; this avoids ~99% of unnecessary disk reads for absent keys.
  • Apache Cassandra: each SSTable has a Bloom filter. A read query checks the filters to skip SSTables that definitely don't contain the requested partition key.
  • Redis: built-in HyperLogLog commands (PFADD / PFCOUNT). ~12KB of memory to count billions of unique elements with ~0.8% error.

Back-of-Envelope Numbers

Metric                                      Value
Hash set: 1B URLs × 64 bytes                ~64 GB
Bloom filter: 1B URLs × 10 bits (1% FP)     ~1.2 GB
Memory savings vs hash set                  ~50× less
Chrome Safe Browsing filter size            ~30 MB (5M URLs)
Optimal hash functions (1% FP)              k = 7
Optimal hash functions (0.01% FP)           k ≈ 13
HyperLogLog memory for billions of items    ~12 KB
Lookup time                                 O(k) hash computations, ~ns