Data structures that speed up reads at the cost of slower writes and extra storage. Choosing the right index type and column order is crucial for system performance.
Medium High FrequencyWithout an index, the database must scan every row in the table (full table scan) to find matching records. An index is an auxiliary data structure that lets the database jump directly to the relevant rows โ like a book's index lets you find a topic without reading every page.
Balanced tree with O(log n) lookup. Internal nodes store keys for navigation; leaf nodes store data pointers and are linked for efficient range scans. Handles both equality and range queries. 4 levels covers ~1B rows.
Writes go to in-memory memtable โ flush to sorted SSTables on disk โ background compaction. All writes are sequential (fast). Reads may check multiple levels. Used in Cassandra, RocksDB, LevelDB.
O(1) equality lookups using a hash table. Cannot do range queries or sorting. Used in memory stores (Redis, Memcached). Some DBs support it (PostgreSQL hash index) but B-tree is usually better.
Maps terms โ list of documents containing them. Essential for full-text search. "database" โ [doc1, doc5, doc99]. Elasticsearch / Lucene. Also used for tag-based lookups.
For queries with multiple conditions, the column order in a composite index matters enormously. Follow the ESR rule:
WHERE city = 'NYC')ORDER BY created_at DESC)WHERE age > 25)Why? A range condition breaks the index ordering for subsequent columns. Putting sort before range lets the database read results in order without an extra sort step.
B-tree vs LSM Tree: B-tree is read-optimized with predictable latency โ best for OLTP (web apps, transactions). LSM Tree is write-optimized with sequential writes โ best for write-heavy workloads (time series, logs, IoT). The tradeoff: LSM reads are slower (must check multiple levels) and compaction uses CPU/disk.
Composite index column order: INDEX(city, price, rating) seems logical but breaks the ESR rule if price is a range condition. INDEX(city, rating, price) is better โ equality first, then sort, then range. The database can scan the index in order and filter price on the fly.
Over-indexing: Each index slows writes โ every INSERT/UPDATE/DELETE must update all indexes. A table with 10 indexes means 10 extra B-tree writes per row change. Index what you query, not everything.
Selectivity: Indexing a boolean column (2 values) is nearly useless โ the index matches 50% of rows, so the DB does a table scan anyway. Index columns with high cardinality (many distinct values). Exceptions: partial indexes on rare values (WHERE status = 'failed').
Indexing comes up in almost every system design interview that involves a database. It's often the first optimization to discuss before caching or sharding.
Interview signal: Never just say "add an index." Specify the exact columns and their order, then explain WHY using the ESR rule. This separates senior from junior answers.
(city, type, available_date, rating, price) following ESR. Narrows 7M listings to ~2K candidates in ~5ms instead of scanning all rows.| Metric | Value |
|---|---|
| B+ tree depth for 10M rows | 3โ4 levels |
| B+ tree depth for 1B rows | 4โ5 levels |
| B+ tree branching factor | ~200โ500 (page size dependent) |
| Index lookup (B+ tree, in memory) | ~0.1โ1 ms |
| Full table scan (10M rows) | ~10โ30 seconds |
| Index size per entry | ~50โ100 bytes |
| Cassandra write throughput (LSM) | 1M+ writes/sec |
| Write overhead per additional index | ~1 extra B-tree insert per row write |