Split data across multiple database instances to handle more data and traffic than a single machine can manage. The most critical decision: choosing the right shard key.
Medium Very High FrequencySharding splits data across multiple database instances (shards) so no single machine bears the full data volume or write load. Each shard has the same schema but holds a different subset of the data, determined by a shard key.
Split rows across databases by a shard key (e.g., user_id % N). Each shard has the same schema, different data. This is what people mean by "sharding."
Split columns โ move large or rarely-used columns to a separate table/database (e.g., user profile vs user activity log). Reduces row size for hot data.
The shard key determines which shard holds each row. A good key gives even data distribution AND keeps most queries on a single shard.
hash(user_id) % N. Even distribution but cross-shard range queries are impossible. Most common.Even with a "good" shard key, celebrity users or viral content can overwhelm a single shard. When user_id=2 has 100M followers and posts frequently, shard 2 gets 80% of read traffic. Fixes: sub-sharding hot keys, caching hot data, or routing celebrity reads to dedicated replicas.
Shard key: user_id vs entity_id: user_id keeps all of a user's data on one shard (fast profile page) but creates hotspots for celebrity users. entity_id (e.g., photo_id) distributes evenly but makes "show all my photos" a scatter-gather across all shards. Instagram chose user_id + special handling for hot users.
Resharding strategy: Adding shards means rebalancing data. With key % N, changing N remaps almost everything. Use consistent hashing or logical shards (4096 logical โ N physical) to minimize data movement. Plan resharding from day one.
ID generation: Auto-increment IDs break across shards. Use Snowflake IDs (timestamp + machine + sequence), ULIDs, or embed shard hints in the ID itself (Instagram's approach: timestamp + shard_id + sequence).
Cross-shard queries: JOINs across shards require scatter-gather โ query all shards, merge results. Expensive and slow. Alternatives: denormalize data, use a separate search index (Elasticsearch), or redesign the shard key to keep related data together.
Sharding adds massive complexity. Only shard when simpler solutions (read replicas, caching, vertical scaling) aren't enough.
Interview signal: The interviewer wants to see you discuss shard key selection FIRST, then address cross-shard queries and resharding. Mention what queries become expensive and how you'd handle them.
| Metric | Value |
|---|---|
| Instagram photo metadata (2B+ photos) | ~1TB total, ~500 bytes/row |
| Growth rate (100M new photos/day) | ~50GB/day new metadata |
| Per-shard data (12 physical servers) | ~83GB/server (fits in memory) |
| Writes per shard | ~96 writes/sec (100M/day รท 12 servers) |
| Resharding trigger | Server exceeds 500GB or 5K writes/sec |
| Cross-shard query overhead | ~10โ100ร single-shard query |
| Logical shards (Instagram approach) | 4096 logical โ 12 physical (room to grow) |