Distributed ID Generation — System Design Pattern

Architecture Diagram — Snowflake ID

How It Works

In distributed systems, auto-increment IDs from a single database don't work — they create a bottleneck and single point of failure. You need IDs that are unique across all nodes, roughly sortable by time, and generated without coordination.

Snowflake ID Generation

API server needs an ID → calls local Snowflake generator (embedded or sidecar service)
Generator composes 64-bit ID: [1-bit unused][41-bit timestamp][10-bit machine ID][12-bit sequence]
Timestamp = current_time_ms - custom_epoch. Twitter uses 2010-11-04 as epoch; 41 bits of milliseconds → about 69.7 years of IDs counted from that date
Machine ID is pre-assigned per server (from ZooKeeper, config, or Kubernetes pod ordinal)
Sequence increments within the same millisecond (0–4095). Resets each millisecond
If 4096 IDs exhausted in one ms: wait until next millisecond (extremely rare per machine)
ID generated locally, no network call → <0.1ms per ID
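
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Twitter's implementation: the class name, the injectable clock parameter, and the choice to raise on a backward clock are assumptions made for the example. The epoch constant is the millisecond value usually quoted for Twitter's 2010-11-04 epoch.

```python
import threading
import time

TWITTER_EPOCH_MS = 1288834974657  # 2010-11-04 UTC (assumed exact value)

TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12

MAX_MACHINE_ID = (1 << MACHINE_BITS) - 1  # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1   # 4095


class SnowflakeGenerator:
    """Illustrative generator; names here are hypothetical."""

    def __init__(self, machine_id, epoch_ms=TWITTER_EPOCH_MS, clock=None):
        if not 0 <= machine_id <= MAX_MACHINE_ID:
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.epoch_ms = epoch_ms
        # clock returns wall-clock milliseconds; injectable so tests can fake it
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backward; refusing to generate")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 sequence values used: spin until the next ms.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now

            ts = now - self.epoch_ms
            if not 0 <= ts < (1 << TIMESTAMP_BITS):
                raise OverflowError("timestamp does not fit in 41 bits")
            return (
                (ts << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Usage: create one generator per process, for example SnowflakeGenerator(machine_id=5), and call next_id() wherever an ID is needed. The lock keeps the sequence counter correct when several threads share the generator.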

ID Generation Approaches

Snowflake (Twitter)

64-bit, time-sortable, compact. 4,096 IDs/ms/machine. Needs machine ID coordination. Clock skew risk. Used by Twitter, Discord, Instagram (variant).

UUID v4

128-bit, of which 122 bits are random. Universally unique, zero coordination. But: not sortable, 36 chars as string, and random insert positions cause B-tree page splits → poor index performance.

ULID

128-bit = 48-bit timestamp + 80-bit random. Sortable, lexicographic ordering, Crockford Base32 encoded. String-friendly. Good UUID replacement when you need sortability.
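
A rough sketch of the ULID layout described above: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford Base32 characters. The function names are made up for this example, and the sketch does not implement the optional monotonic mode from the ULID spec, so two IDs created in the same millisecond sort in random order relative to each other.

```python
import os
import time

# Crockford Base32 alphabet: digits plus uppercase letters without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms, randomness):
    """Pack a 48-bit timestamp and 10 random bytes into a 26-char string."""
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 chars x 5 bits = 130 bits; the top 2 are zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid():
    """Generate a ULID-style string from the current time and OS randomness."""
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))
```

Because the timestamp occupies the most significant bits and the alphabet is in ASCII order, plain string comparison orders these IDs by creation millisecond.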

Database Ticket Server

Central server allocates ID blocks (e.g., Server A gets 1–1000, B gets 1001–2000). Simple, but the ticket server is a coordination point and needs its own redundancy. Flickr used two MySQL auto-increment servers, one issuing odd IDs and the other even.
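
Both variants can be sketched with in-memory stand-ins. In the code below, TicketServer plays the role of the central allocator (in practice a database row or service), BlockClient is an app server that only contacts it once per block, and interleaved_ids imitates the odd/even auto-increment setup. All three names are hypothetical.

```python
import itertools
import threading


class TicketServer:
    """In-memory stand-in for the central block allocator."""

    def __init__(self, block_size=1000, start=1):
        self._next = start
        self._block_size = block_size
        self._lock = threading.Lock()

    def allocate_block(self):
        # Hand out the next contiguous range, e.g. 1-1000, then 1001-2000.
        with self._lock:
            first = self._next
            self._next += self._block_size
            return range(first, first + self._block_size)


class BlockClient:
    """App server that draws IDs locally and refills when a block runs out."""

    def __init__(self, server):
        self._server = server
        self._ids = iter(())  # empty until the first block is fetched

    def next_id(self):
        try:
            return next(self._ids)
        except StopIteration:
            # The only point where the central server is contacted.
            self._ids = iter(self._server.allocate_block())
            return next(self._ids)


def interleaved_ids(offset, increment=2):
    """Odd/even style: offset=1 yields 1, 3, 5, ...; offset=2 yields 2, 4, 6, ..."""
    return itertools.count(offset, increment)
```

The block size trades coordination frequency against gaps: a larger block means fewer trips to the ticket server, but more IDs are lost if a client restarts with a partly used block.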

Key Design Decisions

Snowflake vs UUID: Snowflake is 64-bit (8 bytes) and time-sortable, so new rows land at the right-hand edge of a B-tree index. UUID v4 is 128-bit (16 bytes) and random, so inserts scatter across the index and fragment it. For databases with B-tree indexes, Snowflake keys are half the size and avoid most page splits. Use UUID only when you need zero coordination and don't care about sortability.

Clock skew risk: Snowflake assumes each machine's clock never moves backward. If a server's clock jumps backward (for example, after an NTP correction), it could generate duplicate or out-of-order IDs. Mitigations: wait until the clock catches up, use a monotonic clock, or add sequence bits. Twitter's Snowflake refuses to generate IDs if the clock moves backward.
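
One way to combine the "wait" and "refuse" mitigations is to tolerate a small backward jump by sleeping it off and to fail loudly on a large one. The helper below is a sketch under that assumption: the function name, the max_wait_ms threshold of 5 ms, and the injectable clock and sleep parameters are all invented for illustration.

```python
import time


def now_ms():
    """Wall-clock time in milliseconds."""
    return time.time_ns() // 1_000_000


def timestamp_after_skew(last_ms, clock=now_ms, max_wait_ms=5, sleep=time.sleep):
    """Return a timestamp >= last_ms, or raise if the clock is too far behind.

    last_ms is the timestamp used for the most recently issued ID.
    """
    now = clock()
    if now >= last_ms:
        return now  # normal case: no skew

    drift = last_ms - now
    if drift > max_wait_ms:
        # Large rollback: refuse rather than risk duplicate IDs.
        raise RuntimeError(f"clock is {drift} ms behind the last issued ID")

    # Small rollback: wait for the clock to catch up, then re-check.
    sleep(drift / 1000)
    now = clock()
    while now < last_ms:
        now = clock()
    return now
```

A generator would call this in place of reading the clock directly and build the ID from the returned value, so that no ID is ever issued with a timestamp older than the previous one.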

Embedded vs service: A separate Snowflake service adds one network hop (~1ms) but centralizes machine ID management. Embedding the generator in the app removes the network hop but requires a machine ID assignment mechanism. Many companies embed it now (Instagram's PL/pgSQL function, Discord's in-process generator).

ID as security risk: Sequential IDs reveal volume. A competitor can infer your order count by creating two orders days apart and comparing the IDs. Snowflake IDs are not strictly sequential, but they still expose creation time and machine ID. For public-facing IDs (order numbers, invoice IDs), use obfuscated or random-looking IDs. Keep Snowflake IDs for internal use.

When to Use

Distributed ID generation is usually a sub-component of a larger system design, not the main problem. But getting it right matters for performance and correctness.

"Design a URL shortener" — Need globally unique short codes. Snowflake ID → Base62 encode
"Design Twitter / Instagram" — Every post needs a unique, time-sortable ID across all servers
"Design a distributed database" — Sharded tables need IDs that don't collide across shards
"How do you generate 10K IDs/sec across 50 servers?" — Snowflake: each server generates independently, machine ID ensures uniqueness

Interview signal: Sketch the 64-bit layout on the whiteboard and calculate the limits. This shows you understand the design constraints, not just the name.
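
For the URL shortener case listed above, the "Snowflake ID → Base62 encode" step might look like the sketch below. The alphabet order (digits, then uppercase, then lowercase) is one common convention and is an assumption here; any fixed 62-character ordering works as long as encode and decode agree.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(n):
    """Encode a non-negative integer as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def base62_decode(s):
    """Inverse of base62_encode."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

A full 63-bit Snowflake value needs up to 11 Base62 characters, so the resulting codes are compact but not minimal. A shortener that wants shorter codes would need a smaller ID space.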

Real-World Examples

Twitter Snowflake — Created to generate ~10K unique tweet IDs per second per server. Every tweet ID (like 1234567890123456789) is a Snowflake ID. Open-sourced in 2010, now the industry standard pattern.
Instagram sharded IDs — Similar scheme: 41 bits timestamp + 13 bits shard ID + 10 bits auto-increment. Each Postgres shard generates its own IDs independently using a PL/pgSQL function. No external service needed.
Discord Snowflakes — Discord uses Snowflake IDs for messages, users, channels, guilds. The timestamp component lets them efficiently query "messages in this channel after time T" using the ID as a time-based filter.
Sony's Sonyflake — Variant optimized for a longer lifespan and more machines: 39-bit timestamp (in 10ms units, ~174 years) + 8-bit sequence + 16-bit machine ID (65,536 machines). Tradeoff: lower per-machine throughput (256 IDs per 10ms = 25.6K/sec).
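
The time-based filter in the Discord example relies on the timestamp sitting in the high bits of the ID. A sketch of the conversion, assuming Discord's documented layout (timestamp in the bits above the low 22) and its 2015-01-01 UTC epoch; the function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)
LOW_BITS = 22  # worker, process, and increment bits below the timestamp


def snowflake_floor(when):
    """Smallest possible ID for the given timezone-aware datetime."""
    elapsed_ms = (when - DISCORD_EPOCH) // timedelta(milliseconds=1)
    return elapsed_ms << LOW_BITS


def snowflake_time(snowflake):
    """Creation time embedded in an ID, to millisecond precision."""
    return DISCORD_EPOCH + timedelta(milliseconds=snowflake >> LOW_BITS)
```

With this, "messages in this channel after time T" becomes a range scan on the ID column, for example id >= snowflake_floor(T), with no separate created_at index.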

Back-of-Envelope Numbers

Metric	Value
Snowflake IDs per ms per machine	4,096
Max machines (10-bit)	1,024
Theoretical max throughput (all machines)	~4.2 billion IDs/sec
Snowflake epoch lifespan (41-bit ms)	~69.7 years
Snowflake ID size	64-bit (8 bytes)
UUID v4 size	128-bit (16 bytes)
UUID v4 collision probability (1B IDs)	~10⁻¹⁹ (negligible)
ID generation latency (embedded)	<0.1 ms (no network call)
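
The figures in the table follow directly from the bit widths. The short calculation below reproduces them. The collision estimate uses the standard birthday approximation and assumes 122 random bits per UUID v4; the variable and function names are chosen for this example only.

```python
import math

TIMESTAMP_BITS, MACHINE_BITS, SEQUENCE_BITS = 41, 10, 12
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

ids_per_ms_per_machine = 1 << SEQUENCE_BITS                  # 4096
max_machines = 1 << MACHINE_BITS                             # 1024
max_ids_per_sec = ids_per_ms_per_machine * 1000 * max_machines

# Lifespan = number of distinct timestamp values x size of one tick.
snowflake_years = (1 << TIMESTAMP_BITS) / MS_PER_YEAR        # 1 ms ticks
sonyflake_years = (1 << 39) * 10 / MS_PER_YEAR               # 10 ms ticks


def uuid4_collision_probability(n, random_bits=122):
    """Birthday approximation: p = 1 - exp(-n(n-1) / (2 * 2^bits))."""
    return -math.expm1(-n * (n - 1) / (2 * 2**random_bits))


print(f"{max_ids_per_sec:,} IDs/sec across all machines")
print(f"{snowflake_years:.1f} years (Snowflake), {sonyflake_years:.1f} years (Sonyflake)")
print(f"{uuid4_collision_probability(10**9):.1e} collision probability for 1B UUIDs")
```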
