โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram โ€” Token Bucket

[Diagram: a token bucket with capacity 10, refilling at 5 tokens/sec. Requests ①, ②, ③ each take a token; bursts are fine while tokens remain. Request ⑪ finds the bucket empty and receives HTTP 429. Steady state: ≤5 req/sec sustained. Responses carry X-RateLimit-Remaining (here 7) and X-RateLimit-Reset headers.]

How It Works

A rate limiter sits in front of your API (usually at the gateway) and tracks how many requests each client makes. When a client exceeds its limit, the limiter rejects further requests with HTTP 429 (Too Many Requests) until the allowance recovers, e.g. tokens refill or the window resets.
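Well-behaved clients use the response headers to self-throttle instead of hammering the API after a 429. A minimal sketch of that client-side decision, using the Retry-After and X-RateLimit-Reset conventions mentioned in this guide (the helper name and default backoff are illustrative):

```python
import time

def seconds_to_wait(headers, now=None):
    """Decide how long a client should back off after an HTTP 429.

    Prefers Retry-After (a delay in seconds) and falls back to
    X-RateLimit-Reset (a Unix timestamp of when the limit resets).
    """
    now = time.time() if now is None else now
    if "Retry-After" in headers:
        return max(0.0, float(headers["Retry-After"]))
    if "X-RateLimit-Reset" in headers:
        return max(0.0, float(headers["X-RateLimit-Reset"]) - now)
    return 1.0  # conservative default when the server gives no hint

# Example: the server says the window resets 30 seconds from "now"
print(seconds_to_wait({"X-RateLimit-Reset": 1680000030}, now=1680000000))  # 30.0
```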

Rate Limiting Algorithms

  1. Token Bucket โ€” Bucket holds max N tokens, refills at rate R/sec. Each request consumes one token. Allows bursts up to N while maintaining average rate R. Most common in practice (GitHub, Stripe).
  2. Leaky Bucket โ€” Requests enter a FIFO queue, processed at a fixed rate. Smooths output completely โ€” no bursts. Good for rate shaping but adds latency.
  3. Fixed Window โ€” Count requests per fixed time window (e.g., per minute). Simple but allows 2ร— rate at window boundaries (100 req at 0:59 + 100 at 1:01).
  4. Sliding Window Log โ€” Store timestamp of each request, count in last T seconds. Precise but memory-intensive (stores every timestamp).
  5. Sliding Window Counter โ€” Hybrid: weighted combination of current and previous window counts. Good balance of precision and efficiency.
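Algorithm 1 (token bucket) is compact enough to sketch in full. This is a single-process toy with lazy refill, not a production implementation; the injectable clock exists only to make the demo deterministic:

```python
import time

class TokenBucket:
    """Token bucket: capacity N, refilling at R tokens/sec (algorithm 1 above)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # start full: permits an initial burst
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Lazy refill: credit tokens for elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request admitted
        return False      # out of tokens -> caller returns HTTP 429

# Fake clock so the demo is deterministic
t = [0.0]
bucket = TokenBucket(capacity=10, refill_rate=5, clock=lambda: t[0])
print(sum(bucket.allow() for _ in range(12)))  # 10 -- burst up to capacity
t[0] += 1.0  # one second later, 5 tokens have refilled
print(sum(bucket.allow() for _ in range(12)))  # 5 -- back to the average rate
```

Note how the burst-then-average behavior described above falls directly out of the cap-at-capacity refill.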

Distributed Rate Limiting

With multiple API servers, you need shared state. Redis is the standard choice: use a Lua script (atomic INCR + EXPIRE, or token bucket logic) to avoid race conditions. A single Redis instance handles 100K+ ops/sec โ€” more than enough for rate limiting.
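The logic such a Lua script encapsulates is small: increment a per-client, per-window counter and compare it to the limit. Here is that logic modeled in plain Python, with a dict standing in for Redis (the key scheme and function names are illustrative, not the redis-py API); in real Redis the whole body runs as one atomic script, so concurrent API servers cannot race:

```python
# Model of the fixed-window check a Redis Lua script performs atomically
# (INCR the per-window key; EXPIRE would evict old windows).
store = {}  # key -> count; stands in for Redis

def allow(client_id, limit, window_sec, now):
    window = int(now // window_sec)
    key = f"rl:{client_id}:{window}"   # one counter per client per window
    count = store.get(key, 0) + 1      # INCR
    store[key] = count                 # EXPIRE is implied by the window in the key
    return count <= limit

# 100 req/min limit: the 101st request in the same window is rejected
results = [allow("alice", limit=100, window_sec=60, now=30.0) for _ in range(101)]
print(results.count(True))  # 100
print(results[-1])          # False
```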

Key Design Decisions

๐Ÿ“

Where to enforce โ€” gateway vs service: Gateway (Kong, AWS API Gateway) catches abuse early and is centralized. But service-level limits allow fine-grained rules (e.g., "only 100 repo creations/hour"). Best practice: coarse limit at gateway, fine-grained in services.

๐Ÿชฃ

Token Bucket vs Fixed Window: Token Bucket allows bursts while maintaining average rate โ€” better UX. Fixed Window is simpler but the boundary burst problem (2ร— rate at window edge) can overwhelm services. Token Bucket wins for most use cases.
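The boundary burst problem is easy to demonstrate numerically. With a limit of 100/minute, a client can send 100 requests at 0:59 and 100 more at 1:01, and a fixed-window counter accepts all 200 within about two seconds (this toy function is illustrative, keyed only by window index):

```python
def fixed_window_allow(counts, limit, window_sec, now):
    """Fixed-window check: count requests per window, reject above the limit."""
    window = int(now // window_sec)
    counts[window] = counts.get(window, 0) + 1
    return counts[window] <= limit

counts = {}
LIMIT, WINDOW = 100, 60
# 100 requests at t=59s (end of window 0), 100 more at t=61s (start of window 1)
late = sum(fixed_window_allow(counts, LIMIT, WINDOW, 59.0) for _ in range(100))
early = sum(fixed_window_allow(counts, LIMIT, WINDOW, 61.0) for _ in range(100))
print(late + early)  # 200 accepted in ~2 seconds -- 2x the nominal rate
```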

๐Ÿ”‘

Rate limit key: By API key (per-developer), by user ID (per-account), by IP (anonymous). Tiered limits: free tier = 100/hr, paid = 5000/hr. Consider: authenticated vs anonymous, read vs write operations.
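One way to sketch this decision: pick the most specific identity available, then look up the tier's limit. The tier table and key prefixes below are assumptions that mirror the free = 100/hr, paid = 5,000/hr example above:

```python
# Illustrative tier table; values match the example tiers in the text.
TIER_LIMITS = {"free": 100, "paid": 5000}  # requests per hour

def rate_limit_key_and_limit(api_key=None, user_id=None, ip=None, tier="free"):
    """Pick the most specific identity available, then its tier's limit."""
    if api_key:
        key = f"key:{api_key}"    # per-developer
    elif user_id:
        key = f"user:{user_id}"   # per-account
    else:
        key = f"ip:{ip}"          # anonymous fallback
    return key, TIER_LIMITS[tier]

print(rate_limit_key_and_limit(api_key="abc123", tier="paid"))  # ('key:abc123', 5000)
print(rate_limit_key_and_limit(ip="203.0.113.7"))               # ('ip:203.0.113.7', 100)
```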

๐Ÿ“Š

Hard vs soft limits: Hard: reject immediately at the limit. Soft: allow some overflow, log it, maybe degrade quality (serve cached response). Soft limits are friendlier but harder to enforce fairness.
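A soft-limit policy can be expressed as three bands. This is a hypothetical policy function (the 20% overflow band and the action names are made up for illustration):

```python
# Hypothetical soft-limit policy: normal service under the limit, degraded
# (e.g., cached) responses in an overflow band, hard rejection beyond that.
def soft_limit_decision(used, limit, overflow_ratio=0.2):
    if used < limit:
        return "serve"            # under the limit: normal response
    if used < limit * (1 + overflow_ratio):
        return "serve_degraded"   # overflow band: log it, serve cached/cheaper
    return "reject_429"           # hard ceiling

print(soft_limit_decision(90, 100))   # serve
print(soft_limit_decision(110, 100))  # serve_degraded
print(soft_limit_decision(130, 100))  # reject_429
```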

When to Use

  • "How do you prevent abuse?" โ€” Rate limiting is the first line of defense for any public API.
  • "How do you handle a DDoS?" โ€” Rate limiting + circuit breaker at the edge.
  • "Design an API for multi-tenant SaaS" โ€” Fair usage per tenant requires per-tenant rate limits.
  • "Design a chat app" โ€” Rate limit messages per user (e.g., 5 msg/sec, burst of 10) to prevent spam.

Interview signal: Lead with the algorithm choice and justify it. "I'd use token bucket because it handles bursts while maintaining average rate" is much stronger than just saying "I'd add rate limiting."

Real-World Examples

  • GitHub API โ€” 5,000 requests/hour per authenticated user. Token bucket with Redis. Returns X-RateLimit-Remaining headers so clients can self-throttle.
  • Stripe โ€” Rate limits per API key with different tiers. 100 req/sec for most endpoints, lower for resource-intensive operations. Graduated enforcement: warn, then throttle.
  • Cloudflare โ€” Edge rate limiting at 300+ PoPs worldwide. Rules based on URL path, IP, headers. Can block millions of req/sec at the edge before traffic reaches origin.
  • Discord โ€” Per-route rate limits (e.g., 5 msg/5sec per channel). Returns Retry-After header. Bots that ignore limits get globally rate-limited, then banned.

Back-of-Envelope Numbers

Metric                                       Value
-------------------------------------------  ----------------------------------------------
Redis memory per user (token bucket state)   ~64 bytes; × 2M active users ≈ 128 MB
Redis throughput for rate limiting           ~100K+ ops/sec (single instance)
Lua script latency (atomic check)            ~0.1 ms
GitHub rate limit                            5,000 req/hour per user
Stripe rate limit                            100 req/sec per API key
Fixed window boundary burst                  2× nominal rate in worst case
Cost of NOT rate limiting                    One script at 10K req/sec can down a service