APIs
Rate limiting: algorithms, the boundary burst, and the Redis counter
The limit is “100 requests per minute per API key” and the dashboard swears it is enforced. Then a customer pushes 380 requests through in two seconds and every one returns 200 OK. Nothing is broken in the limiter code — it works perfectly on one node. The problem: there are four app servers behind the load balancer, each running its own in-memory counter, so the real limit is 400/min. The extra 180 came from a client that hammered the window boundary across all four nodes at once. The limit was never real; it was four separate limits that happened to share a name.
Fixed window: cheap, and it leaks a 2x burst
The simplest limiter keys a counter by {key}:{minute} and does INCR; if the value exceeds the limit, reject. One integer per key, O(1), trivial to reason about. It is also the one every senior has been burned by, because of the boundary.
A fixed window resets on a wall-clock edge. If the limit is 100/min, a client can send 100 requests at 12:00:59.9 and another 100 at 12:01:00.1 — 200 requests inside a 200ms span, and both bursts are legal because they fall in different windows. The worst case is exactly 2x the configured limit concentrated at the seam between two windows. For a coarse internal throttle that is fine. For abuse protection or anything downstream-sensitive, that 2x is the difference between “protected” and “fell over.”
Sliding window: log is exact, counter is the cheap approximation
The sliding window log fixes the boundary by storing a timestamp for every request (a Redis sorted set), dropping entries older than the window, and counting what remains. It is exact — the limit is enforced over any rolling 60s, not a fixed minute. The cost is memory: a client doing 10k req/min holds 10k timestamps, and you pay ZADD plus a ZREMRANGEBYSCORE trim on every call. At scale the log is the memory hog you regret.
The sliding window counter is the production compromise. Keep two fixed-window counters — current and previous — and estimate the rolling count as current + previous * (overlap fraction). It is an approximation, but a famously good one: Cloudflare measured roughly 0.003% error across ~400 million requests, with two integers per key instead of thousands of timestamps. That accuracy-for-near-nothing trade is why it is the default in most edge limiters.
| Algorithm | Memory / key | Boundary burst? | When a senior reaches for it |
|---|---|---|---|
| Fixed window | 1 integer | Yes — up to 2x at the edge | Coarse internal throttle; approximate is fine |
| Sliding window log | N timestamps | No — exact | Auth/payments where an audit-grade exact count matters |
| Sliding window counter | 2 integers | Smoothed — ~0.003% error | Default edge limiter; great accuracy, tiny cost |
| Token bucket | 2 values (tokens + ts) | Allows a controlled burst | General-purpose API default (Stripe, AWS) — real traffic is bursty |
| Leaky bucket | 2 values | No — smooths to constant rate | Protecting a fragile downstream that needs steady throughput |
Token bucket: model the burst instead of fighting it
Real API traffic is bursty — a user opens a page and fires twelve requests at once, then idles. Token bucket embraces this. A bucket holds up to capacity tokens and refills at a steady rate (say 100 tokens/min); each request spends one token, and a request is rejected only when the bucket is empty. A client that sat idle accumulates tokens up to the cap, so it can spend a burst instantly, then is throttled to the refill rate. You enforce an average rate while allowing short spikes — which is exactly what a friendly API wants. It needs only two stored values (current tokens, last-refill timestamp), which is why Stripe and AWS use it as the general default.
Leaky bucket is the mirror image: requests drip out at a fixed rate regardless of how they arrive, smoothing traffic to protect a fragile downstream. The cost is that it adds queuing latency on the client’s side and refuses to let legitimate bursts through. Use it when the thing behind you needs a constant, predictable rate more than it needs responsiveness.
Why this works
“Token bucket allows bursts” sounds like a weakness until you flip it: the burst is bounded by capacity, and capacity is a knob you set. A 100/min limit with a capacity of 100 lets a fresh client fire 100 at once; set capacity to 10 and the same average rate tolerates only a 10-request spike. The burst is not a bug, it is the parameter that lets you match the limiter to how clients actually behave.
Distributed: the counter has to be shared, and atomic
The Hook bug is the one that bites in production. An in-memory limiter is correct on a single process and silently wrong the moment you scale horizontally — each of N nodes enforces the limit independently, so the effective limit is N times what you configured, and a client spreading requests across nodes blows straight through. The fix is to move the counter to a shared store, almost always Redis, so every node reads and writes the same number.
But shared is not enough — it has to be atomic. The naive fixed-window approach is INCR then EXPIRE, and that is a race: two requests can both INCR, and if the process dies between INCR and EXPIRE the key never gets a TTL and the counter leaks forever. The robust pattern is a single Lua script, which Redis runs atomically — the whole read-decide-update cycle happens with no other command interleaved, so concurrent requests cannot double-spend tokens. A token-bucket check in Lua is a handful of operations and runs in sub-millisecond time, cheap enough to sit on the synchronous request path even at millions of requests per second.
-- token_bucket.lua — KEYS[1] = bucket key, ARGV = rate, capacity, now, requested
local key = KEYS[1]
local rate = tonumber(ARGV[1]) -- tokens added per second
local capacity = tonumber(ARGV[2]) -- max tokens (the burst size)
local now = tonumber(ARGV[3]) -- current time, seconds
local requested = tonumber(ARGV[4]) -- tokens this request costs
local bucket = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(bucket[1]) or capacity
local ts = tonumber(bucket[2]) or now
-- refill based on elapsed time, capped at capacity
local elapsed = math.max(0, now - ts)
tokens = math.min(capacity, tokens + elapsed * rate)
local allowed = tokens >= requested
if allowed then tokens = tokens - requested end
redis.call("HSET", key, "tokens", tokens, "ts", now)
redis.call("EXPIRE", key, math.ceil(capacity / rate) * 2)
-- return allowed flag + retry-after seconds for the 429
local retry = allowed and 0 or math.ceil((requested - tokens) / rate)
return { allowed and 1 or 0, retry }The 429 contract and the thundering herd
When you reject, the response is a contract, not just a status. Return 429 Too Many Requests with Retry-After (seconds, computed from when the next token frees up) so well-behaved clients back off instead of retrying in a tight loop. Add the RateLimit-* fields the IETF is standardizing — RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset — so clients can self-pace before they hit the wall. Crucially the draft specifies RateLimit-Reset as delta-seconds, not a UNIX timestamp: a timestamp would require client clock sync, and it would tell every throttled client to retry at the same absolute instant — a self-inflicted thundering herd at the reset boundary.
That herd is the second production trap. With a fixed window and a hard Retry-After of “32 seconds,” thousands of throttled clients all wake and retry in the same moment, hammering you at the boundary. The mitigations: prefer a sliding/token-bucket model where capacity frees up continuously rather than all at once, and add jitter — spread Retry-After values across a small random range so clients return staggered instead of in a synchronized wave. Decide your per-key dimension deliberately too: per-API-key is fair to paying customers, per-IP catches anonymous abuse but punishes everyone behind a corporate NAT, and per-endpoint lets you protect an expensive route without throttling cheap ones.
A public API serves bursty browser clients across 6 app nodes behind a load balancer. You want a fair average limit that tolerates normal page-load bursts. Pick the design.
Your limit is 100/min using a fixed window. What is the worst-case number of requests a client can legitimately push in a ~1-second span?
Why run the Redis rate-limit check as a single Lua script instead of an INCR followed by an EXPIRE?
Order what happens when a request hits a token-bucket limiter and is allowed:
- 1 Read the bucket's stored tokens and last-refill timestamp from Redis
- 2 Refill: add (elapsed time x rate) tokens, capped at capacity
- 3 Check: are there at least as many tokens as the request costs?
- 4 Subtract the request's cost from the token count
- 5 Write tokens + new timestamp back atomically and let the request through
- 01An in-memory rate limiter passes every test on your laptop but lets traffic through far above the configured limit in production. Explain the failure and the fix.
- 02Why does a fixed window allow a 2x burst, and how do sliding window and token bucket each address it?
Rate limiting is a stack of tradeoffs. Fixed window is one integer and O(1) but leaks up to a 2x burst at the window boundary; the sliding window log fixes that exactly at the cost of a timestamp per request; the sliding window counter approximates it with two integers and roughly 0.003% error, which is why it is the common edge default; and token bucket models bursty real traffic directly, enforcing an average rate while allowing a bounded burst set by capacity, which is why Stripe and AWS reach for it. None of that matters if the counter lives in per-node memory — behind a load balancer that silently multiplies your limit by the node count, so the counter must live in a shared store, almost always Redis, updated atomically with a Lua script so concurrent requests cannot double-spend or leak a TTL-less key. When you reject, honor the contract: 429 with Retry-After plus the IETF RateLimit-* fields, expressed in delta-seconds and jittered so every throttled client does not stampede back at the same reset boundary.