awesome-everything RU
↑ Back to the climb

APIs

Rate limiting: algorithms, the boundary burst, and the Redis counter

Crux Fixed window is cheap but leaks a 2x burst at the boundary; token bucket smooths it. The real production lesson is that a per-node in-memory limiter multiplies your limit by N behind a load balancer — the counter has to live in Redis.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at junior altitude — the surface
◷ 17 min

The limit is “100 requests per minute per API key” and the dashboard swears it is enforced. Then a customer pushes 380 requests through in two seconds and every one returns 200 OK. Nothing is broken in the limiter code — it works perfectly on one node. The problem: there are four app servers behind the load balancer, each running its own in-memory counter, so the real limit is 400/min. The extra 180 came from a client that hammered the window boundary across all four nodes at once. The limit was never real; it was four separate limits that happened to share a name.

Fixed window: cheap, and it leaks a 2x burst

The simplest limiter keys a counter by {key}:{minute} and does INCR; if the value exceeds the limit, reject. One integer per key, O(1), trivial to reason about. It is also the one every senior has been burned by, because of the boundary.

A fixed window resets on a wall-clock edge. If the limit is 100/min, a client can send 100 requests at 12:00:59.9 and another 100 at 12:01:00.1 — 200 requests inside a 200ms span, and both bursts are legal because they fall in different windows. The worst case is exactly 2x the configured limit concentrated at the seam between two windows. For a coarse internal throttle that is fine. For abuse protection or anything downstream-sensitive, that 2x is the difference between “protected” and “fell over.”

Sliding window: log is exact, counter is the cheap approximation

The sliding window log fixes the boundary by storing a timestamp for every request (a Redis sorted set), dropping entries older than the window, and counting what remains. It is exact — the limit is enforced over any rolling 60s, not a fixed minute. The cost is memory: a client doing 10k req/min holds 10k timestamps, and you pay ZADD plus a ZREMRANGEBYSCORE trim on every call. At scale the log is the memory hog you regret.

The sliding window counter is the production compromise. Keep two fixed-window counters — current and previous — and estimate the rolling count as current + previous * (overlap fraction). It is an approximation, but a famously good one: Cloudflare measured roughly 0.003% error across ~400 million requests, with two integers per key instead of thousands of timestamps. That accuracy-for-near-nothing trade is why it is the default in most edge limiters.

AlgorithmMemory / keyBoundary burst?When a senior reaches for it
Fixed window1 integerYes — up to 2x at the edgeCoarse internal throttle; approximate is fine
Sliding window logN timestampsNo — exactAuth/payments where an audit-grade exact count matters
Sliding window counter2 integersSmoothed — ~0.003% errorDefault edge limiter; great accuracy, tiny cost
Token bucket2 values (tokens + ts)Allows a controlled burstGeneral-purpose API default (Stripe, AWS) — real traffic is bursty
Leaky bucket2 valuesNo — smooths to constant rateProtecting a fragile downstream that needs steady throughput

Token bucket: model the burst instead of fighting it

Real API traffic is bursty — a user opens a page and fires twelve requests at once, then idles. Token bucket embraces this. A bucket holds up to capacity tokens and refills at a steady rate (say 100 tokens/min); each request spends one token, and a request is rejected only when the bucket is empty. A client that sat idle accumulates tokens up to the cap, so it can spend a burst instantly, then is throttled to the refill rate. You enforce an average rate while allowing short spikes — which is exactly what a friendly API wants. It needs only two stored values (current tokens, last-refill timestamp), which is why Stripe and AWS use it as the general default.

Leaky bucket is the mirror image: requests drip out at a fixed rate regardless of how they arrive, smoothing traffic to protect a fragile downstream. The cost is that it adds queuing latency on the client’s side and refuses to let legitimate bursts through. Use it when the thing behind you needs a constant, predictable rate more than it needs responsiveness.

Why this works

“Token bucket allows bursts” sounds like a weakness until you flip it: the burst is bounded by capacity, and capacity is a knob you set. A 100/min limit with a capacity of 100 lets a fresh client fire 100 at once; set capacity to 10 and the same average rate tolerates only a 10-request spike. The burst is not a bug, it is the parameter that lets you match the limiter to how clients actually behave.

Distributed: the counter has to be shared, and atomic

The Hook bug is the one that bites in production. An in-memory limiter is correct on a single process and silently wrong the moment you scale horizontally — each of N nodes enforces the limit independently, so the effective limit is N times what you configured, and a client spreading requests across nodes blows straight through. The fix is to move the counter to a shared store, almost always Redis, so every node reads and writes the same number.

But shared is not enough — it has to be atomic. The naive fixed-window approach is INCR then EXPIRE, and that is a race: two requests can both INCR, and if the process dies between INCR and EXPIRE the key never gets a TTL and the counter leaks forever. The robust pattern is a single Lua script, which Redis runs atomically — the whole read-decide-update cycle happens with no other command interleaved, so concurrent requests cannot double-spend tokens. A token-bucket check in Lua is a handful of operations and runs in sub-millisecond time, cheap enough to sit on the synchronous request path even at millions of requests per second.

-- token_bucket.lua  — KEYS[1] = bucket key, ARGV = rate, capacity, now, requested
local key       = KEYS[1]
local rate      = tonumber(ARGV[1])   -- tokens added per second
local capacity  = tonumber(ARGV[2])   -- max tokens (the burst size)
local now       = tonumber(ARGV[3])   -- current time, seconds
local requested = tonumber(ARGV[4])   -- tokens this request costs

local bucket    = redis.call("HMGET", key, "tokens", "ts")
local tokens    = tonumber(bucket[1]) or capacity
local ts        = tonumber(bucket[2]) or now

-- refill based on elapsed time, capped at capacity
local elapsed   = math.max(0, now - ts)
tokens          = math.min(capacity, tokens + elapsed * rate)

local allowed   = tokens >= requested
if allowed then tokens = tokens - requested end

redis.call("HSET", key, "tokens", tokens, "ts", now)
redis.call("EXPIRE", key, math.ceil(capacity / rate) * 2)
-- return allowed flag + retry-after seconds for the 429
local retry = allowed and 0 or math.ceil((requested - tokens) / rate)
return { allowed and 1 or 0, retry }

The 429 contract and the thundering herd

When you reject, the response is a contract, not just a status. Return 429 Too Many Requests with Retry-After (seconds, computed from when the next token frees up) so well-behaved clients back off instead of retrying in a tight loop. Add the RateLimit-* fields the IETF is standardizing — RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset — so clients can self-pace before they hit the wall. Crucially the draft specifies RateLimit-Reset as delta-seconds, not a UNIX timestamp: a timestamp would require client clock sync, and it would tell every throttled client to retry at the same absolute instant — a self-inflicted thundering herd at the reset boundary.

That herd is the second production trap. With a fixed window and a hard Retry-After of “32 seconds,” thousands of throttled clients all wake and retry in the same moment, hammering you at the boundary. The mitigations: prefer a sliding/token-bucket model where capacity frees up continuously rather than all at once, and add jitter — spread Retry-After values across a small random range so clients return staggered instead of in a synchronized wave. Decide your per-key dimension deliberately too: per-API-key is fair to paying customers, per-IP catches anonymous abuse but punishes everyone behind a corporate NAT, and per-endpoint lets you protect an expensive route without throttling cheap ones.

Pick the best fit

A public API serves bursty browser clients across 6 app nodes behind a load balancer. You want a fair average limit that tolerates normal page-load bursts. Pick the design.

Quiz

Your limit is 100/min using a fixed window. What is the worst-case number of requests a client can legitimately push in a ~1-second span?

Quiz

Why run the Redis rate-limit check as a single Lua script instead of an INCR followed by an EXPIRE?

Order the steps

Order what happens when a request hits a token-bucket limiter and is allowed:

  1. 1 Read the bucket's stored tokens and last-refill timestamp from Redis
  2. 2 Refill: add (elapsed time x rate) tokens, capped at capacity
  3. 3 Check: are there at least as many tokens as the request costs?
  4. 4 Subtract the request's cost from the token count
  5. 5 Write tokens + new timestamp back atomically and let the request through
Recall before you leave
  1. 01
    An in-memory rate limiter passes every test on your laptop but lets traffic through far above the configured limit in production. Explain the failure and the fix.
  2. 02
    Why does a fixed window allow a 2x burst, and how do sliding window and token bucket each address it?
Recap

Rate limiting is a stack of tradeoffs. Fixed window is one integer and O(1) but leaks up to a 2x burst at the window boundary; the sliding window log fixes that exactly at the cost of a timestamp per request; the sliding window counter approximates it with two integers and roughly 0.003% error, which is why it is the common edge default; and token bucket models bursty real traffic directly, enforcing an average rate while allowing a bounded burst set by capacity, which is why Stripe and AWS reach for it. None of that matters if the counter lives in per-node memory — behind a load balancer that silently multiplies your limit by the node count, so the counter must live in a shared store, almost always Redis, updated atomically with a Lua script so concurrent requests cannot double-spend or leak a TTL-less key. When you reject, honor the contract: 429 with Retry-After plus the IETF RateLimit-* fields, expressed in delta-seconds and jittered so every throttled client does not stampede back at the same reset boundary.

Continue the climb ↑Rate limiting: multiple-choice review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.