APIs API · 07 · 01

Rate limiting: algorithms, the boundary burst, and the Redis counter

Fixed window is cheap but leaks a 2x burst at the boundary; token bucket smooths it. The real production lesson is that a per-node in-memory limiter multiplies your limit by N behind a load balancer — the counter has to live in Redis.

API Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

Already know this unit? Take a 1-minute quick check →

The limit is “100 requests per minute per API key” and the dashboard swears it is enforced. Then a customer pushes 380 requests through in two seconds and every one returns 200 OK. Nothing is broken in the limiter code — it works perfectly on one node. The problem: there are four app servers behind the load balancer, each running its own in-memory counter, so the real limit is 400/min. The extra 180 came from a client that hammered the window boundary across all four nodes at once. The limit was never real; it was four separate limits that happened to share a name.

In the next few minutes you will see why every algorithm eventually fails the same way — and exactly what it takes to make a rate limit that actually holds across multiple nodes.

Fixed window: cheap, and it leaks a 2x burst

The simplest limiter keys a counter by {key}:{minute} and does INCR; if the value exceeds the limit, reject. One integer per key, O(1), trivial to reason about. It is also the one every senior has been burned by, because of the boundary.

A fixed window resets on a wall-clock edge. If the limit is 100/min, a client can send 100 requests at 12:00:59.9 and another 100 at 12:01:00.1 — 200 requests inside a 200ms span, and both bursts are legal because they fall in different windows. The worst case is exactly 2x the configured limit concentrated at the seam between two windows. For a coarse internal throttle that is fine. For abuse protection or anything downstream-sensitive, that 2x is the difference between “protected” and “fell over.”

Sliding window: log is exact, counter is the cheap approximation

The sliding window log fixes the boundary by storing a timestamp for every request (a Redis sorted set), dropping entries older than the window, and counting what remains. It is exact — the limit is enforced over any rolling 60s, not a fixed minute. The cost is memory: a client doing 10k req/min holds 10k timestamps, and you pay ZADD plus a ZREMRANGEBYSCORE trim on every call. At scale the log is the memory hog you regret.

The sliding window counter is the production compromise. Keep two fixed-window counters — current and previous — and estimate the rolling count as current + previous * (overlap fraction). It is an approximation, but a famously good one: Cloudflare measured roughly 0.003% error across ~400 million requests, with two integers per key instead of thousands of timestamps. That accuracy-for-near-nothing trade is why it is the default in most edge limiters.

Algorithm	Memory / key	Boundary burst?	When a senior reaches for it
Fixed window	1 integer	Yes — up to 2x at the edge	Coarse internal throttle; approximate is fine
Sliding window log	N timestamps	No — exact	Auth/payments where an audit-grade exact count matters
Sliding window counter	2 integers	Smoothed — ~0.003% error	Default edge limiter; great accuracy, tiny cost
Token bucket	2 values (tokens + ts)	Allows a controlled burst	General-purpose API default (Stripe, AWS) — real traffic is bursty
Leaky bucket	2 values	No — smooths to constant rate	Protecting a fragile downstream that needs steady throughput

Together these four algorithms cover every production scenario: fixed window when cost matters most and precision matters least, sliding log when you need an auditable record, sliding counter as the sensible default, and token bucket when clients are bursty by nature. Without the distributed fix below, though, every row in this table is broken behind a load balancer.

Token bucket: model the burst instead of fighting it

Real API traffic is bursty — a user opens a page and fires twelve requests at once, then idles. Token bucket embraces this. A bucket holds up to capacity tokens and refills at a steady rate (say 100 tokens/min); each request spends one token, and a request is rejected only when the bucket is empty. A client that sat idle accumulates tokens up to the cap, so it can spend a burst instantly, then is throttled to the refill rate. You enforce an average rate while allowing short spikes — which is exactly what a friendly API wants. It needs only two stored values (current tokens, last-refill timestamp), which is why Stripe and AWS use it as the general default.

Leaky bucket is the mirror image: requests drip out at a fixed rate regardless of how they arrive, smoothing traffic to protect a fragile downstream. The cost is that it adds queuing latency on the client’s side and refuses to let legitimate bursts through. Use it when the thing behind you needs a constant, predictable rate more than it needs responsiveness.

▸Why this works

“Token bucket allows bursts” sounds like a weakness until you flip it: the burst is bounded by capacity, and capacity is a knob you set. A 100/min limit with a capacity of 100 lets a fresh client fire 100 at once; set capacity to 10 and the same average rate tolerates only a 10-request spike. The burst is not a bug, it is the parameter that lets you match the limiter to how clients actually behave.

Distributed: the counter has to be shared, and atomic

The Hook bug is the one that bites in production. An in-memory limiter is correct on a single process and silently wrong the moment you scale horizontally — each of N nodes enforces the limit independently, so the effective limit is N times what you configured, and a client spreading requests across nodes blows straight through. The fix is to move the counter to a shared store, almost always Redis, so every node reads and writes the same number.

You set 100/min, but four per-node counters make the real ceiling 400 — so the client's 380-request burst sailed through. N nodes means N× your limit until the counter lives in Redis.

But shared is not enough — it has to be atomic. The naive fixed-window approach is INCR then EXPIRE, and that is a race: two requests can both INCR, and if the process dies between INCR and EXPIRE the key never gets a TTL and the counter leaks forever. The robust pattern is a single Lua script, which Redis runs atomically — the whole read-decide-update cycle happens with no other command interleaved, so concurrent requests cannot double-spend tokens. A token-bucket check in Lua is a handful of operations and runs in sub-millisecond time, cheap enough to sit on the synchronous request path even at millions of requests per second.

-- token_bucket.lua  — KEYS[1] = bucket key, ARGV = rate, capacity, now, requested
local key       = KEYS[1]
local rate      = tonumber(ARGV[1])   -- tokens added per second
local capacity  = tonumber(ARGV[2])   -- max tokens (the burst size)
local now       = tonumber(ARGV[3])   -- current time, seconds
local requested = tonumber(ARGV[4])   -- tokens this request costs

local bucket    = redis.call("HMGET", key, "tokens", "ts")
local tokens    = tonumber(bucket[1]) or capacity
local ts        = tonumber(bucket[2]) or now

-- refill based on elapsed time, capped at capacity
local elapsed   = math.max(0, now - ts)
tokens          = math.min(capacity, tokens + elapsed * rate)

local allowed   = tokens >= requested
if allowed then tokens = tokens - requested end

redis.call("HSET", key, "tokens", tokens, "ts", now)
redis.call("EXPIRE", key, math.ceil(capacity / rate) * 2)
-- return allowed flag + retry-after seconds for the 429
local retry = allowed and 0 or math.ceil((requested - tokens) / rate)
return { allowed and 1 or 0, retry }

The 429 contract and the thundering herd

When you reject, the response is a contract, not just a status. Return 429 Too Many Requests with Retry-After (seconds, computed from when the next token frees up) so well-behaved clients back off instead of retrying in a tight loop. Add the RateLimit-* fields the IETF is standardizing — RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset — so clients can self-pace before they hit the wall. Crucially the draft specifies RateLimit-Reset as delta-seconds, not a UNIX timestamp: a timestamp would require client clock sync, and it would tell every throttled client to retry at the same absolute instant — a self-inflicted thundering herd at the reset boundary.

That herd is the second production trap. With a fixed window and a hard Retry-After of “32 seconds,” thousands of throttled clients all wake and retry in the same moment, hammering you at the boundary. The mitigations: prefer a sliding/token-bucket model where capacity frees up continuously rather than all at once, and add jitter — spread Retry-After values across a small random range so clients return staggered instead of in a synchronized wave. Decide your per-key dimension deliberately too: per-API-key is fair to paying customers, per-IP catches anonymous abuse but punishes everyone behind a corporate NAT, and per-endpoint lets you protect an expensive route without throttling cheap ones.

Pick the best fit

A public API serves bursty browser clients across 6 app nodes behind a load balancer. You want a fair average limit that tolerates normal page-load bursts. Pick the design.

Quiz

Your limit is 100/min using a fixed window. What is the worst-case number of requests a client can legitimately push in a ~1-second span?

Quiz

Why run the Redis rate-limit check as a single Lua script instead of an INCR followed by an EXPIRE?

Order the steps

Order what happens when a request hits a token-bucket limiter and is allowed:

1 Read the bucket's stored tokens and last-refill timestamp from Redis
2 Refill: add (elapsed time x rate) tokens, capped at capacity
3 Check: are there at least as many tokens as the request costs?
4 Subtract the request's cost from the token count
5 Write tokens + new timestamp back atomically and let the request through

Tokens refill at a steady rate up to capacity (the burst knob). Each request spends one; an empty bucket returns 429 with Retry-After until the next token refills.

N nodes with local counters means N× your configured limit — the boundary burst is invisible until a load test or an attack. The fix is one atomic Lua script in Redis: every node reads and writes the same token bucket, sub-millisecond, on the synchronous request path.

Recall before you leave

01
An in-memory rate limiter passes every test on your laptop but lets traffic through far above the configured limit in production. Explain the failure and the fix.
02
Why does a fixed window allow a 2x burst, and how do sliding window and token bucket each address it?

Recap

Rate limiting is a stack of tradeoffs. Fixed window is one integer and O(1) but leaks up to a 2x burst at the window boundary; the sliding window log fixes that exactly at the cost of a timestamp per request; the sliding window counter approximates it with two integers and roughly 0.003% error, which is why it is the common edge default; and token bucket models bursty real traffic directly, enforcing an average rate while allowing a bounded burst set by capacity, which is why Stripe and AWS reach for it. None of that matters if the counter lives in per-node memory — behind a load balancer that silently multiplies your limit by the node count, so the counter must live in a shared store, almost always Redis, updated atomically with a Lua script so concurrent requests cannot double-spend or leak a TTL-less key. When you reject, honor the contract: 429 with Retry-After plus the IETF RateLimit-* fields, expressed in delta-seconds and jittered so every throttled client does not stampede back at the same reset boundary. Now when you see a rate limiter that passes every test in CI but lets traffic sail through in production, your first question is: how many nodes, and where does the counter live?

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Mini OAuth 2.0 + PKCE loginImplement the authorization-code + PKCE flow end to end against a real provider, so you understand every redirect and token instead of trusting a library.Distributed rate limiterBuild a token-bucket limiter that holds across many app instances by keeping the counter in Redis, not in process memory.