Networking & Protocols NET · 11 · 03

Rate limiting: algorithms and architecture

Token bucket, leaky bucket, fixed window, and sliding window counter each make a different tradeoff between accuracy, memory, burst tolerance, and implementation complexity.

NET Middle ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Your API serves 100 req/sec on average. During a DDoS it gets 100,000 req/sec. You could block by IP, but the attacker uses a thousand different IPs each doing 100 req/sec. Any per-IP limit at 100 req/sec would also block your legitimate users. The question is not “how many requests?” but “which algorithm correctly defines abuse vs normal traffic?”

Token bucket. Tokens accumulate in a bucket at a fixed rate R (e.g., 100 tokens/sec) up to a maximum capacity C (e.g., 1,000 tokens). Each request costs one token; if the bucket is empty, the request is rejected. Allows controlled bursts: if the bucket is full at rest, the first 1,000 requests go through instantly, then the rate caps at 100/sec. This is the most common default for public APIs because it is simple (one counter + timestamp) and allows legitimate traffic spikes (a user refreshing their browser tab 20 times in a second).

Leaky bucket. Requests queue and drain at a fixed rate D, smoothing traffic. A burst of 1,000 requests does not get through instantly — they enter the queue and drain at D per second. This adds latency for legitimate burst traffic but produces perfectly smooth output. Useful for downstream services that cannot handle bursts, not for user-facing APIs where latency matters.

Fixed window. Count requests in a time window (e.g., per second). Simple but has the boundary-burst problem: two requests at 59.999 seconds and two at 60.001 seconds counts as two in each window — four requests across a 4-millisecond span — but the window logic sees them as separate windows. An attacker exploiting this can double the effective rate limit by timing requests at the window boundary.

Sliding window log. Store every request timestamp in a list; reject if more than N timestamps within the last T seconds. Most accurate — no boundary problem — but memory scales with request count. At 10,000 req/sec, storing each timestamp for 1 second requires 10,000 entries per client. Impractical for high-traffic APIs.

Sliding window counter. Blend fixed-window efficiency with sliding-window accuracy. Keep counters for the previous window (prev) and current window (curr). At request time T: estimated_count = prev * (1 - elapsed/window_size) + curr. If estimated_count < limit, allow; else reject. Memory: O(1) per client (two counters). Accuracy: approximate but much better than fixed window. This is the practical default for distributed systems.

The O(1) algorithms hold one or two counters per client regardless of traffic; the sliding window log stores one timestamp per request, so at 10,000 req/sec it holds 10,000 entries per client — the gap that makes it impractical at scale.

Algorithm	Burst handling	Memory	Boundary problem	Best for
Token bucket	Allows burst (up to C)	O(1)	None	Public APIs, user-facing
Leaky bucket	Smooths burst (adds latency)	O(queue)	None	Downstream rate smoothing
Fixed window	None (boundary problem)	O(1)	Yes	Simple internal counters
Sliding window log	Accurate, no burst	O(n requests)	None	Low-traffic strict limits
Sliding window counter	Approximate, good accuracy	O(1)	Minimal	Distributed systems

Token bucket rate limiting

1/3

Where to enforce rate limits. The algorithm choice only matters if you enforce limits at the right layer — enforce at the wrong one and the attacker simply routes around it. Three layers, each stopping different attack patterns:

Edge (CDN/scrubbing center) — blocks the largest attacks before they enter your network. Per-IP, per-ASN, or per-country limits. No application context available.
Gateway/load balancer — can rate-limit per user, per API key, or per origin connection. Has auth context. Stops bot-like patterns that pass edge limits.
Application layer — can rate-limit by business logic (e.g., max 3 failed login attempts per minute, max 10 database queries per request). Slowest but most contextual.

Enforce at all three layers: edge stops volumetric floods, gateway stops bot-like patterns, app-layer stops abuse by authenticated users.

Distributed rate limiting via Redis. When running multiple app servers, a local in-memory counter is insufficient — a client can hit different servers and each sees only its share of the traffic. Solution: shared counter in Redis. Every request atomically increments a Redis key and checks the limit. The cost: each request decision adds a Redis round-trip (0.5–1 ms on the same network). At 100k req/sec, that is 50–100 ms of added latency per request on Redis. Mitigation: local rate limiting (per-server counter, no Redis) plus occasional Redis sync for global stats — trades slight inaccuracy for microsecond-latency decisions.

Quiz

Why does the fixed window algorithm have a boundary-burst problem?

Adaptive/concurrency limiting. Instead of a per-second count, limit the number of in-flight requests. When in-flight >= max_concurrency, reject new requests. This prevents cascading failure (if one backend slows down, its queue backs up; rejecting new requests to it keeps other backends responsive). Load shedding: when queue depth exceeds a threshold, reject new requests probabilistically — survival probability = max_queue / queue_depth. Users see fast 503 errors instead of timeouts; the service stays responsive for others. The tradeoff: users get errors, not slow responses, during an overload.

Order the steps

Order the rate-limiting algorithms by memory efficiency (lowest memory first):

1 Fixed window (one counter per window)
2 Sliding window counter (two counters per client)
3 Token bucket (one counter + timestamp)
4 Sliding window log (store every request timestamp)

Quiz

In production, you raise your WAF to Paranoia Level 3 and false positives jump from 0.2% to 3%. What should you do first?

▸Why this works

Why is token bucket the most popular default for public APIs? Three reasons: (1) it allows legitimate burst traffic — a user who was inactive for 10 seconds has a full bucket and can make 1,000 requests instantly without hitting limits; (2) it is O(1) memory per client; (3) it is simple to implement in Redis with a single INCR + expiry. The boundary-burst problem of fixed window causes real production incidents — a user at the window boundary can legitimately hit 2x the limit — so most APIs that care about accuracy use token bucket or sliding window counter instead.

Tokens accumulate at rate R up to capacity C. Each arriving request takes one token; if the bucket is non-empty the request is allowed, and when the bucket is empty the request is rejected. A full bucket lets a burst of up to C requests through instantly, then the rate caps at R.

Recall before you leave

01
Explain why distributed rate limiting via Redis adds latency and how you would measure the impact.
02
What is load shedding and when is it preferable to rate limiting?
03
A token bucket has capacity 1000 and refill rate 100/sec. A user burst-drains it in 0.1 seconds. How long before they can make 200 more requests?

Recap

Rate limiting algorithms trade off burst tolerance, memory, and accuracy. Token bucket (O(1), allows bursts) and sliding window counter (O(1), no boundary problem) are the practical choices for production APIs. Fixed window has the boundary-burst problem — two half-windows of requests can exceed the limit in a small time span. Sliding window log is most accurate but O(n) memory makes it impractical at scale. Distributed systems need Redis-backed counters (adds ~0.5–1 ms per decision) or local counters with global sync. Beyond per-second rate limits, adaptive concurrency limiting rejects requests when in-flight count exceeds the server’s capacity — stopping overloads without requiring attacker IP detection. Now when you are asked to design rate limiting for a new API, start with token bucket at the edge and sliding window counter at the gateway — then add Redis sync only once you have measured that local counters are letting too much through.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

DDoS: what it is and why it worksjunior

unlocks

deepens into

Defense-in-depth architecture and attack economicssenior

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.