awesome-everything RU
↑ Back to the climb

Networking & Protocols

Rate limiting: algorithms and architecture

Crux Token bucket, leaky bucket, fixed window, and sliding window counter each make a different tradeoff between accuracy, memory, burst tolerance, and implementation complexity.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

Your API serves 100 req/sec on average. During a DDoS it gets 100,000 req/sec. You could block by IP, but the attacker uses a thousand different IPs each doing 100 req/sec. Any per-IP limit at 100 req/sec would also block your legitimate users. The question is not “how many requests?” but “which algorithm correctly defines abuse vs normal traffic?”

Token bucket. Tokens accumulate in a bucket at a fixed rate R (e.g., 100 tokens/sec) up to a maximum capacity C (e.g., 1,000 tokens). Each request costs one token; if the bucket is empty, the request is rejected. Allows controlled bursts: if the bucket is full at rest, the first 1,000 requests go through instantly, then the rate caps at 100/sec. This is the most common default for public APIs because it is simple (one counter + timestamp) and allows legitimate traffic spikes (a user refreshing their browser tab 20 times in a second).

Leaky bucket. Requests queue and drain at a fixed rate D, smoothing traffic. A burst of 1,000 requests does not get through instantly — they enter the queue and drain at D per second. This adds latency for legitimate burst traffic but produces perfectly smooth output. Useful for downstream services that cannot handle bursts, not for user-facing APIs where latency matters.

Fixed window. Count requests in a time window (e.g., per second). Simple but has the boundary-burst problem: two requests at 59.999 seconds and two at 60.001 seconds counts as two in each window — four requests across a 4-millisecond span — but the window logic sees them as separate windows. An attacker exploiting this can double the effective rate limit by timing requests at the window boundary.

Sliding window log. Store every request timestamp in a list; reject if more than N timestamps within the last T seconds. Most accurate — no boundary problem — but memory scales with request count. At 10,000 req/sec, storing each timestamp for 1 second requires 10,000 entries per client. Impractical for high-traffic APIs.

Sliding window counter. Blend fixed-window efficiency with sliding-window accuracy. Keep counters for the previous window (prev) and current window (curr). At request time T: estimated_count = prev * (1 - elapsed/window_size) + curr. If estimated_count < limit, allow; else reject. Memory: O(1) per client (two counters). Accuracy: approximate but much better than fixed window. This is the practical default for distributed systems.

AlgorithmBurst handlingMemoryBoundary problemBest for
Token bucketAllows burst (up to C)O(1)NonePublic APIs, user-facing
Leaky bucketSmooths burst (adds latency)O(queue)NoneDownstream rate smoothing
Fixed windowNone (boundary problem)O(1)YesSimple internal counters
Sliding window logAccurate, no burstO(n requests)NoneLow-traffic strict limits
Sliding window counterApproximate, good accuracyO(1)MinimalDistributed systems

Token bucket rate limiting

1/3

Where to enforce rate limits. Three layers, each stopping different attack patterns:

  1. Edge (CDN/scrubbing center) — blocks the largest attacks before they enter your network. Per-IP, per-ASN, or per-country limits. No application context available.
  2. Gateway/load balancer — can rate-limit per user, per API key, or per origin connection. Has auth context. Stops bot-like patterns that pass edge limits.
  3. Application layer — can rate-limit by business logic (e.g., max 3 failed login attempts per minute, max 10 database queries per request). Slowest but most contextual.

Enforce at all three layers: edge stops volumetric floods, gateway stops bot-like patterns, app-layer stops abuse by authenticated users.

Distributed rate limiting via Redis. When running multiple app servers, a local in-memory counter is insufficient — a client can hit different servers and each sees only its share of the traffic. Solution: shared counter in Redis. Every request atomically increments a Redis key and checks the limit. The cost: each request decision adds a Redis round-trip (0.5–1 ms on the same network). At 100k req/sec, that is 50–100 ms of added latency per request on Redis. Mitigation: local rate limiting (per-server counter, no Redis) plus occasional Redis sync for global stats — trades slight inaccuracy for microsecond-latency decisions.

Quiz

Why does the fixed window algorithm have a boundary-burst problem?

Adaptive/concurrency limiting. Instead of a per-second count, limit the number of in-flight requests. When in-flight >= max_concurrency, reject new requests. This prevents cascading failure (if one backend slows down, its queue backs up; rejecting new requests to it keeps other backends responsive). Load shedding: when queue depth exceeds a threshold, reject new requests probabilistically — survival probability = max_queue / queue_depth. Users see fast 503 errors instead of timeouts; the service stays responsive for others. The tradeoff: users get errors, not slow responses, during an overload.

Order the steps

Order the rate-limiting algorithms by memory efficiency (lowest memory first):

  1. 1 Fixed window (one counter per window)
  2. 2 Sliding window counter (two counters per client)
  3. 3 Token bucket (one counter + timestamp)
  4. 4 Sliding window log (store every request timestamp)
Quiz

In production, you raise your WAF to Paranoia Level 3 and false positives jump from 0.2% to 3%. What should you do first?

Why this works

Why is token bucket the most popular default for public APIs? Three reasons: (1) it allows legitimate burst traffic — a user who was inactive for 10 seconds has a full bucket and can make 1,000 requests instantly without hitting limits; (2) it is O(1) memory per client; (3) it is simple to implement in Redis with a single INCR + expiry. The boundary-burst problem of fixed window causes real production incidents — a user at the window boundary can legitimately hit 2x the limit — so most APIs that care about accuracy use token bucket or sliding window counter instead.

Recall before you leave
  1. 01
    Explain why distributed rate limiting via Redis adds latency and how you would measure the impact.
  2. 02
    What is load shedding and when is it preferable to rate limiting?
  3. 03
    A token bucket has capacity 1000 and refill rate 100/sec. A user burst-drains it in 0.1 seconds. How long before they can make 200 more requests?
Recap

Rate limiting algorithms trade off burst tolerance, memory, and accuracy. Token bucket (O(1), allows bursts) and sliding window counter (O(1), no boundary problem) are the practical choices for production APIs. Fixed window has the boundary-burst problem — two half-windows of requests can exceed the limit in a small time span. Sliding window log is most accurate but O(n) memory makes it impractical at scale. Distributed systems need Redis-backed counters (adds ~0.5–1 ms per decision) or local counters with global sync. Beyond per-second rate limits, adaptive concurrency limiting rejects requests when in-flight count exceeds the server’s capacity — stopping overloads without requiring attacker IP detection.

Connected lessons
appears again in258
Continue the climb ↑WAFs, firewalls, mTLS, and HSTS
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.