Performance PERF · 06 · 06

Backpressure, failure isolation, and batch security in production

Poison messages, partial-batch failure, bounded queues (block/drop/spill), compression bombs, adaptive windowing — the operational edge cases that crash batched systems.

PERF Senior ◷ 18 min

Level

FoundationsJuniorMiddleSenior

Your Kafka consumer batches 500 messages per poll. One has a corrupt header. The deserializer throws. Without isolation, you’ve just lost — or replayed — 499 good messages with it. Batching multiplies every failure by the batch size, and the same multiplier is what an attacker reaches for. This is the lesson the throughput tutorials skip.

Lessons 01–05 covered the upside: amortize fixed cost, tune the window, hit the Pareto knee. This is the capstone, and it covers the downside — the operational edge cases that turn a throughput win into a 3 AM page. Every property that makes batching efficient (atomic consume, shared compression, one queue, one window) becomes a failure multiplier the moment something goes wrong. A senior who has run a batched pipeline in production thinks about these four things before the happy path is even merged.

Failure isolation: the poison-message problem

A batch is consumed atomically: you poll() 500 records, process them, then commit one offset. So if processing throws on item 47, what happens to items 1–46 and 48–500? The naive answers are all traps.

Retry the whole batch. Item 47 is still poison, so it throws again. The consumer never advances its offset. You have an infinite loop, consumer lag climbs without bound, and the partition is frozen behind one bad record. This is the classic poison-message stall — the kind that pages you because dashboards show “lag: 4.2M and rising” while the consumer looks healthy.
Skip the batch on error. You advance past all 500, losing 499 good messages to discard one bad one. For payments or orders that is data loss; for an audit log it can be a compliance incident.
Split-and-retry (bisect). On failure, halve the batch and retry each half. The failing half fails again — bisect only that half. Recurse until the batch is a single record, dead-letter that one, and commit the rest. You pay O(log N) extra requests to isolate one poison item out of N. For a 500-batch that is ~9 retries, not 500.

The cloud primitive for this is AWS Lambda’s BisectBatchOnFunctionError for Kinesis and DynamoDB Streams event sources: on a function error Lambda splits the batch in two and retries each half independently, narrowing to the offending record. Pair it with ReportBatchItemFailures, where the function returns the sequence number of the first failure so Lambda only retries from that point forward instead of re-running the whole window.

Each bisection throws away half the suspect window, so isolating one poison item out of 500 costs O(log N) ≈ 9 retries — not 500. The same N that amplifies the failure is what makes the search cheap.

None of this is complete without the back stop: a dead-letter queue (DLQ) plus a re-drive job. Records that fail past maximumRetryAttempts land in the DLQ (an SQS queue or a Kafka DLQ topic) with the original payload and error context. An operator inspects them out of band, fixes the schema or the bug, and re-drives the DLQ back into the main stream. A pipeline without a DLQ does not have failure isolation — it has a place where messages silently die.

Failure mode	Naive outcome	Senior fix
One item throws in a batch of N	Retry whole batch → infinite loop, lag explodes	Split-and-retry to isolate, then DLQ the one poison item
Producer faster than consumer	Unbounded in-mem queue → OOM kill	Bounded queue + an explicit overflow policy (block/drop/spill)
Compressed batch decompresses huge	Wire-size check passes, broker OOMs on expand	Enforce a post-decompression size cap, not just wire size
Duplicate payload replayed inside one batch	Per-batch dedup sees one key, processes all copies	Per-item idempotency key (nonce), validated per record
Static window wrong for current load	Over-batch at low load (latency) or under-batch at high load	Adaptive window: AIMD control loop on p99 vs SLO

Backpressure and bounded queues

Ask yourself: when your consumer falls behind, where does the mismatch go? If your answer is “the queue handles it,” the next question is: what happens when the queue is full?

At some point in every production system, producer rate exceeds consumer rate — a downstream slowdown, a GC pause, a deploy that doubles traffic. The queue between them is where that mismatch goes. The first decision is whether the queue is bounded at all.

Unbounded in-memory queue. Works beautifully until the producer outruns the consumer for long enough to fill the heap, then the JVM OOM-kills the process and you lose everything in flight. This is a classic anti-pattern — the queue that “never fills” until the one time it does, at 2 AM, taking the whole node with it.
Bounded + block. When the queue is full, the producer blocks. This is the heart of Reactive Streams: the consumer signals demand with request(n) and the producer may only emit that many — backpressure as a first-class protocol (Akka Streams, Project Reactor, RxJava all implement it). Latency degrades gracefully and throughput settles at the rate of the slowest stage. This is the right default when losing data costs more than added latency.
Bounded + drop. When full, discard the new item (or the oldest). Lossy by design, and correct when freshness beats completeness — StatsD over UDP is the canonical example: a dropped metric sample is invisible, but a producer that stalls on a full buffer breaks the very observability you needed.
Bounded + spill-to-disk. Page overflow to a local on-disk buffer (a write-ahead log), drained behind the in-memory queue. Durable across producer spikes at the cost of disk IOPS and added end-to-end latency. Vector, Fluentd, and Filebeat all offer this for log shipping.

One nuance worth getting right: spill-to-disk is a capacity lever, not the overflow policy itself — even a disk buffer is finite, so it still needs a when_full rule. Vector’s disk buffer defaults to when_full: block, propagating backpressure all the way to the source, and only switches to drop_newest when you explicitly choose to shed load. The point is that “spill” buys you a much larger, durable buffer before block-or-drop ever triggers.

The choice between these is a product decision, not a technical one. The question is: what is the cost of one lost item versus one second of added latency? For payments, a lost item is unacceptable — block. For a metrics firehose, a one-second stall blinds your dashboards — drop. Answer it once per pipeline and encode the answer in the queue config; don’t let it be an accident of which default the library shipped.

Pick the best fit

A payment-events pipeline: the consumer slows during a downstream incident and the bounded queue is filling. Pick the overflow policy a senior defends.

Security: the batch boundary as an attacker primitive

A batch boundary is leverage, and leverage is what attackers want. Three exploit classes recur.

Decompression bombs. A small compressed payload that expands to gigabytes on the receiver. The concrete Kafka incident is CVE-2023-34455 (with the incomplete-fix follow-up CVE-2023-43642): the snappy-java decompressor read a 4-byte chunk-length field and allocated a byte array of that size without an upper-bound check. A crafted frame declaring 0x7FFFFFFF made the broker try to allocate ~2 GB and throw OutOfMemoryError — a remote DoS affecting Kafka 0.8.0 through 3.5.0, fixed in snappy-java 1.1.10.1+. The general defense: validate the post-decompression size, not just the wire size. A 1 MB compressed batch can legitimately carry far more, so a wire-level limit is no protection at all.

Replay inside a batch. If your idempotency key is per-batch instead of per-item, an attacker submits one batch containing the same payload 100 times. The dedup check at the batch boundary sees one unique key and happily processes all 100 copies. Idempotency must be per-item, with a per-item nonce checked against a dedup store as each record is applied.

Reordering under partial failure. Batched writes to a KV store may not preserve submission order once split-and-retry kicks in: the retried half lands after the half that succeeded. If ordering carries meaning — event sourcing, CRDT merge, a state machine — attach a sequence number per item and have the consumer reject out-of-order records. Never assume batch order survives a retry.

The unifying rule: every batch property — total size, item count, idempotency key, ordering — must be validated post-decompression and per-item, never at the wire boundary or per-batch. A compressed batch is an opaque envelope; what’s inside is the attack surface.

Adaptive batching: closing the loop

A static window — linger.ms=10, batch.size=16384 — is wrong at every load except the one you tuned for. At low traffic it adds latency waiting for a batch that never fills; at high traffic it caps throughput below what the system could sustain. Adaptive systems treat the window as a control variable driven by a feedback signal.

Signal. Observed p99 latency against the SLO ceiling. When p99 < SLO − margin, grow the window for more throughput. When p99 > SLO, shrink it to protect latency.
AIMD (additive increase, multiplicative decrease). The exact control loop TCP congestion control uses: nudge the window up by a small step each healthy interval, halve it on an SLO breach. It probes gently for headroom and backs off hard under stress — the same instinct that keeps the internet stable applies to your batch window.
Production examples. Kafka’s batch.size is static, but linger.ms can be driven by a sidecar reacting to latency. Envoy’s adaptive concurrency filter runs a closely related control loop — a gradient controller (Netflix’s algorithm) that measures minRTT under light load and shrinks the in-flight concurrency limit as sampled latency rises above it. Same idea, controlling concurrency instead of window size.
Cost. You add an observability surface (per-window latency histograms) and a control loop that can itself oscillate if tuned badly. Worth it past roughly 100k QPS, where the gap between the right window at peak and at trough is large; below that, a well-chosen static window is simpler and fine.

▸Why this works

Why reuse TCP’s AIMD instead of a smarter controller? Because AIMD’s multiplicative decrease guarantees fast convergence away from overload, while additive increase keeps you from slamming back into it — the property that makes it stable under shared, noisy load. A PID controller can be tighter but is far easier to mistune into oscillation. For a batch window where the cost of over-shooting is an SLO breach, “back off hard, probe gently” is the conservative default.

The senior pre-ship checklist

Before a batched pipeline ships, a senior can answer all six of these. If any answer is “we didn’t decide,” that’s the gap that pages you.

Poison-message recovery path? Split-and-retry to isolate, DLQ the bad item, re-drive job to replay fixes.
Overflow behavior? Block, drop, or spill — chosen explicitly from the cost of a lost item vs added latency.
Maximum post-decompression payload size? A real cap on the expanded size, not just the wire size.
Per-item idempotency story? A per-item nonce checked against a dedup store, never a per-batch key.
Static or adaptive window? Static is fine below ~100k QPS — just know which you chose and why.
Observability surface? Batch-size histogram, queue-depth gauge, DLQ rate, consumer lag — the four signals that catch all of the above before they page you.

Quiz

A batch of 500 Kafka records throws on record 47. How does split-and-retry isolate the poison item?

Quiz

A 1 MB compressed Kafka batch passes the wire-size limit but OOMs the broker. What is the correct defense, per CVE-2023-34455?

Order the steps

Order the split-and-retry isolation steps for a poison record in a batch:

1 Processing throws somewhere in the batch; the whole batch is marked failed
2 Halve the batch and retry each half independently
3 Recurse only into the half that still fails, ignoring the half that passed
4 Bisect down to a single poison record
5 Dead-letter that one record and commit the offsets of all the good ones

When the consumer can't keep up, a bounded queue applies an explicit overflow policy (block, drop, or spill) and signals the producer to slow. An unbounded queue instead grows until the heap is exhausted and the process is OOM-killed.

Recall before you leave

01
A consumer batches 500 Kafka messages per poll. Message 47 throws during deserialization. Walk through the split-and-retry isolation pattern and what makes it cheap.
02
A producer is faster than the consumer and the bounded queue is full. What changes between block, drop, and spill-to-disk, and how do you choose?
03
Why is per-batch idempotency unsafe, and how does CVE-2023-34455 generalize the lesson about batch boundaries?

Recap

Production batching fails in ways the throughput-tuning lessons don’t cover, and each one is the batch’s efficiency turned against you. One bad item kills a whole batch — fix it with split-and-retry to isolate the poison record in O(log N) requests, then dead-letter it and re-drive (AWS Lambda’s BisectBatchOnFunctionError is this primitive). One slow consumer overflows the queue — bound it and pick block, drop, or spill on purpose, because the choice is a product decision about the cost of a lost item versus latency, not a technical accident. One malicious batch amplifies an attack by orders of magnitude — defend with post-decompression size caps (CVE-2023-34455), per-item idempotency, and sequence numbers, because a compressed batch is an opaque envelope you must validate per-item. Static windows are fine below ~100k QPS; above it, run an AIMD control loop on observed p99 versus SLO, the same congestion logic TCP uses. The six-item pre-ship checklist — poison path, overflow policy, decompression cap, per-item nonce, window choice, observability — is what separates a batching tutorial from a batching pipeline. Now when you ship a batched pipeline, run the checklist before you merge the happy path — the items you cannot answer are the ones that will page you at 3 AM.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

appears again in289

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Collaborative cursorsShow every connected user's live cursor and selection in a shared document, conflict-free, over WebSocket.