Performance
Batching: build and harden an ingest pipeline
Reading about windows, backpressure, and poison messages is not the same as running a pipeline that survives them. Build a small batching ingest path, tune its window against a latency SLO, then drive it into each production failure mode and prove your defenses hold — with evidence at every step.
Turn the unit’s mental model into a reproducible engineering loop: amortize a fixed cost with a size+time window, tune that window to an SLO from measured data, then harden the pipeline against overflow, poison messages, and a decompression bomb — verifying each with before/after numbers.
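As a rough model of why the window pays off, the arithmetic can be written out. The symbols here are assumptions introduced for illustration, not taken from the unit: $F$ is the fixed cost per operation, $c$ the marginal cost per item, $B$ the batch size, and $T$ the time window.

```latex
t_{\text{item}}(B) = \frac{F}{B} + c,
\qquad
\text{throughput}(B) \approx \frac{B}{F + B\,c},
\qquad
\text{wait added by the window} \le T
```

With hypothetical values $F = 1\,\text{ms}$ and $c = 10\,\mu\text{s}$, $B = 1$ costs 1.01 ms per item (about 990 items/s on one consumer), while $B = 100$ costs 20 µs per item (50,000 items/s). These numbers are placeholders; the lab asks you to replace them with your own measurements.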
Build an ingest pipeline whose hot path is dominated by a fixed per-operation cost, batch it behind a size+time window tuned to a stated p99 SLO, and harden it against the three production failure modes — overflow, poison messages, and a decompression bomb — proving the throughput win and each defense with measurements, not estimates.
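A minimal sketch of a size+time window, assuming a single-threaded caller that passes in the current time. The class name `WindowBatcher`, its parameters, and the `tick` method are hypothetical names chosen for this example; a real pipeline would read `time.monotonic()` and drive `tick` from a timer.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class WindowBatcher:
    """Flush when the batch reaches max_size OR the oldest item has waited max_wait_s."""
    max_size: int
    max_wait_s: float
    flush: Callable[[List[Any]], None]   # the fixed-cost operation being amortized
    _items: List[Any] = field(default_factory=list)
    _oldest: Optional[float] = None      # arrival time of the oldest buffered item

    def add(self, item: Any, now: float) -> None:
        if not self._items:
            self._oldest = now           # the time window starts at the first item
        self._items.append(item)
        if len(self._items) >= self.max_size:
            self._flush()                # size trigger: the throughput-bound regime

    def tick(self, now: float) -> None:
        # Time trigger: call periodically so a lone item at low load still flushes.
        if self._items and now - self._oldest >= self.max_wait_s:
            self._flush()

    def _flush(self) -> None:
        batch, self._items, self._oldest = self._items, [], None
        self.flush(batch)


flushed: List[List[int]] = []
b = WindowBatcher(max_size=3, max_wait_s=0.050, flush=flushed.append)
for i, t in enumerate([0.000, 0.001, 0.002]):   # burst: fills the size window
    b.add(i, now=t)
b.add(99, now=0.010)                             # lone item at low load
b.tick(now=0.030)                                # 20 ms waited: not yet due
b.tick(now=0.061)                                # 51 ms waited: timer flushes
print(flushed)                                   # [[0, 1, 2], [99]]
```

Passing `now` explicitly keeps the example deterministic, which also makes the window easy to unit-test before you start tuning `max_size` and `max_wait_s` against the SLO.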
- A before/after table: throughput (items/s), fixed-cost amortization (syscalls, round-trips, or commits per item), and p99 end-to-end latency, all measured under the same load. Batching should show a clear throughput win while p99 stays under the SLO.
- Evidence the window is tuned, not guessed: the batch-size histogram fills near max-size at peak (throughput-bound) and the timer bounds latency at low load, with p99 from the bursty replay sitting under the SLO.
- Overflow demo: a graph or log showing the bounded queue applying the chosen policy when producers outpace the consumer, the drop counter behaving as designed, and the unbounded variant OOM-killed for contrast.
- Poison-message demo: consumer lag stays flat while split-and-retry isolates and dead-letters the bad record and commits the rest — versus the abort-whole-batch baseline whose lag spirals.
- Decompression-bomb demo: the post-decompression cap rejects the expanding payload with a bounded-memory error instead of OOMing the process.
- A one-paragraph write-up: which fixed cost you amortized, why you chose your window and overflow policy from the SLO and the lost-item-vs-latency tradeoff, and which metric you'd alert on first.
- Make the window adaptive: run an AIMD control loop on observed p99 vs the SLO (grow the window when p99 < SLO − margin, halve it on a breach) and show it beats the best static window across both a low-load and a high-load regime.
- Add compression and prove it needs batching: compare bytes-on-the-wire with linger=0 (tiny batches, codec useless) vs a filled batch (2–4× compression), confirming the codec only bites on fat batches.
- Add a DLQ plus a re-drive job: after isolating the poison record, fix the underlying schema bug and replay the DLQ back into the main stream, showing zero data loss end to end.
- Reproduce the Nagle/delayed-ACK 40 ms stall on a request/response variant of the path (small write then wait), then fix it with TCP_NODELAY and show the latency floor disappear.
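The split-and-retry behavior named in the poison-message and DLQ items above can be sketched as a recursive bisection. This is an illustration under stated assumptions, not a definitive implementation: `process_with_bisect`, `commit`, and `dead_letter` are hypothetical names, and `commit` is assumed to be all-or-nothing per call, so a failed batch leaves nothing partially written.

```python
from typing import Any, Callable, List, Tuple


def process_with_bisect(
    batch: List[Any],
    commit: Callable[[List[Any]], None],
    dead_letter: Callable[[Any, Exception], None],
) -> None:
    """Try the whole batch; on failure split it in half and retry each half.

    A batch of one that still fails is treated as the poison record: it is
    dead-lettered and skipped, so the rest of the batch still commits.
    """
    if not batch:
        return
    try:
        commit(batch)
    except Exception as exc:
        if len(batch) == 1:
            dead_letter(batch[0], exc)
            return
        mid = len(batch) // 2
        process_with_bisect(batch[:mid], commit, dead_letter)
        process_with_bisect(batch[mid:], commit, dead_letter)


committed: List[int] = []
dlq: List[Tuple[Any, str]] = []


def commit(items: List[Any]) -> None:
    # Hypothetical sink: rejects the whole batch atomically if any record is malformed.
    if any(not isinstance(x, int) for x in items):
        raise ValueError("schema violation")
    committed.extend(items)


process_with_bisect(
    [1, 2, "bad", 4, 5],
    commit,
    lambda rec, exc: dlq.append((rec, str(exc))),
)
print(committed)  # [1, 2, 4, 5]
print(dlq)        # [('bad', 'schema violation')]
```

This sketch dead-letters on any exception. A production version would likely separate retryable errors (timeouts, broker unavailability) from permanent ones, so that a transient outage does not send healthy records to the DLQ.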
This is the loop you will run on every real batching path: confirm the fixed cost dominates, add a size+time window, instrument size/wait/depth/drops, then tune the window from the SLO and validate on replayed bursty traffic — never chase max throughput on a whiteboard. Then harden it, because every efficiency property is a failure multiplier: bound the queue and choose block/drop/spill from the lost-item-vs-latency tradeoff, isolate poison messages with split-and-retry plus a DLQ, and validate the batch boundary post-decompression and per-item. Doing it once on a toy pipeline makes the production version muscle memory.
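For the post-decompression check, one possible shape is a cap enforced while decompressing rather than after. The sketch below assumes zlib-framed payloads and uses the `max_length` argument of `zlib.decompressobj().decompress`; `bounded_decompress`, `PayloadTooLarge`, and the 1 MB limit are hypothetical names and values chosen for this example.

```python
import zlib


class PayloadTooLarge(Exception):
    """Raised when a payload would expand past the configured cap."""


def bounded_decompress(data: bytes, max_out: int) -> bytes:
    d = zlib.decompressobj()
    # Ask for at most max_out + 1 bytes: enough to detect an overrun without
    # ever materializing the full expansion. (max_length=0 would mean
    # "unlimited", so the +1 also keeps a zero cap safe.)
    out = d.decompress(data, max_out + 1)
    if len(out) > max_out:
        raise PayloadTooLarge(f"payload expands past {max_out} bytes")
    if not d.eof:
        raise ValueError("truncated or malformed compressed stream")
    return out


bomb = zlib.compress(b"\0" * 10_000_000)   # roughly 10 KB on the wire, 10 MB expanded
try:
    bounded_decompress(bomb, max_out=1_000_000)
except PayloadTooLarge as exc:
    print("rejected:", exc)

print(bounded_decompress(zlib.compress(b"hello"), max_out=1_000_000))  # b'hello'
```

The cap only bounds memory. Per-item validation of the decoded records is still needed afterwards, since a payload can be small and still malformed.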