Crux Read real producer configs, a Go buffered writer, a consumer retry loop, and a batcher metrics line; predict the behaviour and pick the highest-leverage fix.
◷ 14 min
Batching bugs hide in config defaults, a missing timer, and an over-eager retry. Read the code and the metrics, then choose the fix a senior engineer makes first.
Goal
Practise the loop you run on every batching path: read the config or hot loop, predict which trigger fires and what breaks, and reach for the highest-leverage fix before touching anything else.
A producer on this config pushes well below the cluster's capacity while brokers idle. compression.type is already zstd. What is the dominant problem and the first fix?
Heads-up acks=all adds replication latency per request, but with linger.ms=0 the real cap is tiny batches — you're sending far more requests than needed. Fix the batching first; relax acks only if durability allows and it's still slow.
Heads-up Codec choice is a second-order tweak. With linger.ms=0 the codec has almost nothing to compress in each near-empty batch, so the problem is the tiny batches, not which codec compresses them.
Heads-up 16 KB is the default and is rarely too large — and a smaller cap makes batches even tinier, worsening throughput. The producer isn't filling 16 KB anyway because linger.ms=0 flushes first.
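To make the request-count argument concrete, here is a minimal sketch in Go. It is a deliberately simplified model and not the Kafka client's actual sender logic: it assumes a batch opens at its first record and is sent as soon as the linger window expires or a record cap is reached, and it ignores records that accumulate while a request is in flight. The names countRequests and evenArrivals are hypothetical.

```go
package main

// countRequests is a simplified model, not the Kafka client's real sender:
// a batch opens at its first record and is sent once lingerMs has elapsed
// or it holds maxRecords. arrivalsMs must be sorted ascending.
func countRequests(arrivalsMs []int, lingerMs, maxRecords int) int {
	if maxRecords < 1 {
		maxRecords = 1 // guard: every batch carries at least one record
	}
	requests := 0
	i := 0
	for i < len(arrivalsMs) {
		deadline := arrivalsMs[i] + lingerMs
		n := 0
		// Take every record that arrives before the deadline, up to the cap.
		for i < len(arrivalsMs) && n < maxRecords && arrivalsMs[i] <= deadline {
			i++
			n++
		}
		requests++
	}
	return requests
}

// evenArrivals returns n arrival times spaced gapMs apart, starting at 0.
func evenArrivals(n, gapMs int) []int {
	out := make([]int, n)
	for i := range out {
		out[i] = i * gapMs
	}
	return out
}
```

Under this model, 1000 records arriving 1 ms apart cost 1000 requests with a linger of 0 and 91 requests with a linger of 10 ms, which is why the batching fix comes before any change to acks or the codec.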
Snippet 2 — the Go buffered writer
func newBatcher(w io.Writer) *bufio.Writer {
	return bufio.NewWriterSize(w, 64*1024) // 64 KB buffer
}

// hot path: many goroutines call this
func emit(bw *bufio.Writer, rec []byte) error {
	_, err := bw.Write(rec) // buffers; flushes only when full
	return err
}
Quiz
This batcher works great under load but its tail latency explodes when traffic drops overnight. What is missing, and why does the symptom appear only at low load?
Heads-up That's a real concern for correctness (bufio.Writer isn't goroutine-safe), but it's not what makes tail latency explode at low load. The missing max-wait timer is the cause of the stall described.
Heads-up A larger buffer makes the low-load stall worse — it takes even longer to fill. The fix is a timer that flushes a partial buffer, not a bigger size cap.
Heads-up A micro-optimization unrelated to the latency cliff. The structural defect is the absence of a max-wait flush, which is exactly what bites when the size trigger can't fire.
Snippet 3 — the consumer retry loop
func process(batch []Record) error {
	for _, r := range batch {
		if err := handle(r); err != nil {
			return err // abort whole batch, will be retried
		}
	}
	return commitOffsets(batch)
}
Quiz
One record in the batch is permanently malformed (handle always errors on it). The framework retries process(batch) on any returned error. What happens, and what is the right structure?
Heads-up commitOffsets runs only after the loop completes successfully; an early return commits nothing. Every retry reprocesses the whole batch and re-hits the poison record.
Heads-up Skipping the whole batch (if it even did) would discard every good record to drop one bad one — data loss. The correct mechanism is per-item isolation via split-and-retry plus a DLQ, not blanket skip.
Heads-up Idempotency makes retries safe to repeat, but the record still errors forever, so the offset still never advances. Idempotency doesn't break a poison-message stall; failure isolation does.
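The split-and-retry structure named above can be sketched as follows. This is a minimal illustration, not the framework's API: the Record fields, errMalformed, the handleBatch callback, and the dlq callback are all hypothetical, and the sketch treats any failure of a single-record batch as permanent (a real consumer would first retry transient errors with backoff).

```go
package main

import "errors"

// Record is a hypothetical stand-in for the lesson's Record type.
type Record struct {
	ID      int
	Payload string
}

// errMalformed stands in for a permanent, non-retryable failure.
var errMalformed = errors.New("malformed record")

// splitAndRetry runs handleBatch on the whole batch. On error it bisects the
// batch and recurses, so a poison record ends up alone in a batch of one and
// is handed to dlq, while the good records around it are still handled.
func splitAndRetry(batch []Record, handleBatch func([]Record) error, dlq func(Record, error)) {
	if len(batch) == 0 {
		return
	}
	err := handleBatch(batch)
	if err == nil {
		return
	}
	if len(batch) == 1 {
		dlq(batch[0], err) // isolated: park it instead of retrying forever
		return
	}
	mid := len(batch) / 2
	splitAndRetry(batch[:mid], handleBatch, dlq)
	splitAndRetry(batch[mid:], handleBatch, dlq)
}
```

Once splitAndRetry returns, every record has either been handled or dead-lettered, so the caller can commit offsets and the partition keeps moving. If handleBatch can apply some records before failing, the retried halves will see them again, which is where idempotent handling earns its keep.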
Reading this single batcher metrics line, which statement is correct?
Heads-up Backwards: drops=0 means no data lost, and 12% depth means the buffer is mostly empty — both are the healthy signals. The only mild concern is the timer-bound, under-full batch.
Heads-up 120/4096 is records-used over the cap — the batch holds 120 of a possible 4096, i.e. it's 3% full, the opposite of overflow.
Heads-up 20 ms is the configured max-wait, and it's the latency tax, not a CPU cost. Whether to shrink it depends on the SLO and whether batches are filling — here they're under-full, so shrinking the wait just makes batches even smaller.
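The reading order used in these answers (drops first, then depth, then flush reason and fill ratio) can be sketched in Go. The flushStats struct, its field names, and the 80% depth and 50% fill thresholds are assumptions made for illustration; they are not the format of the metrics line itself.

```go
package main

// flushStats holds the fields the quiz discusses. The field names and the
// thresholds used in diagnose are illustrative assumptions.
type flushStats struct {
	Reason   string  // what triggered the flush: "timer" or "size"
	Records  int     // records in the flushed batch
	Cap      int     // maximum records per batch
	DepthPct float64 // buffer depth as a percentage
	Drops    int     // records dropped
}

// fillRatio is records-used over the cap, in the range 0..1.
func (s flushStats) fillRatio() float64 {
	if s.Cap == 0 {
		return 0
	}
	return float64(s.Records) / float64(s.Cap)
}

// diagnose checks the leading signals first (drops, then depth) and only
// then looks at how full the batches are.
func diagnose(s flushStats) string {
	switch {
	case s.Drops > 0:
		return "data loss: records are being dropped"
	case s.DepthPct >= 80:
		return "backpressure: buffer is close to full"
	case s.Reason == "timer" && s.fillRatio() < 0.5:
		return "healthy, but batches are timer-bound and under-full"
	default:
		return "healthy"
	}
}
```

With the values quoted in the answers (timer flush, 120 of 4096 records, 12% depth, 0 drops), fillRatio is about 0.029 and diagnose reports a healthy but under-full batcher.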
Recap
Batching problems are diagnosed by reading config and code. Setting linger.ms=0 starves batches and neuters compression no matter the codec. A size-only buffer needs a max-wait timer or it stalls at low load. Abort-whole-batch retry on a permanent error is a poison-message stall, which split-and-retry plus a DLQ resolves. A batcher metrics line shows the flush reason, fill ratio, wait, depth, and drops at a glance: depth and drops lead, throughput lags. Diagnose from the signal, fix the highest-leverage cause, then re-measure.