Batching: code and config reading

Read a Kafka producer config, a Go buffered writer, a consumer retry loop, and a batcher metrics line; predict the behaviour and pick the highest-leverage fix.

PERF Senior ◷ 14 min

Batching bugs hide in config defaults, a missing timer, and an over-eager retry. Read the code and the metrics, then choose the fix a senior engineer would make first.

Goal

Practise the loop you run on every batching path: read the config or hot loop, predict which trigger fires and what breaks, and reach for the highest-leverage fix before touching anything else.

Snippet 1 — the Kafka producer config

linger.ms=0
batch.size=16384
compression.type=zstd
acks=all

Quiz

A producer on this config pushes well below the cluster's capacity while brokers idle. compression.type is already zstd. What is the dominant problem and the first fix?
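
One possible first fix, sketched as a config fragment: let the producer wait a few milliseconds so batches can fill, and raise the batch ceiling so zstd has more than a record or two to compress. The values 10 and 65536 are illustrative assumptions, not tuned recommendations; the right numbers depend on the latency budget and record size.

```properties
# Sketch only: values are assumptions to be measured, not recommendations.
# Wait up to 10 ms for a batch to fill before sending.
linger.ms=10
# Per-partition batch ceiling in bytes (was 16384).
batch.size=65536
# Unchanged. Compression is applied per batch, so fuller batches compress better.
compression.type=zstd
# Unchanged.
acks=all
```

After a change like this, re-measure batch fill and request rate before touching anything else.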

Snippet 2 — the Go buffered writer

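import (
    "bufio"
    "io"
)

func newBatcher(w io.Writer) *bufio.Writer {
    return bufio.NewWriterSize(w, 64*1024) // 64 KiB buffer
}

// hot path: many goroutines call this.
// Assumes callers serialise access: bufio.Writer is not safe for concurrent use.
func emit(bw *bufio.Writer, rec []byte) error {
    _, err := bw.Write(rec) // buffers; flushes downstream only when the buffer fills
    return err
}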

Quiz

This batcher works great under load but its tail latency explodes when traffic drops overnight. What is missing, and why does the symptom appear only at low load?
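
A minimal sketch of one way to bound the wait, assuming a wrapper type around the buffered writer. The names `timedBatcher`, `flushOnTimer`, and `sink`, and the 20 ms max wait, are hypothetical and not part of the lesson's code. The idea is to keep the size trigger and add a timer that is armed when data is buffered and no flush is pending. The mutex is there because `bufio.Writer` is not safe for concurrent use.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"sync"
	"time"
)

// sink is a demo writer that counts the bytes reaching it.
type sink struct{ n int }

func (s *sink) Write(p []byte) (int, error) { s.n += len(p); return len(p), nil }

// timedBatcher (hypothetical name) keeps bufio's size-triggered flush and
// adds a max-wait flush so a lone record is never stranded in the buffer.
type timedBatcher struct {
	mu      sync.Mutex // bufio.Writer is not safe for concurrent use
	bw      *bufio.Writer
	maxWait time.Duration
	timer   *time.Timer // non-nil while a timed flush is pending
}

func newTimedBatcher(w io.Writer, size int, maxWait time.Duration) *timedBatcher {
	return &timedBatcher{bw: bufio.NewWriterSize(w, size), maxWait: maxWait}
}

// emit buffers rec and arms the timer if none is already pending.
func (b *timedBatcher) emit(rec []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	_, err := b.bw.Write(rec)
	if err == nil && b.timer == nil && b.bw.Buffered() > 0 {
		b.timer = time.AfterFunc(b.maxWait, b.flushOnTimer)
	}
	return err
}

// flushOnTimer runs on the timer goroutine and drains whatever is buffered.
func (b *timedBatcher) flushOnTimer() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.timer = nil
	_ = b.bw.Flush() // bufio errors are sticky: the next Write reports them
}

func main() {
	out := &sink{}
	b := newTimedBatcher(out, 64*1024, 20*time.Millisecond)
	_ = b.emit([]byte("lone overnight record\n"))
	time.Sleep(100 * time.Millisecond) // the 64 KiB buffer is nowhere near full

	b.mu.Lock()
	fmt.Println("bytes flushed by the timer:", out.n)
	b.mu.Unlock()
}
```

In this sketch a timer armed before a size-triggered flush may fire on a partly filled buffer, so a record waits at most `maxWait` and sometimes less. That trades a little batch efficiency for a hard latency bound.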

Snippet 3 — the consumer retry loop

func process(batch []Record) error {
    for _, r := range batch {
        if err := handle(r); err != nil {
            return err            // abort whole batch, will be retried
        }
    }
    return commitOffsets(batch)
}

Quiz

One record in the batch is permanently malformed (handle always errors on it). The framework retries process(batch) on any returned error. What happens, and what is the right structure?
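
One possible structure, sketched under assumptions: split the batch at the failing record, retry that record alone a bounded number of times, then send it to a dead-letter queue and carry on with the rest. `Record`, `handle`, and `commitOffsets` are stand-ins for the snippet's undefined names, and `dlq`, `maxAttempts`, and `errMalformed` are hypothetical. A real loop would likely add backoff and distinguish transient from permanent errors.

```go
package main

import (
	"errors"
	"fmt"
)

// Record, handle, and commitOffsets are stand-ins for the snippet's
// undefined names; the real types and calls are not shown in the lesson.
type Record struct {
	Offset int64
	Body   string
}

var errMalformed = errors.New("malformed record")

var committed int64 = -1 // last committed offset (demo state)

func handle(r Record) error {
	if r.Body == "" {
		return errMalformed // permanent: retrying cannot succeed
	}
	return nil
}

func commitOffsets(batch []Record) error {
	if len(batch) > 0 {
		committed = batch[len(batch)-1].Offset
	}
	return nil
}

const maxAttempts = 3 // assumed retry budget per record

// process isolates each failing record instead of aborting the batch:
// retry it alone a bounded number of times, then dead-letter it and move on.
func process(batch []Record, dlq func(Record, error) error) error {
	for _, r := range batch {
		var err error
		for attempt := 0; attempt < maxAttempts; attempt++ {
			if err = handle(r); err == nil {
				break
			}
		}
		if err != nil {
			// If the DLQ write itself fails, do not commit past the record.
			if dlqErr := dlq(r, err); dlqErr != nil {
				return dlqErr
			}
		}
	}
	return commitOffsets(batch)
}

func main() {
	var dead []Record
	dlq := func(r Record, _ error) error { dead = append(dead, r); return nil }

	batch := []Record{{0, "a"}, {1, ""}, {2, "c"}}
	err := process(batch, dlq)
	fmt.Println(err, len(dead), committed) // prints: <nil> 1 2
}
```

If the DLQ write fails, the whole batch is retried, so `handle` should be idempotent for the records that already succeeded.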

Snippet 4 — the batcher metrics line

batch.flush reason=timer size=120/4096 recs wait_ms=20.0 depth=12% drops=0

Quiz

Reading this single batcher metrics line, which statement is correct?
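
To make the fields concrete, here is a small sketch that parses the line and computes the fill ratio. The field meanings are assumptions: `size=120/4096` is read as records in the flushed batch over batch capacity, and `depth` as how full the upstream queue is. The names `flushStats` and `parseFlushLine` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// flushStats holds the fields of one batch.flush line (assumed semantics).
type flushStats struct {
	Reason         string
	Size, Capacity int
	WaitMs         float64
	DepthPct       float64
	Drops          int
}

// fill returns the fraction of the batch capacity that was used.
func (s flushStats) fill() float64 { return float64(s.Size) / float64(s.Capacity) }

func parseFlushLine(line string) (flushStats, error) {
	var s flushStats
	var err error
	for _, f := range strings.Fields(line) {
		k, v, ok := strings.Cut(f, "=")
		if !ok {
			continue // tokens without '=' such as "batch.flush" and "recs"
		}
		switch k {
		case "reason":
			s.Reason = v
		case "size":
			n, c, ok := strings.Cut(v, "/")
			if !ok {
				return s, fmt.Errorf("bad size %q", v)
			}
			if s.Size, err = strconv.Atoi(n); err != nil {
				return s, err
			}
			if s.Capacity, err = strconv.Atoi(c); err != nil {
				return s, err
			}
		case "wait_ms":
			if s.WaitMs, err = strconv.ParseFloat(v, 64); err != nil {
				return s, err
			}
		case "depth":
			if s.DepthPct, err = strconv.ParseFloat(strings.TrimSuffix(v, "%"), 64); err != nil {
				return s, err
			}
		case "drops":
			if s.Drops, err = strconv.Atoi(v); err != nil {
				return s, err
			}
		}
	}
	if s.Capacity == 0 {
		return s, fmt.Errorf("missing or zero batch capacity")
	}
	return s, nil
}

func main() {
	line := "batch.flush reason=timer size=120/4096 recs wait_ms=20.0 depth=12% drops=0"
	s, err := parseFlushLine(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("reason=%s fill=%.1f%% wait=%.0fms depth=%.0f%% drops=%d\n",
		s.Reason, 100*s.fill(), s.WaitMs, s.DepthPct, s.Drops)
	// prints: reason=timer fill=2.9% wait=20ms depth=12% drops=0
}
```

Under those assumptions, a timer-triggered flush at about 3% fill with low depth and zero drops points to low offered load, not a saturated batcher.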

Recap

Batching problems are read from config and code. linger.ms=0 starves batches and neuters compression no matter the codec. A size-only buffer needs a max-wait timer or it stalls at low load. Abort-whole-batch retry on a permanent error is a poison-message stall, which split-and-retry plus a dead-letter queue (DLQ) resolves. A batcher metrics line tells you the flush reason, fill ratio, wait, depth, and drops at a glance; depth and drops are leading indicators, while throughput lags. Diagnose from the signal, fix the highest-leverage cause, then re-measure.
