Queues, Streams, Eventing QUE · 02 · 09

Kafka partitions: code and config reading

Read real producer code, consumer config, partition math, and a rebalance log, predict the behaviour, and pick the highest-leverage fix.

QUE Senior ◷ 14 min

Partition bugs are diagnosed in producer code, consumer config, and rebalance logs — not in prose. Read each snippet, predict what Kafka does with it, and choose the fix a senior engineer would make first.

Goal

Practise the loop you run in every Kafka incident: read the producer key, the consumer config, the partition math, and the rebalance log, then reach for the fix that restores ordering and parallelism at the root rather than papering over the symptom.

Snippet 1 — the producer key

// Order events: created, paid, shipped, cancelled all flow through here
ProducerRecord<String, OrderEvent> record =
    new ProducerRecord<>("order-events", event.getType(), event);
//                                       ^^^^^^^^^^^^^^^^^
//                                       key = event TYPE, not order id
producer.send(record);

Quiz

A consumer must process each order's events in order (created before cancelled). With this keying, what happens, and what is the fix?
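
One way to check a prediction for this snippet is a small simulation. The sketch below is not Kafka client code. It assumes a stand-in hash (CRC32 instead of Kafka's murmur2), two hypothetical orders, and a hypothetical helper `partitions_used`. It only illustrates that records sharing a key share a partition, which is why per-order ordering needs the order id as the key.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in for Kafka's murmur2."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# (order_id, event_type) pairs for two hypothetical orders
EVENTS = [
    ("A-4711", "created"), ("A-4711", "paid"),
    ("A-4711", "shipped"), ("A-4711", "cancelled"),
    ("B-0042", "created"), ("B-0042", "cancelled"),
]

def partitions_used(order_id, key_fn, num_partitions=6):
    """Set of partitions that one order's events land on under a keying scheme."""
    return {
        partition_for(key_fn(oid, etype), num_partitions)
        for oid, etype in EVENTS
        if oid == order_id
    }

def key_by_type(oid, etype):
    return etype   # what Snippet 1 does

def key_by_order(oid, etype):
    return oid     # key by the entity instead

by_type = partitions_used("A-4711", key_by_type)
by_order = partitions_used("A-4711", key_by_order)

print("keyed by event type:", sorted(by_type))
print("keyed by order id:  ", sorted(by_order))
```

Keyed by order id, every event for A-4711 maps to exactly one partition, and Kafka preserves order within a partition. Keyed by event type, an order's events are grouped with other orders' events of the same type rather than with each other, and because only four distinct keys exist, at most four partitions ever receive traffic. In Snippet 1 the corresponding change would be to pass the order id as the record key, for example through a hypothetical `event.getOrderId()` accessor.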

Snippet 2 — the partition math

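# Simplified model of the default partitioner for keyed records:
#   partition = toPositive(murmur2(key_bytes)) % N
# hash() below stands in for the murmur2 step. Python's built-in hash() is
# salted per process for strings, so the results shown are illustrative.
def partition_for(key, N):
    return hash(key) % N

# orderId "A-4711" before and after a partition increase
partition_for("A-4711", 6)    # -> 2 (illustrative)
partition_for("A-4711", 12)   # -> 8 (illustrative): same key, different partition!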

Quiz

A topic was raised from 6 to 12 partitions while live. Reading this math, what is the consequence for key A-4711, and why can't you undo it?
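
The modulo arithmetic behind this question can be checked with a short script. This is a sketch under stated assumptions: CRC32 stands in for Kafka's murmur2 so the run is deterministic, and the order ids are made up. What it demonstrates holds for any hash: because 12 is a multiple of 6, a key either stays on its old partition or moves to exactly old + 6.

```python
import zlib

def stable_hash(key: str) -> int:
    """Deterministic stand-in for Kafka's murmur2 (assumption: CRC32)."""
    return zlib.crc32(key.encode("utf-8"))

def partition_for(key: str, num_partitions: int) -> int:
    return stable_hash(key) % num_partitions

OLD_N, NEW_N = 6, 12
keys = [f"A-{i:04d}" for i in range(1000)]  # hypothetical order ids

moved = 0
for key in keys:
    old_p = partition_for(key, OLD_N)
    new_p = partition_for(key, NEW_N)
    # Since 12 is a multiple of 6, h % 12 is either h % 6 or h % 6 + 6.
    assert new_p in (old_p, old_p + OLD_N)
    if new_p != old_p:
        moved += 1

print(f"{moved} of {len(keys)} keys changed partition going {OLD_N} -> {NEW_N}")
```

The 2 -> 8 pair in the snippet fits this pattern (8 = 2 + 6). Records written before the increase stay where they were, because Kafka does not move existing data when partitions are added, so a moved key's history ends up split across two partitions that different consumers may read concurrently. Kafka also does not support reducing a topic's partition count, which is why the change cannot simply be reverted.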

Snippet 3 — the consumer config

# Consumer group: payments-processor, 4 instances behind a rolling deploy
group.id=payments-processor
session.timeout.ms=10000
heartbeat.interval.ms=3000
# group.instance.id is NOT set
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor

Quiz

Every rolling deploy causes a multi-second consumption stall across the whole group. Reading this config, what is the cause, and what is the lowest-risk change?
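
A config like this can be read mechanically. The sketch below is a hypothetical lint, not a Kafka tool: `parse_properties` and `deploy_rebalance_risks` are invented helper names, the check only looks at explicitly listed assignors, and the revised values (`payments-processor-0`, a 30 s session timeout) are illustrative assumptions that would need to match the real restart time of an instance.

```python
# Assignors that use the eager protocol (all partitions revoked on each rebalance).
EAGER_ASSIGNORS = {
    "org.apache.kafka.clients.consumer.RangeAssignor",
    "org.apache.kafka.clients.consumer.RoundRobinAssignor",
    "org.apache.kafka.clients.consumer.StickyAssignor",
}

def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()  # a later line overrides an earlier one
    return props

def deploy_rebalance_risks(props: dict) -> list:
    """Return findings for settings that make instance restarts disruptive."""
    findings = []
    if not props.get("group.instance.id"):
        findings.append("no group.instance.id: each restart joins as a new member")
    strategies = props.get("partition.assignment.strategy", "").split(",")
    if any(s.strip() in EAGER_ASSIGNORS for s in strategies):
        findings.append("eager assignor: every rebalance revokes all partitions")
    return findings

SNIPPET_3 = """
group.id=payments-processor
session.timeout.ms=10000
heartbeat.interval.ms=3000
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor
"""

# Hypothetical revision: static membership, with a session timeout assumed to
# cover one instance's restart. The id must be unique per instance and stable
# across restarts.
REVISED = SNIPPET_3 + "group.instance.id=payments-processor-0\nsession.timeout.ms=30000\n"

before = deploy_rebalance_risks(parse_properties(SNIPPET_3))
after = deploy_rebalance_risks(parse_properties(REVISED))
print(before)
print(after)
```

With static membership, a member that restarts and rejoins under the same `group.instance.id` before `session.timeout.ms` expires keeps its assignment without triggering a rebalance, which is why it is a plausible first change for a rolling deploy. The eager-assignor finding remains after that change. Moving to `CooperativeStickyAssignor` addresses it, but Kafka's upgrade notes describe that switch as a two-step rolling change, so it is arguably the riskier of the two.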

Snippet 4 — the rebalance log

[Consumer clientId=c3, groupId=payments] (Re-)joining group
[Consumer clientId=c3, groupId=payments] Lost previously assigned partitions order-events-2, order-events-5
[Consumer clientId=c3] Revoke previously assigned partitions order-events-2, order-events-5
... 7 such cycles in 90 seconds ...
[Consumer clientId=c3] Member c3 sending LeaveGroup request due to consumer poll timeout has expired

Quiz

Reading this log — repeated join/revoke cycles ending in a poll-timeout LeaveGroup — what is the failure mode, and what is the first fix?
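
The "poll timeout has expired" line points at the gap between two `poll()` calls exceeding `max.poll.interval.ms`. A rough budget check makes that concrete. This is a sketch: the 500-record and 300000 ms figures are the documented consumer defaults for `max.poll.records` and `max.poll.interval.ms`, while the 800 ms per-record cost, the 50% headroom, and both helper functions are assumptions for illustration. The consumer in the log may well run with different settings.

```python
# Kafka consumer defaults (documented values).
DEFAULT_MAX_POLL_RECORDS = 500
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000

def worst_case_batch_ms(max_poll_records: int, per_record_ms: float) -> float:
    """Time between two poll() calls if a full batch is processed serially."""
    return max_poll_records * per_record_ms

def max_safe_poll_records(per_record_ms, max_poll_interval_ms, headroom=0.5):
    """Largest batch that finishes within `headroom` of the poll interval."""
    return max(1, int(max_poll_interval_ms * headroom / per_record_ms))

PER_RECORD_MS = 800  # hypothetical: e.g. one slow downstream call per record

batch_ms = worst_case_batch_ms(DEFAULT_MAX_POLL_RECORDS, PER_RECORD_MS)
exceeds = batch_ms > DEFAULT_MAX_POLL_INTERVAL_MS
safe_records = max_safe_poll_records(PER_RECORD_MS, DEFAULT_MAX_POLL_INTERVAL_MS)

print(f"worst-case batch: {batch_ms} ms, exceeds interval: {exceeds}")
print(f"max.poll.records with 50% headroom: {safe_records}")
```

Under these assumed numbers a full batch takes 400000 ms, which is longer than the poll interval, so the consumer is removed from the group, rejoins, and the cycle repeats. Lowering `max.poll.records` (here to roughly 187) or raising `max.poll.interval.ms` brings the batch back inside the budget. Making per-record processing faster or moving it off the polling thread is the more structural fix.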

Recap

Every partition incident is read in code, config, and logs. The producer key decides what stays ordered, so key by the entity, not the event type. Because the partition is hash(key) % N, raising N reroutes a share of live keys, and Kafka cannot reduce a topic's partition count afterwards, so the change can never be undone. An eager assignor plus a missing group.instance.id turns every deploy into a stop-the-world rebalance. A join/revoke loop that ends in a poll timeout is a rebalance storm caused by slow processing, not a broker or partition problem. Read the key and the config first, fix the structural cause, then confirm against lag and rebalance metrics.
