Crux Read real producer code, consumer config, partition math, and a rebalance log, predict the behaviour, and pick the highest-leverage fix.
◷ 14 min
Partition bugs are diagnosed in producer code, consumer config, and rebalance logs — not in prose. Read each snippet, predict what Kafka does with it, and choose the fix a senior engineer would make first.
Goal
Practise the loop you run in every Kafka incident: read the producer key, the consumer config, the partition math, and the rebalance log, then reach for the fix that respects ordering and parallelism rather than papering over it.
Snippet 1 — the producer key
// Order events: created, paid, shipped, cancelled all flow through here
ProducerRecord<String, OrderEvent> record =
    new ProducerRecord<>("order-events", event.getType(), event);
//                                       ^^^^^^^^^^^^^^^
//                                       key = event TYPE, not order id
producer.send(record);
Quiz
A consumer must process each order's events in order (created before cancelled). With this keying, what happens, and what is the fix?
Heads-up Order is guaranteed only within a partition, never across a topic. Keying by type sends 'created' and 'cancelled' for the same order to different partitions, which have no order between them.
Heads-up Hard-coding a partition would force everything onto one partition (no parallelism) and still wouldn't tie an order's events together correctly. The fix is to key by the entity (orderId), letting the partitioner co-locate it.
Heads-up acks=all is a durability setting about replica acknowledgement; it has nothing to do with which partition a record lands on. The ordering break is caused by the key choice.
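The effect of the key choice can be sketched offline. This is a minimal illustration, not the Kafka partitioner: `zlib.crc32` stands in for Kafka's murmur2 hash, and the 6-partition count is an assumption. It shows that keying by order id always puts one order's events on a single partition, while keying by event type can use as many partitions as there are distinct types.

```python
import zlib

NUM_PARTITIONS = 6  # assumed partition count for the illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # CRC32 is a stand-in for Kafka's murmur2; any stable hash shows the effect.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# All four events belong to the same order.
events = [
    ("A-4711", "created"),
    ("A-4711", "paid"),
    ("A-4711", "shipped"),
    ("A-4711", "cancelled"),
]

# Keyed by event type: one hash per distinct type, so up to four partitions.
partitions_by_type = {partition_for(event_type) for _, event_type in events}

# Keyed by order id: one hash for the whole order, so exactly one partition.
partitions_by_order = {partition_for(order_id) for order_id, _ in events}

print("keyed by type :", sorted(partitions_by_type))
print("keyed by order:", sorted(partitions_by_order))
```

In the Java producer, the equivalent change would be passing the order id as the record key in place of `event.getType()`; the accessor name for the order id is not shown in the snippet, so it is left unspecified here.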
Snippet 2 — the partition math
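# Default murmur2-style partitioner: partition = hash(key) % N
# (hash() stands in for murmur2 here; Python salts str hashes per process,
#  so the results below are illustrative rather than reproducible)
def partition_for(key, N):
    return hash(key) % N

# orderId "A-4711" before and after a partition increase
partition_for("A-4711", 6)   # -> 2
partition_for("A-4711", 12)  # -> 8  # same key, different partition!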
Quiz
A topic was raised from 6 to 12 partitions while live. Reading this math, what is the consequence for key A-4711, and why can't you undo it?
Heads-up Existing records are never relocated on a partition increase. Only future writes use the new modulo, which is exactly what splits the key across partition 2 and partition 8.
Heads-up hash(key) is stable, but the partition is hash(key) % N, and N changed from 6 to 12. When N doubles, each key either stays put or moves up by the old N, so roughly half of all keys change partition; here A-4711 moves from 2 to 8.
Heads-up Partition count can only increase, never decrease. Kafka refuses to merge partitions because there is no correct way to interleave two independent offset timelines.
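The modulo arithmetic can be checked with a reproducible stand-in hash. The sketch below uses the first four bytes of MD5 in place of Kafka's murmur2 (an assumption made purely so the numbers are stable across runs), with hypothetical keys `A-0` to `A-9999`. It verifies that going from 6 to 12 partitions leaves each key either where it was or exactly 6 partitions higher, and measures how many keys move.

```python
import hashlib

OLD_N, NEW_N = 6, 12


def stable_hash(key: str) -> int:
    # Stand-in for murmur2: a stable, well-mixed 32-bit value per key.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


def partition_for(key: str, num_partitions: int) -> int:
    return stable_hash(key) % num_partitions


keys = [f"A-{i}" for i in range(10_000)]  # hypothetical order ids
pairs = [(partition_for(k, OLD_N), partition_for(k, NEW_N)) for k in keys]

# Because 12 is a multiple of 6, h % 12 is either h % 6 or h % 6 + 6.
stays_or_shifts = all(new in (old, old + OLD_N) for old, new in pairs)

# Fraction of keys whose future writes land on a different partition.
moved_fraction = sum(1 for old, new in pairs if old != new) / len(pairs)

print("every key stays or shifts by 6:", stays_or_shifts)
print("fraction of keys rerouted:", round(moved_fraction, 3))
```

Records already written under the old count stay where they are, so every rerouted key ends up with history on one partition and new writes on another.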
Snippet 3 — the consumer config
# Consumer group: payments-processor, 4 instances behind a rolling deploy
group.id=payments-processor
session.timeout.ms=10000
heartbeat.interval.ms=3000
# group.instance.id is NOT set
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor
Quiz
Every rolling deploy causes a multi-second consumption stall across the whole group. Reading this config, what is the cause, and what is the lowest-risk change?
Heads-up Lowering the session timeout makes the group declare members dead faster, causing MORE rebalances, not fewer. Static membership (a stable group.instance.id per instance) lets a restart rejoin as the same member without a rebalance, provided it returns within session.timeout.ms.
Heads-up A 3s heartbeat at a 10s session timeout is a normal ratio and is not the cause of the deploy stalls. The stall is the eager rebalance revoking all partitions on each restart.
Heads-up CooperativeStickyAssignor exists precisely to avoid stop-the-world revocation by moving only the partitions that must change. Combined with static membership, deploy-time rebalances mostly disappear.
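The two structural problems in this config can be expressed as a small check. This is a hypothetical lint written for illustration, not a Kafka tool: `deploy_rebalance_risks` and the instance id `payments-processor-0` are made-up names, and treating an unset strategy as eager is a simplifying assumption. The property names and assignor class names are the ones from the snippet and the feedback above.

```python
COOPERATIVE = "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"


def deploy_rebalance_risks(config: dict) -> list:
    """Return the deploy-time rebalance risks found in a consumer config."""
    risks = []
    if not config.get("group.instance.id"):
        risks.append("no group.instance.id: each restart rejoins as a new member")
    strategy = config.get("partition.assignment.strategy", "")
    assignors = [name.strip() for name in strategy.split(",") if name.strip()]
    # Assumption: the group is only cooperative when every listed assignor is.
    if not assignors or any(name != COOPERATIVE for name in assignors):
        risks.append("eager assignor: each rebalance revokes every partition")
    return risks


current = {
    "group.id": "payments-processor",
    "session.timeout.ms": "10000",
    "heartbeat.interval.ms": "3000",
    "partition.assignment.strategy": "org.apache.kafka.clients.consumer.RangeAssignor",
}

proposed = dict(current)
# Hypothetical value: the id must be unique per instance and stable across restarts.
proposed["group.instance.id"] = "payments-processor-0"
proposed["partition.assignment.strategy"] = COOPERATIVE

print(deploy_rebalance_risks(current))
print(deploy_rebalance_risks(proposed))
```

On a live group, the move from an eager to a cooperative assignor is usually done in two rolling bounces (list both assignors first, then drop the eager one), so the `proposed` config is the end state rather than a single-step change. With static membership it may also be worth checking that a restart reliably completes inside `session.timeout.ms`.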
Snippet 4 — the rebalance log
[Consumer clientId=c3, groupId=payments] (Re-)joining group
[Consumer clientId=c3, groupId=payments] Lost previously assigned partitions order-events-2, order-events-5
[Consumer clientId=c3] Revoke previously assigned partitions order-events-2, order-events-5
... 7 such cycles in 90 seconds ...
[Consumer clientId=c3] Member c3 sending LeaveGroup request due to consumer poll timeout has expired
Quiz
Reading this log — repeated join/revoke cycles ending in a poll-timeout LeaveGroup — what is the failure mode, and what is the first fix?
Heads-up The consumer is reaching the group and being assigned partitions each cycle — the log shows successful joins. The problem is it can't poll again in time, not connectivity.
Heads-up A partition is owned by exactly one consumer at a time; there is no co-ownership to fight over. The revokes are the rebalance protocol reacting to c3's repeated poll-timeout eviction.
Heads-up Partition count is unrelated to a poll-timeout eviction loop. The consumer is being kicked for slow processing between polls, so the fix is in poll cadence / batch size, not the topic.
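The poll-timeout loop comes down to a budget: the time spent processing one batch must fit inside `max.poll.interval.ms`. The sketch below works through that arithmetic using the Java client defaults (`max.poll.records=500`, `max.poll.interval.ms=300000`). The 800 ms per-record cost, the 50% headroom, and the function names are assumptions for illustration; the log above does not state the real processing time.

```python
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000  # Java client default
DEFAULT_MAX_POLL_RECORDS = 500          # Java client default


def poll_budget(max_poll_records: int, per_record_ms: float,
                max_poll_interval_ms: int = DEFAULT_MAX_POLL_INTERVAL_MS) -> dict:
    """Worst-case time between two poll() calls for a full batch."""
    worst_case_ms = max_poll_records * per_record_ms
    return {
        "worst_case_ms": worst_case_ms,
        # The consumer leaves the group if the next poll() comes too late.
        "evicted": worst_case_ms > max_poll_interval_ms,
    }


def safe_max_poll_records(per_record_ms: float,
                          max_poll_interval_ms: int = DEFAULT_MAX_POLL_INTERVAL_MS,
                          headroom: float = 0.5) -> int:
    """Largest batch that uses at most `headroom` of the poll interval."""
    return max(1, int(max_poll_interval_ms * headroom / per_record_ms))


PER_RECORD_MS = 800  # assumed processing cost per record

before = poll_budget(DEFAULT_MAX_POLL_RECORDS, PER_RECORD_MS)
smaller_batch = safe_max_poll_records(PER_RECORD_MS)
after = poll_budget(smaller_batch, PER_RECORD_MS)

print("default batch :", before)
print("suggested max.poll.records:", smaller_batch)
print("smaller batch :", after)
```

Under these assumed numbers a full default batch takes 400 seconds against a 300-second limit, so the consumer is evicted on every cycle. Lowering `max.poll.records` or raising `max.poll.interval.ms` both restore the budget; a smaller batch keeps failure detection fast, while a longer interval tolerates slow records at the cost of slower detection of a stuck consumer.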
Recap
Every partition incident is read in code, config, and logs: the producer key decides what stays ordered (key by the entity, not the event type); hash(key) % N means a partition increase reroutes live keys and can never be undone; an eager assignor plus a missing group.instance.id makes every deploy a stop-the-world rebalance; and a join/revoke loop ending in poll-timeout is a rebalance storm from slow processing, not a broker or partition problem. Read the key and the config first, fix the structural cause, then confirm against lag and rebalance metrics.