Backend Architecture
Idempotency and retries: build an effectively-once payment path
Reading about effectively-once is not the same as proving your service never double-charges. Build a small payment path end to end — idempotent endpoint, retrying client, outbox relay, idempotent consumer — then attack it with duplicate requests, a crashed relay, and a network that eats responses, and show the side effect happens exactly once every time.
Turn the unit’s mental model into a working system: a server-side idempotency state machine, a client that retries with full jitter, an outbox + inbox that closes the dual-write gap, and the observability that lets you assert effectively-once with evidence rather than belief.
Build a POST /charge service plus a downstream inventory/ledger consumer that together deliver effectively-once behaviour, then prove — under injected duplicate requests, concurrency, a lost response, and a relay crash — that the money-moving side effect fires exactly once per logical operation.
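One way to picture the server-side state machine is the sketch below. It is a minimal in-memory illustration, not a production design: `IdempotencyStore`, `Record`, `fingerprint`, and `handle_charge` are hypothetical names, the lock stands in for a database unique constraint on the key, and a failed `charge` call would leave the key stuck "in progress" (expiry and recovery are left out).

```python
from __future__ import annotations

import hashlib
import json
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    fingerprint: str                 # hash of the body that first used this key
    status: str                      # "in_progress" or "done"
    response: Optional[dict] = None  # stored so a retry can be replayed


class IdempotencyStore:
    """In-memory stand-in for a table with a unique constraint on the key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: dict[str, Record] = {}

    def begin(self, key: str, fp: str) -> tuple[str, Optional[Record]]:
        # Atomic check-and-create: only one caller can ever see "new" for a key.
        with self._lock:
            rec = self._records.get(key)
            if rec is None:
                self._records[key] = Record(fp, "in_progress")
                return "new", None
            if rec.fingerprint != fp:
                return "mismatch", rec
            return ("replay" if rec.status == "done" else "in_progress"), rec

    def complete(self, key: str, response: dict) -> None:
        with self._lock:
            rec = self._records[key]
            rec.status, rec.response = "done", response


def fingerprint(body: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def handle_charge(store: IdempotencyStore, key: str, body: dict,
                  charge: Callable[[dict], dict]) -> tuple[int, dict]:
    outcome, rec = store.begin(key, fingerprint(body))
    if outcome == "mismatch":
        return 422, {"error": "idempotency key reused with a different body"}
    if outcome == "in_progress":
        return 409, {"error": "a request with this key is still in progress"}
    if outcome == "replay":
        return 200, rec.response      # same answer as the first time, no new charge
    response = charge(body)           # the money-moving side effect runs only here
    store.complete(key, response)
    return 200, response
```

The important property is that the "new" decision and the record creation happen in one atomic step; in a real service that would typically be an insert guarded by a unique index rather than a process-local lock.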
- Under 1000 logical charge operations with duplicates and concurrency injected, the applied-side-effect count equals exactly 1000 — no double charge, no lost charge — shown from the metrics, not asserted.
- A concurrency test fires two simultaneous requests with the same key and shows that exactly one is processed (200) while the other gets a 409 or a replay of the stored response, never a second charge.
- A relay-crash test (kill after publish, before marking published) shows the event re-published on restart and the consumer's inbox dedup keeping the business effect at one application.
- A key-reuse-with-changed-body test returns 422.
- A transcript shows the retrying client honoured Retry-After and used full jitter (delays are spread, not synchronised).
- Add a two-tier cache: a Redis SETNX hot path backed by the Postgres ledger, and show that a simulated Redis key loss falls through to Postgres and still deduplicates.
- Add a fleet retry budget (token bucket per call site) plus a dead-letter queue for poison messages after N attempts, and show a downstream outage does not amplify into a retry storm.
- Add a one-page on-call runbook: triage from the four metrics, what a 422 spike vs an attempt-3 surge vs growing outbox lag each means, and the fix for each.
- Rate-limit the relay's catch-up publish rate and show that a large outbox backlog (e.g. 18k rows after a broker outage) flushes without a self-inflicted thundering herd on the consumer.
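The client-side items in the lists above (full jitter, honouring Retry-After, and a per-call-site retry budget) can be sketched roughly as follows. Everything here is illustrative: `full_jitter_delay`, `RetryBudget`, and `call_with_retries` are hypothetical names, the `send` callable is assumed to return `(status, retry_after_seconds_or_None, body)`, and treating 409, 429, and 503 as retryable is an assumption rather than something the brief specifies. The caller is expected to reuse the same idempotency key on every attempt.

```python
from __future__ import annotations

import random
import time
from typing import Callable, Optional

_RNG = random.Random()
RETRYABLE = {409, 429, 503}  # assumption: which statuses are worth retrying


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                      rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    if rng is None:
        rng = _RNG
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class RetryBudget:
    """Token bucket for one call site: each retry spends one token."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_spend(self) -> bool:
        now = self._clock()
        refill = (now - self._last) * self.refill_per_sec
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def call_with_retries(send: Callable[[], tuple], budget: RetryBudget,
                      max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> tuple:
    """Returns (status, body) from the last attempt made."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE or last_attempt:
            return status, body
        if not budget.try_spend():
            return status, body  # budget exhausted: fail fast, do not pile on
        delay = full_jitter_delay(attempt, rng=rng)
        if retry_after is not None:
            delay = max(delay, retry_after)  # never retry sooner than the server asked
        sleep(delay)
    return status, body  # not reached; the loop always returns
```

Injecting `sleep`, `clock`, and `rng` keeps the behaviour testable: a test can record the requested delays to produce the transcript the acceptance criteria ask for, without actually waiting.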
This is the system behind every payment API that does not double-charge: an idempotent endpoint resolving each key to new/409/replay/422 with atomic creation, a client retrying with full jitter under a budget, an outbox that makes the dual-write atomic, an inbox that dedups on the consumer, and metrics that let you assert the side-effect count equals the operation count. Building it once — and breaking it on purpose with duplicates, concurrency, lost responses, and a crashed relay — is what turns effectively-once from a phrase into something you can defend in a postmortem.