AI / LLM Integration
Streaming: build a robust streaming endpoint and client
Reading about time to first token (TTFT), partial JSON, and buffering proxies is not the same as shipping a stream that survives a slow client, a Stop button, a dropped connection, and a default nginx configuration. Build the endpoint and client end-to-end, measure the latency that matters, and prove it holds under each failure.
Turn the unit’s mental model into a working system: relay a provider’s SSE stream through your own endpoint, accumulate deltas correctly (text and tool args), measure TTFT and TPOT (time per output token), cancel cleanly all the way to the upstream, recover from errors, and verify that a real proxy doesn’t buffer the stream to death.
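The two latency figures can be captured with a small meter that is fed a timestamp for every received token. This is a minimal sketch, not a provider API: `LatencyMeter` and `StreamMetrics` are hypothetical names, timestamps are passed in explicitly (for example from `performance.now()`), and it assumes one call per token. Providers often pack several tokens into one delta, so if you count deltas instead, treat the TPOT figure as approximate or divide by the provider's reported output-token count.

```typescript
// Hypothetical helper types for this lab; not part of any SDK.
interface StreamMetrics {
  ttftMs: number | null; // request start -> first token
  tpotMs: number | null; // mean gap between tokens after the first
  totalMs: number;       // request start -> stream end
  tokens: number;
}

class LatencyMeter {
  private startedAt: number;
  private firstAt: number | null = null;
  private lastAt: number | null = null;
  private tokens = 0;

  constructor(startedAt: number) {
    this.startedAt = startedAt;
  }

  // Call once per received token (assumption: one delta == one token).
  onToken(now: number): void {
    if (this.firstAt === null) this.firstAt = now;
    this.lastAt = now;
    this.tokens += 1;
  }

  // Call when the stream ends, whether normally or by cancellation.
  finish(now: number): StreamMetrics {
    const ttftMs = this.firstAt === null ? null : this.firstAt - this.startedAt;
    const tpotMs =
      this.firstAt !== null && this.lastAt !== null && this.tokens > 1
        ? (this.lastAt - this.firstAt) / (this.tokens - 1)
        : null;
    return { ttftMs, tpotMs, totalMs: now - this.startedAt, tokens: this.tokens };
  }
}
```

Measure on the client side of the proxy as well as on the server: if client TTFT is close to `totalMs` while server TTFT is small, something between them is buffering.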
Build a streaming chat endpoint (server) and a client that together relay an LLM provider's SSE stream, measure TTFT/TPOT, support mid-stream cancellation that propagates to the upstream, recover from errors and dropped connections, and are verified to stream incrementally through a real reverse proxy, not just on localhost.
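Cancellation only reaches the upstream if the client's disconnect is wired to the upstream request. One possible shape, assuming a Node server and an upstream call that accepts an `AbortSignal` (as `fetch` does): `abortOnClientClose` and `ClientConnection` are hypothetical names for this sketch, and the usage comment shows where a real `res` would plug in.

```typescript
// Structural subset of Node's http.ServerResponse that this sketch relies on.
interface ClientConnection {
  on(event: "close", listener: () => void): unknown;
}

// Returns a controller whose signal fires when the client connection closes
// before the response was finished (Stop button, closed tab, dropped network).
// `isFinished` reports whether the server already ended the response normally,
// so a normal completion does not trigger a pointless abort.
function abortOnClientClose(
  client: ClientConnection,
  isFinished: () => boolean,
): AbortController {
  const controller = new AbortController();
  client.on("close", () => {
    if (!isFinished()) controller.abort();
  });
  return controller;
}

// Usage inside a Node http handler (sketch; PROVIDER_URL is a placeholder):
//   const controller = abortOnClientClose(res, () => res.writableEnded);
//   const upstream = await fetch(PROVIDER_URL, {
//     method: "POST",
//     body,
//     signal: controller.signal, // aborting closes the upstream connection
//   });
```

On the browser side the Stop button would call `abort()` on the controller passed to its own `fetch`, which closes the connection and triggers the `close` event above. Whether the provider stops generating (and billing) when its connection closes is something to confirm in the provider dashboard, as the acceptance criteria require.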
- A demo showing tokens rendering incrementally end-to-end through the proxy, with measured TTFT in the hundreds of ms (not equal to total time) and a TPOT figure — captured numbers, not estimates.
- A tool call whose arguments stream as fragments and parse exactly once at content_block_stop; show a log line proving no intermediate JSON.parse was attempted and the final args object is complete.
- Clicking Stop visibly halts tokens AND a server log / provider dashboard confirms the upstream generation aborted (token count stops climbing) — not just a frozen UI.
- A before/after capture of the buffering proxy: with proxy_buffering on, one all-at-once flush after a long pause; with the fix, incremental delivery. Include a one-paragraph note naming the exact config that caused and cured it.
- Add server backpressure handling: when res.write() returns false, pause the upstream stream and resume on 'drain', and prove memory stays bounded against an artificially slow client (throttled consumer).
- Add a heartbeat/keepalive comment frame (': ping') on an interval so idle-timeout proxies and load balancers don't kill a slow-to-first-token reasoning request.
- Add a reasoning-model path: stream thinking/reasoning summaries (or a progress state) so a 10–60s TTFT never shows a bare frozen spinner, and measure the perceived-latency difference.
- Add an on-call runbook page: the 'works on localhost, batches in prod' triage, the proxy/CDN/serverless buffering checklist, the empty-tool-args (lost input_json_delta) diagnosis, and the cancellation/idempotency contract.
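The backpressure and heartbeat items, plus the response headers that the proxy check depends on, can be sketched together. This assumes a push-style upstream with `pause()`/`resume()` (the shape of a Node `Readable`) and a sink with `write()`/`'drain'` (the shape of a Node `ServerResponse`). The interfaces and function names are hypothetical; `X-Accel-Buffering: no` is the response header nginx honours to skip proxy buffering for a single response.

```typescript
// Structural subsets of Node's Readable and ServerResponse (assumed shapes).
interface PausableSource {
  on(event: "data", listener: (chunk: string) => void): unknown;
  pause(): unknown;
  resume(): unknown;
}
interface SseSink {
  write(chunk: string): boolean;
  once(event: "drain", listener: () => void): unknown;
}

// Headers for the streaming response.
const sseHeaders: Record<string, string> = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  "X-Accel-Buffering": "no", // asks nginx not to buffer this response
};

// Forward chunks; when the sink's buffer is full (write() returns false),
// stop reading from the upstream until the sink emits 'drain'.
function relayWithBackpressure(source: PausableSource, sink: SseSink): void {
  source.on("data", (chunk) => {
    const hasRoom = sink.write(chunk);
    if (!hasRoom) {
      source.pause();
      sink.once("drain", () => source.resume());
    }
  });
}

// Send an SSE comment frame on an interval so idle-timeout proxies keep the
// connection open while the model is slow to produce its first token.
// Returns a function that stops the heartbeat; call it when the stream ends.
function startHeartbeat(sink: SseSink, intervalMs: number): () => void {
  const timer = setInterval(() => {
    sink.write(": ping\n\n");
  }, intervalMs);
  return () => clearInterval(timer);
}
```

To prove the memory bound, throttle the consumer and watch the process's resident memory while the stream runs; without the `pause()` call the unread chunks accumulate in the response's write buffer.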
This is the loop you will run on every real streaming feature: relay SSE correctly (right headers, buffer-and-split parsing, delta accumulation, tool args parsed only at content_block_stop), measure the latency that actually matters (TTFT and TPOT, not total time), cancel all the way to the upstream so a Stop button stops billing, recover from drops with a safe whole-turn retry, and verify against a real proxy because the number-one production failure — buffering that turns TTFT back into total time — never shows up on localhost. Build it once on a toy chat app and the production version becomes muscle memory.
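The parsing half of that loop, buffer-and-split plus parse-once tool arguments, can be sketched as follows. The event and field names (`content_block_delta`, `input_json_delta`, `partial_json`, `content_block_stop`, `index`) are assumed to follow the Anthropic-style stream this lab refers to; `SseParser` and `ToolArgsAccumulator` are hypothetical names. The sketch assumes LF or CRLF line endings and skips fields such as `id` and `retry`.

```typescript
interface SseEvent { event: string; data: string; }
interface CompletedToolCall { index: number; args: unknown; }

class SseParser {
  private buffer = "";

  // Feed one raw network chunk; returns only the frames it completed.
  // A chunk may end mid-frame, so the remainder stays in the buffer.
  push(chunk: string): SseEvent[] {
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const events: SseEvent[] = [];
    let sep = this.buffer.indexOf("\n\n");
    while (sep !== -1) {
      const frame = this.buffer.slice(0, sep);
      this.buffer = this.buffer.slice(sep + 2);
      let event = "message";
      const dataLines: string[] = [];
      for (const line of frame.split("\n")) {
        if (line.startsWith(":")) continue; // comment frame, e.g. ": ping"
        const colon = line.indexOf(":");
        const field = colon === -1 ? line : line.slice(0, colon);
        let value = colon === -1 ? "" : line.slice(colon + 1);
        if (value.startsWith(" ")) value = value.slice(1);
        if (field === "event") event = value;
        else if (field === "data") dataLines.push(value);
      }
      if (dataLines.length > 0) events.push({ event, data: dataLines.join("\n") });
      sep = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}

class ToolArgsAccumulator {
  private fragments = new Map<number, string>();

  // Each SSE data payload is complete JSON; only partial_json is a fragment.
  // Fragments are concatenated as strings and parsed once, at block stop.
  onEvent(ev: SseEvent): CompletedToolCall | null {
    const payload = JSON.parse(ev.data);
    if (payload.type === "content_block_delta" && payload.delta?.type === "input_json_delta") {
      const soFar = this.fragments.get(payload.index) ?? "";
      this.fragments.set(payload.index, soFar + payload.delta.partial_json);
      return null;
    }
    if (payload.type === "content_block_stop" && this.fragments.has(payload.index)) {
      const raw = this.fragments.get(payload.index) ?? "";
      this.fragments.delete(payload.index);
      // Assumption: an empty argument string means "no arguments".
      return { index: payload.index, args: JSON.parse(raw === "" ? "{}" : raw) };
    }
    return null;
  }
}
```

Keying fragments by `index` keeps interleaved content blocks apart. Logging inside the `content_block_stop` branch gives the single log line the acceptance criteria ask for.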