
AI / LLM Integration

Streaming: build a robust streaming endpoint and client

Crux: a hands-on project. Build a robust streaming endpoint and client end-to-end: SSE relay, TTFT/TPOT measurement, mid-stream cancellation, error recovery, and a buffering-proxy deployment check.
Level: senior
Estimated time: 240 min

Reading about TTFT, partial JSON, and buffering proxies is not the same as shipping a stream that survives a slow client, a Stop button, a dropped connection, and a default nginx. Build the endpoint and client end-to-end, measure the latency that matters, and prove it holds under each failure.

Goal

Turn the unit’s mental model into a working system: relay a provider’s SSE stream through your own endpoint, accumulate deltas correctly (text and tool args), measure TTFT and TPOT, cancel cleanly all the way to the upstream, recover from errors, and verify a real proxy doesn’t buffer the stream to death.
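
The relay and accumulation steps both depend on parsing SSE frames that can arrive split across network chunks. A minimal buffer-and-split sketch follows; `SseParser`, `SseEvent`, and `parseFrame` are hypothetical names for illustration, not part of any provider SDK, and the sketch assumes `\n` or `\r\n` line endings (a lone `\r` terminator is not handled).

```typescript
// Hypothetical helper types and names, for illustration only.
interface SseEvent {
  event: string; // "message" when the frame has no event: field
  data: string; // multiple data: lines are joined with "\n"
}

// Parse one complete frame (the text between two blank lines).
function parseFrame(frame: string): SseEvent | null {
  let event = "message";
  const dataLines: string[] = [];
  for (const line of frame.split("\n")) {
    // Lines starting with ":" are comments, e.g. heartbeat frames.
    if (line === "" || line[0] === ":") continue;
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value[0] === " ") value = value.slice(1); // strip one leading space
    if (field === "event") event = value;
    else if (field === "data") dataLines.push(value);
  }
  // A frame with no data: line (comment-only, for example) yields no event.
  return dataLines.length > 0 ? { event, data: dataLines.join("\n") } : null;
}

class SseParser {
  private buffer = "";

  // Feed one decoded chunk; returns only the events completed by it.
  push(chunk: string): SseEvent[] {
    // Normalize on the whole buffer so a "\r\n" split across chunks is caught.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const events: SseEvent[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const frame = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2); // keep the incomplete tail
      const parsed = parseFrame(frame);
      if (parsed !== null) events.push(parsed);
      boundary = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}
```

When the chunks come from a byte stream, decoding them with `new TextDecoder().decode(bytes, { stream: true })` before calling `push` avoids corrupting a multi-byte character that straddles two chunks.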

Project
Objective

Build a streaming chat endpoint (server) and client that relays an LLM provider's SSE stream, measures TTFT/TPOT, supports mid-stream cancellation that propagates to the upstream, recovers from errors and drops, and is verified to stream incrementally through a real reverse proxy — not just localhost.
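
One way to capture TTFT and TPOT is a small meter that records the request start, the first token, and the last token. The sketch below is an assumption-laden illustration: `StreamLatencyMeter` is a hypothetical name, the clock is injected so the arithmetic can be checked, and it treats each delta as one token, which is only an approximation for providers that batch several tokens per delta.

```typescript
// Hypothetical helper for illustration; not part of any SDK.
interface LatencyReport {
  ttftMs: number | null; // time to first token
  tpotMs: number | null; // average time per output token after the first
  totalMs: number;
  tokens: number;
}

class StreamLatencyMeter {
  private readonly now: () => number;
  private readonly startedAt: number;
  private firstTokenAt: number | null = null;
  private lastTokenAt: number | null = null;
  private tokens = 0;

  // Create the meter immediately before sending the request.
  constructor(now: () => number = Date.now) {
    this.now = now;
    this.startedAt = now();
  }

  // Call once per streamed token (assumption: one delta is one token).
  onToken(): void {
    const t = this.now();
    if (this.firstTokenAt === null) this.firstTokenAt = t;
    this.lastTokenAt = t;
    this.tokens += 1;
  }

  report(): LatencyReport {
    const totalMs = this.now() - this.startedAt;
    if (this.firstTokenAt === null || this.lastTokenAt === null) {
      return { ttftMs: null, tpotMs: null, totalMs, tokens: 0 };
    }
    const ttftMs = this.firstTokenAt - this.startedAt;
    // TPOT needs at least two tokens: it averages the gaps between them.
    const tpotMs =
      this.tokens > 1
        ? (this.lastTokenAt - this.firstTokenAt) / (this.tokens - 1)
        : null;
    return { ttftMs, tpotMs, totalMs, tokens: this.tokens };
  }
}
```

Measuring on the client, behind the proxy, is what exposes buffering: a buffered stream shows a TTFT close to `totalMs` and a TPOT close to zero.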

Requirements
Acceptance criteria
  • A demo showing tokens rendering incrementally end-to-end through the proxy, with measured TTFT in the hundreds of ms (not equal to total time) and a TPOT figure — captured numbers, not estimates.
  • A tool call whose arguments stream as fragments and parse exactly once at content_block_stop; show a log line proving no intermediate JSON.parse was attempted and the final args object is complete.
  • Clicking Stop visibly halts tokens AND a server log / provider dashboard confirms the upstream generation aborted (token count stops climbing) — not just a frozen UI.
  • A before/after capture of the buffering proxy: with the fix, incremental delivery; with proxy_buffering on, the all-at-once flush after a long pause — plus a one-paragraph note naming the exact config that caused and cured it.
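
The tool-call criterion above can be sketched as an accumulator that concatenates argument fragments and calls JSON.parse only when the block closes. The event shapes here are modeled on the content_block_start / content_block_delta / content_block_stop events named in this unit and should be checked against your provider's documentation; `ToolArgAccumulator` and `parseCount` are hypothetical names added for illustration.

```typescript
// Assumed event shapes; verify field names against your provider's docs.
type StreamEvent =
  | { type: "content_block_start"; index: number; content_block: { type: string; name?: string } }
  | { type: "content_block_delta"; index: number; delta: { type: string; partial_json?: string } }
  | { type: "content_block_stop"; index: number };

interface CompletedToolCall {
  name: string;
  args: unknown;
}

class ToolArgAccumulator {
  // Raw JSON text per content block index, never parsed while partial.
  private pending: Record<number, { name: string; json: string }> = {};
  // Exposed so a log line can prove there was exactly one parse per call.
  parseCount = 0;

  handle(event: StreamEvent): CompletedToolCall | null {
    if (event.type === "content_block_start") {
      if (event.content_block.type === "tool_use") {
        this.pending[event.index] = { name: event.content_block.name ?? "", json: "" };
      }
      return null;
    }
    if (event.type === "content_block_delta") {
      const block = this.pending[event.index];
      if (block && event.delta.type === "input_json_delta") {
        block.json += event.delta.partial_json ?? ""; // append only, no parsing
      }
      return null;
    }
    // content_block_stop: the fragments now form one complete JSON document.
    const block = this.pending[event.index];
    if (!block) return null; // a text block or an unknown index
    delete this.pending[event.index];
    this.parseCount += 1;
    // Assumption: a tool call with no arguments streams an empty string.
    const args: unknown = JSON.parse(block.json === "" ? "{}" : block.json);
    return { name: block.name, args };
  }
}
```

If `args` comes back as an empty object for a tool that requires arguments, the usual suspect is dropped input_json_delta events rather than a model error.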
Senior stretch
  • Add server backpressure handling: when res.write() returns false, pause the upstream stream and resume on 'drain', and prove memory stays bounded against an artificially slow client (throttled consumer).
  • Add a heartbeat/keepalive comment frame (': ping') on an interval so idle-timeout proxies and load balancers don't kill a slow-to-first-token reasoning request.
  • Add a reasoning-model path: stream thinking/reasoning summaries (or a progress state) so a 10–60s TTFT never shows a bare frozen spinner, and measure the perceived-latency difference.
  • Add an on-call runbook page: the 'works on localhost, batches in prod' triage, the proxy/CDN/serverless buffering checklist, the empty-tool-args (lost input_json_delta) diagnosis, and the cancellation/idempotency contract.
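
The first two stretch items can be sketched together. This is a minimal illustration, assuming a Node-style upstream that emits 'data' events and supports pause() and resume(), and a response whose write() returns false when its buffer is full. The interfaces and the names `relayWithBackpressure`, `startHeartbeat`, and `HEARTBEAT_FRAME` are hypothetical; Node's stream objects are expected to fit these shapes, but confirm that in your own setup.

```typescript
// Minimal structural types (assumptions) so the sketch has no imports.
interface PausableSource {
  on(event: "data", listener: (chunk: string) => void): unknown;
  pause(): unknown;
  resume(): unknown;
}

interface BackpressureSink {
  write(chunk: string): boolean; // false means "buffer full, wait for drain"
  once(event: "drain", listener: () => void): unknown;
}

// Forward chunks, pausing the upstream while the client cannot keep up.
function relayWithBackpressure(source: PausableSource, sink: BackpressureSink): void {
  source.on("data", (chunk) => {
    const hasRoom = sink.write(chunk);
    if (!hasRoom) {
      source.pause(); // stop reading so memory stays bounded
      sink.once("drain", () => {
        source.resume(); // the client caught up, continue reading
      });
    }
  });
}

// An SSE comment frame: clients ignore it, but it keeps the connection active.
const HEARTBEAT_FRAME = ": ping\n\n";

// Write a heartbeat on an interval; call the returned function to stop it.
function startHeartbeat(sink: { write(chunk: string): boolean }, intervalMs: number): () => void {
  const timer = setInterval(() => {
    sink.write(HEARTBEAT_FRAME);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Stop the heartbeat when the response ends or the client disconnects; otherwise the timer keeps writing to a closed connection. Choose an interval shorter than the shortest idle timeout on the path between server and client.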
Recap

This is the loop you will run on every real streaming feature:
  • Relay SSE correctly: right headers, buffer-and-split parsing, delta accumulation, and tool args parsed only at content_block_stop.
  • Measure the latency that actually matters: TTFT and TPOT, not total time.
  • Cancel all the way to the upstream so a Stop button stops billing.
  • Recover from drops with a safe whole-turn retry.
  • Verify against a real proxy, because the number-one production failure (buffering that turns TTFT back into total time) never shows up on localhost.

Build it once on a toy chat app and the production version becomes muscle memory.
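
Two pieces of that loop, the response headers and cancellation that reaches the upstream, can be sketched as follows. This is an illustration under assumptions: `SSE_RESPONSE_HEADERS` and `linkCancellation` are hypothetical names, the client object is assumed to emit 'close' the way a Node HTTP response does (both on disconnect and after a normal end), and the upstream handle is anything with an abort() method, such as an AbortController passed to the provider request.

```typescript
// Headers commonly used for an SSE response. X-Accel-Buffering is read by
// nginx and disables proxy buffering for this response only.
const SSE_RESPONSE_HEADERS: Record<string, string> = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache, no-transform",
  "X-Accel-Buffering": "no",
};

// Minimal structural types (assumptions) so the sketch has no imports.
interface ClientConnection {
  on(event: "close", listener: () => void): unknown;
}

interface Abortable {
  abort(): void;
}

// Abort the upstream generation when the client goes away mid-stream.
// Returns a function to call once the stream has finished normally, so the
// 'close' that follows a normal end does not trigger a pointless abort.
function linkCancellation(
  client: ClientConnection,
  upstream: Abortable,
  log: (message: string) => void = () => {},
): () => void {
  let settled = false;
  client.on("close", () => {
    if (settled) return; // already finished or already aborted
    settled = true;
    upstream.abort();
    log("client closed mid-stream: upstream aborted");
  });
  return () => {
    settled = true;
  };
}
```

The log line is what the acceptance check relies on: a frozen UI proves nothing, while an abort logged on the server plus a token count that stops climbing shows the upstream actually stopped.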

