AI / LLM Integration
Streaming LLM responses: SSE, partial tokens, and the proxy that eats them
The demo was flawless on a laptop: type a prompt, words appear instantly, the cursor races across the screen. Then it shipped behind the company’s nginx. Now every user stares at a spinner for nine seconds, then the whole answer slams in at once. Nothing in the app changed. The model still streamed token by token — but a proxy with proxy_buffering on quietly held the entire response in a buffer and flushed it only at the end. The streaming was real; the user never saw a byte of it.
Why stream at all: TTFT beats total time
A non-streamed completion makes the user wait for the entire generation. A 400-token answer at 60 tokens/sec is roughly seven seconds of blank screen, then everything at once. Streaming changes nothing about that total — it still takes seven seconds to finish — but it changes the only number the user feels: time-to-first-token (TTFT), the gap between sending the request and the first visible word.
For chat-style apps, a TTFT under ~1 second feels instant; production systems on hot paths target 200–500 ms. Once tokens start flowing, time-per-output-token (TPOT) governs the read: 50 tok/s feels sluggish, while 200 tok/s (~150 words/sec, assuming roughly 0.75 words per token) outruns any reader. Measured against the example above, replacing about seven seconds of blank screen with a first word after a few hundred milliseconds is roughly a 10–20x improvement in perceived responsiveness for zero change in actual compute. The senior framing: you are not making it faster, you are paying down latency with the user’s own reading speed.
Two failure shapes hide behind this. Reasoning models with chain-of-thought can sit at 10–150 seconds of TTFT before any visible token: there is nothing to stream while the model thinks, so a naive UI looks frozen. And if your transport buffers (covered in the reconnect and buffering section below), TTFT collapses back to total time and you have the worst of both worlds: streaming complexity, batch latency.
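The arithmetic in this section can be sketched in a few lines. This is a minimal illustration, not a measurement tool: the function name, the 0.4 s TTFT, and the simplification that total time equals TTFT plus tokens divided by rate are all assumptions made for the example.

```python
def first_visible_word_seconds(output_tokens, tokens_per_second,
                               ttft_seconds, streamed, path_buffers=False):
    """Seconds the user waits before anything appears on screen.

    Simplified model (an assumption): generation takes
    output_tokens / tokens_per_second after the first token is ready.
    """
    generation_seconds = output_tokens / tokens_per_second
    if streamed and not path_buffers:
        # The first token is painted as soon as it arrives.
        return ttft_seconds
    # Non-streamed, or streamed through a buffering hop:
    # nothing is visible until the whole response has been generated.
    return ttft_seconds + generation_seconds


# The 400-token answer at 60 tokens/sec, with an assumed 0.4 s TTFT.
for label, kwargs in [
    ("non-streamed", dict(streamed=False)),
    ("streamed", dict(streamed=True)),
    ("streamed, buffered path", dict(streamed=True, path_buffers=True)),
]:
    wait = first_visible_word_seconds(400, 60, 0.4, **kwargs)
    print(f"{label}: {wait:.2f} s before the first word")
```

The third case is the one to watch for: the code path is streaming, but the user's wait is identical to the non-streamed case.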
The SSE event protocol
Every major provider streams over Server-Sent Events (SSE): a long-lived HTTP response with Content-Type: text/event-stream, in which the server pushes events over one connection. Each event is a few field: value lines (typically event: and data:) terminated by a blank line. SSE is one-directional (server→client), survives ordinary HTTP infrastructure, and reconnects automatically when consumed through the browser’s EventSource API, which is exactly why it won over WebSockets for this job.
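To make the wire format concrete, here is a small sketch of an SSE line parser. It is deliberately simplified: it handles the event, data, and id fields and comment lines, ignores the retry field, and assumes the input has already been decoded and split into lines. The function name parse_sse is hypothetical.

```python
def parse_sse(lines):
    """Yield one dict per SSE event from an iterable of decoded lines."""
    event, data, last_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch it if it has data.
            if data:
                yield {"event": event, "data": "\n".join(data), "id": last_id}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive ping
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)  # multiple data lines are joined with newlines
        elif field == "id":
            last_id = value


stream = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta"}',
    "",
    ": ping",
    "data: first line",
    "data: second line",
    "",
]
for ev in parse_sse(stream):
    print(ev["event"], repr(ev["data"]))
```

In practice an SDK or HTTP client library usually does this parsing for you; the sketch is only meant to show what a frame looks like before it becomes a typed event.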
The frames are a typed lifecycle, not raw text. Anthropic’s sequence: message_start (envelope, empty content) → one or more content blocks, each content_block_start → many content_block_delta → content_block_stop → then message_delta (carries stop reason and final token usage) → message_stop. OpenAI’s shape differs in names but follows the same idea: a start, a stream of deltas, a terminal event. You never get the answer in one frame; you accumulate deltas into a snapshot.
| SSE event | Carries | What you do |
|---|---|---|
| message_start | Message id, role, empty content | Open the assistant bubble; start a TTFT timer |
| content_block_start | Block index + type (text / tool_use) | Decide: render as text, or buffer as tool args |
| content_block_delta | text_delta or input_json_delta | Append the chunk to your accumulator |
| content_block_stop | Block index | Now the block is complete; safe to parse tool JSON |
| message_delta / message_stop | Stop reason, final usage | Finalize; record tokens for billing/metrics |
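The lifecycle in the table can be folded into a snapshot with a single loop. The sketch below assumes events have already been decoded into dicts whose field names (index, content_block, delta, partial_json, usage) follow Anthropic's documented streaming shape as I understand it; treat those names as assumptions and check them against the provider's reference before relying on them. The function name accumulate is hypothetical.

```python
import json


def accumulate(events):
    """Fold a stream of Anthropic-style events into a final snapshot."""
    open_blocks = {}  # block index -> {"type", "name", "buffer"}
    snapshot = {"content": [], "stop_reason": None, "usage": None}
    for ev in events:
        kind = ev["type"]
        if kind == "content_block_start":
            block = ev["content_block"]
            open_blocks[ev["index"]] = {
                "type": block["type"],
                "name": block.get("name"),
                "buffer": "",
            }
        elif kind == "content_block_delta":
            delta = ev["delta"]
            if delta["type"] == "text_delta":
                open_blocks[ev["index"]]["buffer"] += delta.get("text", "")
            elif delta["type"] == "input_json_delta":
                open_blocks[ev["index"]]["buffer"] += delta.get("partial_json", "")
            # Other delta types are ignored in this sketch.
        elif kind == "content_block_stop":
            block = open_blocks.pop(ev["index"])
            if block["type"] == "tool_use":
                # Only now is the buffered JSON complete enough to parse.
                args = json.loads(block["buffer"] or "{}")
                snapshot["content"].append(
                    {"type": "tool_use", "name": block["name"], "input": args})
            else:
                snapshot["content"].append(
                    {"type": "text", "text": block["buffer"]})
        elif kind == "message_delta":
            snapshot["stop_reason"] = ev["delta"].get("stop_reason")
            snapshot["usage"] = ev.get("usage")
    return snapshot
```

A real UI would also paint each text delta as it arrives rather than waiting for the snapshot; this version keeps only the bookkeeping so the block boundaries are easy to see.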
Accumulating partial tokens — and the partial-JSON trap
Text is forgiving: each text_delta is a valid string fragment, so you append and render immediately. The accumulator is just text += delta and the UI paints as it goes.
Tool calls are not forgiving. When the model calls a function, the arguments stream as input_json_delta chunks, and each chunk is a fragment of a JSON document, not valid JSON on its own. Three consecutive deltas might be:
- {"city": "San Fran
- cisco", "unit":
- "celsius"}
If you call JSON.parse on any intermediate buffer, it throws a syntax error. The rule is hard: accumulate every input_json_delta into a string buffer and parse only after content_block_stop. You can show the partial string as a “calling tool…” indicator, but you cannot act on the arguments until the block closes.
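The trap is easy to reproduce. The sketch below splits the San Francisco arguments into three fragments and tries to parse the running buffer after each one, using Python's json.loads as a stand-in for JSON.parse. The helper name parse_outcomes is hypothetical.

```python
import json


def parse_outcomes(fragments):
    """Append each fragment to a buffer and record whether it parses yet."""
    buffer, outcomes = "", []
    for fragment in fragments:
        buffer += fragment
        try:
            json.loads(buffer)
            outcomes.append("ok")
        except json.JSONDecodeError:
            outcomes.append("incomplete")
    return buffer, outcomes


fragments = ['{"city": "San Fran', 'cisco", "unit":', ' "celsius"}']
buffer, outcomes = parse_outcomes(fragments)
print(outcomes)            # ['incomplete', 'incomplete', 'ok']
print(json.loads(buffer))  # {'city': 'San Francisco', 'unit': 'celsius'}
```

Probing the buffer like this is for illustration only. In real code the signal that the arguments are complete is content_block_stop, not a parse that happens to succeed.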
This is a real and recurring production bug. Proxy and adapter layers (LiteLLM has shipped several such regressions) sometimes drop or mishandle input_json_delta frames — the tool call arrives with input as an empty {} and your function runs with no arguments. If a tool call mysteriously fires with empty args, suspect a layer between you and the model swallowing the JSON deltas, not your own parser.
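One cheap defence is to refuse to run a tool whose required arguments are missing, so a swallowed delta surfaces as a loud error instead of a silent no-argument call. This is a sketch under assumptions: the function name and the idea of passing a plain list of required argument names are illustrative, not part of any provider SDK.

```python
def check_tool_args(args, required):
    """Return args unchanged, or raise if any required name is absent."""
    missing = [name for name in required if name not in args]
    if missing:
        raise ValueError(
            f"tool call is missing required arguments {missing}; "
            "if the model normally supplies them, check whether a proxy or "
            "adapter layer dropped the input_json_delta frames"
        )
    return args


# A complete call passes through untouched.
print(check_tool_args({"city": "San Francisco", "unit": "celsius"}, ["city"]))

# An empty {} is caught before the tool function runs.
try:
    check_tool_args({}, ["city"])
except ValueError as err:
    print("rejected:", err)
```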
Why this works
Why ship JSON args as un-parseable fragments at all? Because the model generates them token by token like any other text — there is no point at which the provider has “the whole arguments object” early. Streaming them lets your UI show progress and lets the provider start sending the moment the first token exists. The contract pushes the one-time parse cost to the block boundary, where the JSON is guaranteed complete.
Reconnect, resume, and the buffering proxy that kills it all
SSE is built to reconnect: the server can send an id: with each event, and on a dropped connection the browser reissues the request with a Last-Event-ID header so the server can resume after the last delivered event. In practice, mid-generation LLM resume is rare to implement (you would need to replay buffered deltas server-side); most apps treat a dropped stream as “retry the whole turn” and rely on idempotency. The honest senior default is: design for the stream dying at token 200 of 400, and make a fresh request safe to repeat.
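The "retry the whole turn" default might look like the sketch below. Everything here is an assumption made for illustration: open_stream is a hypothetical callable that starts a fresh request and yields text deltas, and the turn_id it receives only helps if your server uses it to deduplicate side effects.

```python
import uuid


def run_turn(open_stream, max_attempts=3):
    """Run one assistant turn, restarting from scratch if the stream dies."""
    turn_id = str(uuid.uuid4())  # same key on every attempt of this turn
    last_error = None
    for attempt in range(1, max_attempts + 1):
        text = ""  # partial output from a dead stream is thrown away
        try:
            for delta in open_stream(turn_id):
                text += delta
            return text, attempt
        except ConnectionError as err:
            last_error = err
    raise RuntimeError(f"stream failed {max_attempts} times") from last_error


# A fake stream that drops on its first attempt, to exercise the retry.
attempts_seen = []

def flaky_stream(turn_id):
    attempts_seen.append(turn_id)
    yield "Hel"
    if len(attempts_seen) == 1:
        raise ConnectionError("connection dropped mid-generation")
    yield "lo"

print(run_turn(flaky_stream))  # ('Hello', 2)
```

Discarding the partial text keeps the logic simple at the cost of regenerating tokens the user may already have seen; a UI would typically clear or replace the half-written bubble on retry.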
But the failure that bites hardest needs no dropped connection at all — it is buffering in the path. nginx ships with proxy_buffering on by default: it reads the upstream response into a buffer and forwards it to the client only when the buffer fills or the response ends. For a normal page that is an optimization; for SSE it is fatal — the client receives nothing until the model finishes, so TTFT becomes total time and your spinner hangs for the full generation. The same trap appears in serverless platforms and CDNs that buffer or impose response-size/time limits, and in any gzip layer that waits for enough bytes to compress.
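Buffering in the path leaves a recognisable timing signature: the first chunk arrives at almost the same moment as the last. The sketch below checks for that, assuming you record a monotonic timestamp when the request is sent and another as each chunk reaches the client. The function name and the 0.9 threshold are arbitrary choices for illustration.

```python
def looks_buffered(request_sent, chunk_times, threshold=0.9):
    """True if the first chunk arrived only near the end of the response.

    request_sent and chunk_times are timestamps in seconds from the same
    clock (for example time.monotonic()). Only meaningful for responses
    long enough that real streaming would have spread the chunks out.
    """
    if not chunk_times:
        raise ValueError("no chunks were received")
    time_to_first = chunk_times[0] - request_sent
    time_to_last = chunk_times[-1] - request_sent
    if time_to_last <= 0:
        return False
    return time_to_first / time_to_last >= threshold


# Healthy stream: first chunk at 0.4 s, last at 7.0 s.
print(looks_buffered(0.0, [0.4, 1.0, 3.0, 7.0]))  # False

# Buffered path: everything lands in a burst around 9 s.
print(looks_buffered(0.0, [8.9, 8.95, 9.0]))      # True
```

Running the same check against localhost and against staging makes the difference visible without reading any proxy configuration.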
The fixes are specific. On nginx: proxy_buffering off, proxy_http_version 1.1, an empty Connection header (proxy_set_header Connection ""), a long proxy_read_timeout, and gzip off on the streaming route. When you can’t touch the proxy config, set the response header X-Accel-Buffering: no (nginx honors it per-response) plus Cache-Control: no-cache. The diagnostic that saves hours: if streaming works against localhost but flushes all-at-once in staging, stop debugging your code, because something in the network path is buffering.
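As a configuration fragment, the nginx settings might be combined as follows. The route path, the upstream name, and the one-hour timeout are placeholders chosen for illustration; only the directives themselves come from the fixes described here.

```nginx
# Hypothetical streaming route; adjust the path and upstream to your setup.
location /api/chat/stream {
    proxy_pass http://app_backend;      # placeholder upstream name

    proxy_http_version 1.1;             # keep the upstream connection HTTP/1.1
    proxy_set_header Connection "";     # clear the Connection header

    proxy_buffering off;                # forward bytes as they arrive
    gzip off;                           # do not wait for bytes to compress

    proxy_read_timeout 3600s;           # placeholder: long enough for slow generations
}
```

Scoping these directives to the streaming route leaves buffering and compression enabled for ordinary pages, where they remain an optimization.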
Your chat app streams fine locally but in production every response appears all at once after a long pause. Pick the first fix to investigate.
A tool call streams its arguments as input_json_delta chunks. When is it safe to JSON.parse the accumulated buffer?
Why does streaming improve UX even though it doesn't reduce total generation time?
Order the SSE lifecycle for a single streamed message:
- 1 message_start — envelope arrives with empty content; start the TTFT timer
- 2 content_block_start — a text or tool_use block opens
- 3 content_block_delta (×N) — append each text_delta or input_json_delta to the accumulator
- 4 content_block_stop — block complete; now safe to JSON.parse tool args
- 5 message_delta then message_stop — stop reason + final usage; finalize
- 01 An app streams perfectly on localhost but in production every reply appears all at once after a long pause. Walk through the diagnosis and fix.
- 02 Why must tool-call arguments be accumulated before parsing, and what bug appears when a middle layer mishandles those deltas?
Streaming does not make generation faster — it pays down latency with the user’s reading speed by collapsing time-to-first-token from the full seven-second generation to a few hundred milliseconds, while time-per-output-token governs the read once tokens flow. The transport is SSE: a long-lived text/event-stream where the server pushes a typed lifecycle — message_start, content_block_start, a run of content_block_delta, content_block_stop, then message_delta and message_stop — and you accumulate deltas into a snapshot rather than ever receiving the answer whole. Text deltas are safe to render immediately; tool-call arguments arrive as input_json_delta fragments that are not valid JSON until the block closes, so you buffer and parse only at content_block_stop, and an empty-args tool call usually means a middle layer ate those deltas. SSE can reconnect via Last-Event-ID, but most apps treat a dropped stream as a repeatable whole-turn retry. The failure that erases every gain needs no dropped connection: a buffering reverse proxy or serverless layer (nginx proxy_buffering on by default) holds the whole response and flushes at the end, turning TTFT back into total time. The tell is “works on localhost, batches in prod,” and the fix lives in the path config — proxy_buffering off or X-Accel-Buffering: no, with compression disabled on the stream route.