AI / LLM Integration
Streaming: multiple-choice review
Six questions that cut across the whole unit. Each one mirrors a call you make in a real streaming incident: not a definition to recite, but a tradeoff to weigh when the spinner hangs in production.
Confirm you can connect the latency model, the SSE lifecycle, delta accumulation, the partial-JSON contract, and the buffering failure mode. This is the synthesis the lesson built toward.
A 400-token answer generates at 60 tok/s. Product asks you to 'make the response twice as fast' by enabling streaming. What do you tell them, in senior terms?
Your chat app streams flawlessly against localhost but in staging every reply appears all at once after a long pause. Where do you look first?
The model calls a tool; its arguments stream as input_json_delta chunks: '{"city": "San Fran', then 'cisco", "unit":', then ' "celsius"}'. When may you JSON.parse the buffer?
A tool call fires in production with input = {} (empty args), but the same prompt works locally against the provider directly. Most likely cause?
A connection drops at token 200 of a 400-token generation. What is the honest senior default for handling it?
A reasoning model with chain-of-thought sits at 30s TTFT before emitting any token; users report the app is 'frozen.' What is the correct read and fix?
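The numbers in the first question can be worked through directly. The sketch below is only arithmetic: the 0.5 s TTFT is an assumed figure for illustration (the question does not give one), and `latency_seconds` is a hypothetical helper, not part of any SDK.

```python
def latency_seconds(tokens: int, tok_per_s: float, ttft_s: float) -> dict:
    """Compare when the user first sees output with and without streaming."""
    generation_s = tokens / tok_per_s      # time spent emitting tokens
    total_s = ttft_s + generation_s        # identical in both modes
    return {
        "total_s": total_s,
        # Without streaming, nothing is visible until the whole answer exists.
        "first_visible_buffered_s": total_s,
        # With streaming, the first token is visible at TTFT.
        "first_visible_streamed_s": ttft_s,
    }


# 400 tokens at 60 tok/s, with an ASSUMED 0.5 s time to first token.
numbers = latency_seconds(tokens=400, tok_per_s=60, ttft_s=0.5)
for name, value in numbers.items():
    print(f"{name}: {value:.2f}")
```

Generation alone takes 400 / 60, about 6.7 s, in both modes. Streaming changes only when the first text becomes visible, not when the last token arrives.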
The unit’s through-line is one model: streaming cuts perceived latency by letting the user read while tokens arrive, so the wait they feel collapses to TTFT, while total time and token count are unchanged. SSE delivers a typed lifecycle (message_start → content_block_start → content_block_delta ×N → content_block_stop → message_delta → message_stop) that you accumulate; text deltas render immediately, but tool-call input_json_delta fragments are parseable only at content_block_stop. An empty-args tool call most likely means a middle layer ate the deltas; the default for a dropped stream is a whole-turn retry; and the failure that erases every gain, with no dropped connection required, is a buffering proxy that turns TTFT back into total time.
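The accumulation rule can be sketched in a few lines. This is a minimal illustration, not an SDK implementation: it assumes events have already been decoded from SSE into dicts carrying `type`, `index`, and `delta` fields (with `text` or `partial_json` inside the delta), and `accumulate` is a hypothetical helper name. Python's `json.loads` stands in for `JSON.parse`.

```python
import json


def accumulate(events):
    """Fold decoded stream events into final text and parsed tool inputs."""
    text_parts = {}    # block index -> list of text fragments
    json_parts = {}    # block index -> list of partial-JSON fragments
    tool_inputs = {}   # block index -> parsed tool arguments
    for event in events:
        kind = event["type"]
        if kind == "content_block_delta":
            idx, delta = event["index"], event["delta"]
            if delta["type"] == "text_delta":
                # Text is safe to render as soon as it arrives.
                text_parts.setdefault(idx, []).append(delta["text"])
            elif delta["type"] == "input_json_delta":
                # Fragments are NOT valid JSON on their own: buffer only.
                json_parts.setdefault(idx, []).append(delta["partial_json"])
        elif kind == "content_block_stop":
            idx = event["index"]
            if idx in json_parts:
                # Only now is the buffer a complete JSON document.
                tool_inputs[idx] = json.loads("".join(json_parts[idx]))
    text = "".join("".join(text_parts[i]) for i in sorted(text_parts))
    return text, tool_inputs


# The three fragments from the tool-call question above.
fragments = ['{"city": "San Fran', 'cisco", "unit":', ' "celsius"}']
events = [{"type": "content_block_start", "index": 0}]
events += [
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": piece}}
    for piece in fragments
]
events.append({"type": "content_block_stop", "index": 0})

text, tool_inputs = accumulate(events)
print(tool_inputs[0])
```

Calling `json.loads` on the first fragment alone raises `json.JSONDecodeError`, which is why the parse waits for content_block_stop. If a proxy or middleware drops the delta events, nothing is ever buffered for the block, which is one way a tool call can surface with empty arguments.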