AI / LLM Integration
Streaming: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the streaming mental model stick.
Reconstruct the unit’s core mechanisms — the TTFT latency model, the SSE lifecycle, delta accumulation, the partial-JSON tool contract, reconnect strategy, and the buffering failure mode — without looking back at the lesson.
- 01. Why does streaming improve UX when it does not reduce total generation time? Name the two latency metrics it trades on.
- 02. Walk through the SSE event lifecycle for one streamed message and say what you do at each stage.
- 03. Why must tool-call arguments be accumulated before parsing, and what production bug appears when a middle layer mishandles those deltas?
- 04. A stream drops at token 200 of 400. Compare full Last-Event-ID resume against the pragmatic default, and say which you'd ship.
- 05. Describe the number-one production failure for streaming, its signature, and the concrete fixes.
- 06. Why can a reasoning model with chain-of-thought make a streaming UI look frozen, and how do you handle it?
If you could reconstruct each answer from memory, you hold the unit’s spine:
- Streaming does not shorten total generation time; it cuts the wait to TTFT (time to first token), after which text arrives at the TPOT (time per output token) pace.
- SSE delivers a typed event lifecycle that you accumulate into a snapshot.
- Text deltas render immediately, but tool-arg input_json_delta fragments parse only at content_block_stop; empty args mean a middle layer ate the deltas.
- A dropped stream is, by default, handled as a repeatable whole-turn retry rather than a Last-Event-ID resume.
- Slow TTFT on a reasoning model needs a UX progress state, not a transport fix.
- The number-one production killer is a buffering proxy that turns TTFT back into total time; fix it in the path config, never the app.
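The accumulation rule in the summary above can be sketched in a few lines. This is a minimal illustration, not a client implementation: it assumes the SSE frames have already been decoded into dicts shaped like the events the unit names (content_block_start, content_block_delta, content_block_stop), and the helper name `accumulate`, the tool name `get_weather`, and the id `toolu_demo` are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List


def accumulate(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fold already-decoded stream events into finished content blocks."""
    open_blocks: Dict[int, Dict[str, Any]] = {}  # in-progress blocks, keyed by index
    done: List[Dict[str, Any]] = []

    for ev in events:
        kind = ev["type"]
        if kind == "content_block_start":
            block = dict(ev["content_block"])
            block["_buf"] = []  # raw text or JSON fragments, joined at stop
            open_blocks[ev["index"]] = block
        elif kind == "content_block_delta":
            block = open_blocks[ev["index"]]
            delta = ev["delta"]
            if delta["type"] == "text_delta":
                # Text is safe to render as soon as it arrives.
                block["_buf"].append(delta["text"])
            elif delta["type"] == "input_json_delta":
                # Fragments are not valid JSON on their own: buffer, do not parse.
                block["_buf"].append(delta["partial_json"])
        elif kind == "content_block_stop":
            block = open_blocks.pop(ev["index"])
            raw = "".join(block.pop("_buf"))
            if block["type"] == "tool_use":
                # Parse once, when the block is complete. An empty buffer here is
                # the "middle layer ate the deltas" symptom: args come out as {}.
                block["input"] = json.loads(raw) if raw else {}
            else:
                block["text"] = raw
            done.append(block)
    return done


# Hypothetical event sequence: one text block, then one tool call whose
# arguments arrive split mid-string across two deltas.
events = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Checking "}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "the weather."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_demo",
                       "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "Par'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'is"}'}},
    {"type": "content_block_stop", "index": 1},
]

blocks = accumulate(events)
```

Calling `json.loads` on either fragment alone would raise, which is why parsing waits for content_block_stop. If a proxy or wrapper drops the two input_json_delta events, the same code yields an empty `input` for the tool call, which is the signature worth alerting on.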