
Streaming: multiple-choice review

Multiple-choice synthesis across the streaming unit — TTFT vs total time, the SSE event lifecycle, delta accumulation, partial-JSON tool args, reconnect, and the buffering proxy.



Six questions that cut across the whole unit. Each one mirrors a call you make in a real streaming incident — not a definition to recite, but a tradeoff to weigh when the spinner hangs in production.

Goal

Confirm you can connect the latency model, the SSE lifecycle, delta accumulation, the partial-JSON contract, and the buffering failure mode — the synthesis the lesson built toward.

Quiz

A 400-token answer generates at 60 tok/s. Product asks you to 'make the response twice as fast' by enabling streaming. What do you tell them, in senior terms?

Quiz

Your chat app streams flawlessly against localhost but in staging every reply appears all at once after a long pause. Where do you look first?

Quiz

The model calls a tool; its arguments stream as input_json_delta chunks: '{"city": "San Fran', then 'cisco", "unit":', then ' "celsius"}'. When may you JSON.parse the buffer?

Quiz

A tool call fires in production with input = {} (empty args), but the same prompt works locally against the provider directly. Most likely cause?

Quiz

A connection drops at token 200 of a 400-token generation. What is the honest senior default for handling it?

Quiz

A reasoning model with chain-of-thought sits at 30s TTFT before emitting any token; users report the app is 'frozen.' What is the correct read and fix?

Recap

The unit's through-line is one model: streaming cuts perceived latency by collapsing TTFT, so the user starts reading while the rest of the answer generates, but total time and token count are unchanged. SSE delivers a typed lifecycle (message_start → content_block_start → content_block_delta ×N → content_block_stop → message_delta → message_stop) that you accumulate; text deltas can render immediately, but tool-call input_json_delta fragments are parseable only at content_block_stop. A tool call with empty args means a middle layer ate the deltas. A dropped stream is handled by retrying the whole turn. And the failure that erases every gain, with no dropped connection at all, is a buffering proxy that turns TTFT back into total time. Now when you see "works on localhost, batches in prod," you know which layer to blame and which config line to change first.
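The accumulation rule in the recap can be sketched in a few lines. This is a minimal illustration, not a provider SDK: it assumes events arrive already parsed into plain dicts, with field names (`index`, `content_block`, `delta`, `text`, `partial_json`) modeled on the lifecycle named above, and the `get_weather` tool name is hypothetical. Text deltas are appended as they arrive; tool-argument fragments are only buffered, and `json.loads` runs once, at `content_block_stop`.

```python
import json


def accumulate(events):
    """Fold a stream of already-parsed SSE event dicts into finished blocks."""
    open_blocks = {}  # index -> in-progress state for a block that has not closed
    finished = []     # completed blocks, in the order they closed
    for event in events:
        kind = event["type"]
        if kind == "content_block_start":
            block = event["content_block"]
            open_blocks[event["index"]] = {
                "type": block["type"],
                "name": block.get("name"),
                "text": "",
                "json_buffer": "",
            }
        elif kind == "content_block_delta":
            state = open_blocks[event["index"]]
            delta = event["delta"]
            if delta["type"] == "text_delta":
                state["text"] += delta["text"]  # safe to render immediately
            elif delta["type"] == "input_json_delta":
                state["json_buffer"] += delta["partial_json"]  # buffer only
        elif kind == "content_block_stop":
            state = open_blocks.pop(event["index"])
            if state["type"] == "tool_use":
                raw = state["json_buffer"]
                # The buffer is guaranteed complete only here, so parse once.
                args = json.loads(raw) if raw else {}
                finished.append(
                    {"type": "tool_use", "name": state["name"], "input": args}
                )
            else:
                finished.append({"type": "text", "text": state["text"]})
        # message_start, message_delta and message_stop carry no block content
        # in this sketch, so they are ignored.
    return finished


# Assumed event shapes; the tool-argument fragments are the ones from the quiz.
EXAMPLE_EVENTS = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Checking "}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "the weather."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "name": "get_weather"}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "San Fran'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'cisco", "unit":'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": ' "celsius"}'}},
    {"type": "content_block_stop", "index": 1},
    {"type": "message_delta"},
    {"type": "message_stop"},
]

result = accumulate(EXAMPLE_EVENTS)
```

Note the fallback to `{}` when the buffer is empty: if a middle layer drops the delta events, this sketch still produces a tool call, just with empty arguments, which is the production symptom described above. A stricter version might raise instead, so the missing deltas surface as an error rather than as a silently wrong call.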
