AI / LLM Integration
Tool calls: the round-trip loop, schema validation, and the guard against runaway agents
A support agent ships. A customer types “cancel my last order.” The model confidently emits a tool call: POST /orders/{id}/cancel with id: "ord_9f3c" — an id it never saw, fabricated to look plausible. The handler ran eval-style: it trusted the arguments and fired the request. Wrong customer’s order, cancelled. Worse, the next week a different bug had the model re-calling a failing lookup_order tool 40 times in one turn before anyone killed it — $9 of tokens for a question that had no answer. Both bugs share one root cause: the loop trusted the model.
The round-trip loop is a contract, not a function
Tool calling makes a model behave like a function you call, but the wiring is inverted: the model is the caller and your code is the callee. You declare the tools; the model decides when to invoke one and emits a structured request; your code runs it. The model never executes anything itself.
The canonical shape is a while loop keyed on the response’s stop_reason:
- Send the request with your
toolsarray and the user message. - The model responds with
stop_reason: "tool_use"and one or moretool_useblocks (each has a tool name and a JSON arguments object). - Execute each tool. Format outputs as
tool_resultblocks. - Send a new request with the full history plus those
tool_resultblocks. - Repeat while
stop_reasonis still"tool_use". Exit on"end_turn"(final answer),"max_tokens","stop_sequence", or"refusal".
The load-bearing detail seniors internalize: step 4 is a brand-new model call. A three-tool task is four model invocations, each re-sending the whole growing transcript. This is why tool latency dominates — and why the loop guard below is not optional.
Tools are JSON-schema declarations
Each tool is a name, a description, and an input_schema — a JSON Schema object describing the arguments. The model reads the schema the same way a developer reads a function signature. The schema is doing real work: it both tells the model how to call the tool and gives you the contract to validate against before you execute.
Schemas are not free. A typical tool definition costs roughly 500 tokens; ten tools is ~5,000 tokens of overhead on every request in the loop, since the full tools array is re-sent each round. A 10-tool agent running a 6-step task pays that 5,000-token tax six times. This is the first place prompt caching earns its keep — caching the static tools block can cut input cost 40–80% and improve time-to-first-token.
tool_choice | Behavior | When a senior picks it |
|---|---|---|
auto | Model decides: call a tool or answer in prose | Default for agents; the model judges if a tool is needed |
any | Must call some tool, model picks which | When prose is never a valid answer (a router that must dispatch) |
tool (forced) | Must call this exact named tool | Structured extraction: force one schema to get guaranteed-shape JSON |
none | Forbid all tools this turn | Force a text summary after results are in |
Parallel calls cut latency — but not every chain can use them
Modern models (Claude 4-class) will, when several independent tools are needed, emit multiple tool_use blocks in a single response. You run them concurrently and return all the tool_result blocks together. That collapses three sequential round trips into one — a real latency win, because each round trip is a fresh model call of hundreds of ms to seconds.
The catch is dependency. Parallelism only helps when the calls are independent (get_weather(NYC) and get_weather(SF)). A chain where call two needs the output of call one (find_user then cancel_user_order) is inherently serial and cannot be parallelized — the model has to see the first result before it can fill the second tool’s arguments. You can set disable_parallel_tool_use: true to force one tool per turn when your execution layer can’t safely run things concurrently.
Why this works
Server-executed tools (web search, code execution) run their own loop inside the provider and have a built-in iteration cap. When they hit it mid-task the response comes back with stop_reason: "pause_turn" rather than "end_turn" — you re-send the conversation to continue. Client tools have no such built-in cap; the guard is yours to write.
Validate arguments — never trust them
The model emits plausible JSON, not correct JSON. It can hallucinate an id, invent an enum value the schema never listed, omit a required field, or pass a string where a number belongs. The opening disaster — a fabricated ord_9f3c fed straight into a cancel endpoint — is the canonical failure: the handler treated model output as trusted input.
The senior discipline is a hard gate before execution:
- Schema-validate the arguments against the tool’s
input_schema(a validator like Pydantic orjsonschema). Reject malformed shapes outright; neverevalor blindly destructure them. - Authorize and existence-check the referenced entities. A well-formed
idis still untrusted — confirm the order exists and belongs to this caller before acting on it. - On rejection, return a
tool_resultwith an error, not an exception that breaks the loop. The model reads the error and can self-correct on the next turn — that feedback path is the whole point of returning structured tool errors.
Treat tool arguments exactly like any other untrusted user input crossing a trust boundary, because that is precisely what they are.
The max-iteration guard against runaway loops
Without a turn cap, a confused model loops forever: it calls lookup_order, gets an error, calls it again with the same arguments, gets the same error, and repeats. Each iteration is a full model call billing the entire accumulating transcript — costs and tokens climb with every step. This is a real outage and a real bill (one stuck turn quietly burned $9 of tokens before a human intervened).
Two guards, both mandatory in production:
- A hard iteration cap —
for step in range(MAX_STEPS)(often 8–15). Hit the cap, stop the loop, return a graceful failure to the user. Neverwhile True. - Loop / repeat detection — if the model calls the same tool with the same arguments twice in a row, that is a stuck signal. Break, or inject a message telling it the call already failed so it stops repeating.
Your agent loop calls real mutating endpoints (cancel order, issue refund). How do you handle the arguments the model emits?
A 5-step agent task uses tools at every step. Roughly how many model calls is that, and why does it matter?
The model returns a tool_use for cancel_order with id 'ord_9f3c', an id never present in the conversation. What's the senior move?
Order one safe iteration of a client-side tool-use loop:
- 1 Send request with tools array; read stop_reason from the response
- 2 If stop_reason is tool_use, extract each tool_use block's name and arguments
- 3 Schema-validate the arguments, then authorize/existence-check referenced entities
- 4 Execute valid calls (parallel if independent); format outputs as tool_result blocks
- 5 Send a new request with the full history + tool_result blocks — under the max-iteration cap
- 01Walk through why an unguarded tool-use loop is both a correctness risk and a cost risk, and the two guards you add.
- 02Why must you validate tool arguments, and what does 'validate' actually mean for a mutating endpoint like cancel_order?
Tool calling inverts the usual contract: the model is the caller and your code is the callee. The model emits a structured tool_use request with a tool name and JSON arguments; your code executes it and returns a tool_result, and the loop repeats while stop_reason stays "tool_use". Three things define senior-grade tool use. First, latency and cost: every tool round trip is a brand-new model call re-sending the whole growing transcript plus a ~500-tokens-per-tool schema array, so a 5-step task is ~6 model calls — prompt caching the static tools block is the standard mitigation. Second, validation: model arguments are untrusted input that can be hallucinated, so you schema-validate, then authorize and existence-check referenced entities, then execute, returning errors as tool_result so the model can self-correct — never eval a fabricated id into a mutating endpoint. Third, the guard: a hard iteration cap plus repeat-detection, because an unguarded loop will re-call a failing tool forever and burn real money. Use tool_choice (auto/any/tool/none) to control whether and which tool fires, and parallel tool calls to collapse independent round trips — but only when the calls don’t depend on each other.