AI / LLM Integration AI · 02 · 01

Tool calls: the round-trip loop, schema validation, and the guard against runaway agents

Tool calling turns a model into a function caller, but the model only emits a request — you execute it. Every call is a full extra round trip, the arguments can be hallucinated, and an unguarded loop will burn tokens forever.

AI Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

Already know this unit? Take a 1-minute quick check →

A support agent ships. A customer types “cancel my last order.” The model confidently emits a tool call: POST /orders/{id}/cancel with id: "ord_9f3c" — an id it never saw, fabricated to look plausible. The handler ran eval-style: it trusted the arguments and fired the request. Wrong customer’s order, cancelled. Worse, the next week a different bug had the model re-calling a failing lookup_order tool 40 times in one turn before anyone killed it — $9 of tokens for a question that had no answer. Both bugs share one root cause: the loop trusted the model.

The round-trip loop is a contract, not a function

Tool calling makes a model behave like a function you call, but the wiring is inverted: the model is the caller and your code is the callee. You declare the tools; the model decides when to invoke one and emits a structured request; your code runs it. The model never executes anything itself.

The canonical shape is a while loop keyed on the response’s stop_reason:

Send the request with your tools array and the user message.
The model responds with stop_reason: "tool_use" and one or more tool_use blocks (each has a tool name and a JSON arguments object).
Execute each tool. Format outputs as tool_result blocks.
Send a new request with the full history plus those tool_result blocks.
Repeat while stop_reason is still "tool_use". Exit on "end_turn" (final answer), "max_tokens", "stop_sequence", or "refusal".

Together these steps mean every tool invocation is a network round-trip that the model itself never sees or controls — the loop is yours to own. Miss step 4 and the model never learns the result; skip the exit condition and the loop never stops.

The load-bearing detail seniors internalize: step 4 is a brand-new model call. A three-tool task is four model invocations, each re-sending the whole growing transcript. This is why tool latency dominates — and why the loop guard below is not optional.

Tools are JSON-schema declarations

Each tool is a name, a description, and an input_schema — a JSON Schema object describing the arguments. The model reads the schema the same way a developer reads a function signature. The schema is doing real work: it both tells the model how to call the tool and gives you the contract to validate against before you execute.

Schemas are not free. A typical tool definition costs roughly 500 tokens; ten tools is ~5,000 tokens of overhead on every request in the loop, since the full tools array is re-sent each round. A 10-tool agent running a 6-step task pays that 5,000-token tax six times. This is the first place prompt caching earns its keep — caching the static tools block can cut input cost 40–80% and improve time-to-first-token.

`tool_choice`	Behavior	When a senior picks it
`auto`	Model decides: call a tool or answer in prose	Default for agents; the model judges if a tool is needed
`any`	Must call some tool, model picks which	When prose is never a valid answer (a router that must dispatch)
`tool` (forced)	Must call this exact named tool	Structured extraction: force one schema to get guaranteed-shape JSON
`none`	Forbid all tools this turn	Force a text summary after results are in

Parallel calls cut latency — but not every chain can use them

Modern models (Claude 4-class) will, when several independent tools are needed, emit multiple tool_use blocks in a single response. You run them concurrently and return all the tool_result blocks together. That collapses three sequential round trips into one — a real latency win, because each round trip is a fresh model call of hundreds of ms to seconds.

The catch is dependency. Parallelism only helps when the calls are independent (get_weather(NYC) and get_weather(SF)). A chain where call two needs the output of call one (find_user then cancel_user_order) is inherently serial and cannot be parallelized — the model has to see the first result before it can fill the second tool’s arguments. You can set disable_parallel_tool_use: true to force one tool per turn when your execution layer can’t safely run things concurrently.

▸Why this works

Server-executed tools (web search, code execution) run their own loop inside the provider and have a built-in iteration cap. When they hit it mid-task the response comes back with stop_reason: "pause_turn" rather than "end_turn" — you re-send the conversation to continue. Client tools have no such built-in cap; the guard is yours to write.

Validate arguments — never trust them

When you wire a mutating endpoint to a model, ask yourself: what happens if the model fabricates a plausible-looking id? The opening disaster — a fabricated ord_9f3c fed straight into a cancel endpoint — is the canonical failure: the handler treated model output as trusted input. The model emits plausible JSON, not correct JSON. It can hallucinate an id, invent an enum value the schema never listed, omit a required field, or pass a string where a number belongs.

The senior discipline is a hard gate before execution:

Schema-validate the arguments against the tool’s input_schema (a validator like Pydantic or jsonschema). Reject malformed shapes outright; never eval or blindly destructure them.
Authorize and existence-check the referenced entities. A well-formed id is still untrusted — confirm the order exists and belongs to this caller before acting on it.
On rejection, return a tool_result with an error, not an exception that breaks the loop. The model reads the error and can self-correct on the next turn — that feedback path is the whole point of returning structured tool errors.

Treat tool arguments exactly like any other untrusted user input crossing a trust boundary, because that is precisely what they are.

The max-iteration guard against runaway loops

Without a turn cap, a confused model loops forever: it calls lookup_order, gets an error, calls it again with the same arguments, gets the same error, and repeats. Each iteration is a full model call billing the entire accumulating transcript — costs and tokens climb with every step. This is a real outage and a real bill (one stuck turn quietly burned $9 of tokens before a human intervened).

Two guards, both mandatory in production:

A hard iteration cap — for step in range(MAX_STEPS) (often 8–15). Hit the cap, stop the loop, return a graceful failure to the user. Never while True.
Loop / repeat detection — if the model calls the same tool with the same arguments twice in a row, that is a stuck signal. Break, or inject a message telling it the call already failed so it stops repeating.

Pick the best fit

Your agent loop calls real mutating endpoints (cancel order, issue refund). How do you handle the arguments the model emits?

Quiz

A 5-step agent task uses tools at every step. Roughly how many model calls is that, and why does it matter?

Quiz

The model returns a tool_use for cancel_order with id 'ord_9f3c', an id never present in the conversation. What's the senior move?

Order the steps

Order one safe iteration of a client-side tool-use loop:

1 Send request with tools array; read stop_reason from the response
2 If stop_reason is tool_use, extract each tool_use block's name and arguments
3 Schema-validate the arguments, then authorize/existence-check referenced entities
4 Execute valid calls (parallel if independent); format outputs as tool_result blocks
5 Send a new request with the full history + tool_result blocks — under the max-iteration cap

The model only emits a tool_use request; your code validates and runs it, returns a tool_result, and the loop repeats while stop_reason stays tool_use — exiting on end_turn.

Recall before you leave

01
Walk through why an unguarded tool-use loop is both a correctness risk and a cost risk, and the two guards you add.
02
Why must you validate tool arguments, and what does 'validate' actually mean for a mutating endpoint like cancel_order?

Recap

Tool calling inverts the usual contract: the model is the caller and your code is the callee. The model emits a structured tool_use request with a tool name and JSON arguments; your code executes it and returns a tool_result, and the loop repeats while stop_reason stays "tool_use". Three things define senior-grade tool use. First, latency and cost: every tool round trip is a brand-new model call re-sending the whole growing transcript plus a ~500-tokens-per-tool schema array, so a 5-step task is ~6 model calls — prompt caching the static tools block is the standard mitigation. Second, validation: model arguments are untrusted input that can be hallucinated, so you schema-validate, then authorize and existence-check referenced entities, then execute, returning errors as tool_result so the model can self-correct — never eval a fabricated id into a mutating endpoint. Third, the guard: a hard iteration cap plus repeat-detection, because an unguarded loop will re-call a failing tool forever and burn real money. Use tool_choice (auto/any/tool/none) to control whether and which tool fires, and parallel tool calls to collapse independent round trips — but only when the calls don’t depend on each other. Now when you see a while True loop driving tool calls in a code review, you know exactly where the bill will go — and which two lines to add first.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Grounded RAG ServiceA RAG demo that answers from a corpus is easy; a RAG service you'd trust in front of users is not. The hard part isn't retrieval, it's grounding: making the model say only what the retrieved text supports, attaching citations the reader can check, and proving with an eval set that the answers don't drift into confident fiction. You'll build the whole loop — chunk, embed, store, retrieve top-k, ground, cite, score — and feel exactly where it leaks.