AI / LLM Integration AI · 02 · 10

Tool calls: build a robust tool-calling loop

Hands-on project — build a robust tool-calling loop with schema validation, authorization, per-tool timeouts, bounded retries, and a runaway guard, then prove it survives hostile and failing tools.

AI Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Reading about hallucinated ids and runaway loops is not the same as building a loop that survives them. Wire a real tool-calling agent against tools that lie, hang, and fail — then prove it stays correct, bounded, and cheap with evidence at every step.

Goal

Turn the unit’s mental model into a reusable harness: a tool-use loop that schema-validates and authorizes every argument, time-boxes and retries each tool, caps iterations and detects repeats, and returns every failure as a tool_result the model can recover from.

Project

0 of 8

Objective

Build a production-grade tool-calling loop around any chat model with tool use (Claude, or an OpenAI-compatible function-calling API). It must execute a multi-step task against at least three tools — including one mutating, authorization-gated tool — and stay correct, bounded, and cost-aware when tools return hallucinated arguments, hang, or fail.

Requirements

Acceptance criteria

A test where the model is fed (or prompted into) a hallucinated order_id: the loop rejects it at the authorization/existence check, returns a tool_result error, and the model recovers — the mutating endpoint never fires on the bad id. Show the logs.
A test where one tool hangs: the per-tool timeout fires, returns a tool_result error, and the loop continues or exits gracefully — the turn never stalls indefinitely.
A test where the model gets stuck repeating one failing call: the iteration cap and/or repeat-detection halts it within MAX_STEPS and returns a graceful failure to the user — captured cost stays bounded.
A before/after latency measurement for two independent calls showing the parallel path is faster than sequential, with the tool_result ids correctly matched.
A short write-up: the three validation layers, where each guard lives, the retry policy (and why the mutating tool is excluded), and the per-turn token/call numbers.

Senior stretch

Add an on-call runbook: how to read the loop's logs (model calls per turn, tokens, which guard fired), the top failure modes (hallucinated arg, hang, runaway), and the fix for each.
Add structured-output mode: force a single named tool (tool_choice: tool) to extract guaranteed-shape JSON, and contrast its reliability with parsing free-form prose.
Add prompt caching on the tools block and measure the actual input-cost and time-to-first-token reduction across a multi-step run.
Add a budget guard that aborts the turn when cumulative tokens or model calls exceed a per-request ceiling, returning a graceful partial result — defense against an allocation-style token DoS.

Recap

This is the loop you will run in every real agent: drive it on stop_reason, treat every tool argument as untrusted input (schema-validate, then authorize and existence-check before any mutation), return all failures as tool_result so the model self-corrects, time-box and selectively retry tools, and bound the whole thing with an iteration cap and repeat-detection. Parallelize independent calls, cache the static tools block, and log calls and tokens. Building it once against tools that lie, hang, and fail makes the production version muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.