AI / LLM Integration
Tool calls: build a robust tool-calling loop
Reading about hallucinated ids and runaway loops is not the same as building a loop that survives them. Wire a real tool-calling agent against tools that lie, hang, and fail — then prove it stays correct, bounded, and cheap with evidence at every step.
Turn the unit’s mental model into a reusable harness: a tool-use loop that schema-validates and authorizes every argument, time-boxes and retries each tool, caps iterations and detects repeats, and returns every failure as a tool_result the model can recover from.
Build a production-grade tool-calling loop around any chat model with tool use (Claude, or an OpenAI-compatible function-calling API). It must execute a multi-step task against at least three tools — including one mutating, authorization-gated tool — and stay correct, bounded, and cost-aware when tools return hallucinated arguments, hang, or fail.
- A test where the model is fed (or prompted into) a hallucinated order_id: the loop rejects it at the authorization/existence check, returns a tool_result error, and the model recovers — the mutating endpoint never fires on the bad id. Show the logs.
- A test where one tool hangs: the per-tool timeout fires, returns a tool_result error, and the loop continues or exits gracefully — the turn never stalls indefinitely.
- A test where the model gets stuck repeating one failing call: the iteration cap and/or repeat-detection halts it within MAX_STEPS and returns a graceful failure to the user — captured cost stays bounded.
- A before/after latency measurement for two independent calls showing the parallel path is faster than sequential, with the tool_result ids correctly matched.
- A short write-up: the three validation layers, where each guard lives, the retry policy (and why the mutating tool is excluded), and the per-turn token/call numbers.
- Add an on-call runbook: how to read the loop's logs (model calls per turn, tokens, which guard fired), the top failure modes (hallucinated arg, hang, runaway), and the fix for each.
- Add structured-output mode: force a single named tool (tool_choice: tool) to extract guaranteed-shape JSON, and contrast its reliability with parsing free-form prose.
- Add prompt caching on the tools block and measure the actual input-cost and time-to-first-token reduction across a multi-step run.
- Add a budget guard that aborts the turn when cumulative tokens or model calls exceed a per-request ceiling, returning a graceful partial result — defense against an allocation-style token DoS.
This is the loop you will run in every real agent: drive it on stop_reason, treat every tool argument as untrusted input (schema-validate, then authorize and existence-check before any mutation), return all failures as tool_result so the model self-corrects, time-box and selectively retry tools, and bound the whole thing with an iteration cap and repeat-detection. Parallelize independent calls, cache the static tools block, and log calls and tokens. Building it once against tools that lie, hang, and fail makes the production version muscle memory.