Tool Use in AI Agents: Function Calling, Structured Output, and the Reliability Problem
The underlying capability that every agent pattern depends on. How tools are exposed, the two competing approaches, and why hallucinated tool calls are still an engineering concern in 2026.
Tool use is the mechanism by which a language model invokes an external capability during its reasoning process. A tool is any callable function: an API endpoint, a database query, a code execution sandbox, a file read, a vector-store retrieval. The model is given a name, a description, and a parameter schema for each tool; it decides which tool to call and what arguments to pass; the host application executes the call and returns the result.
Without tool use, a language model is limited to its training knowledge and the user’s prompt. With tool use, it can read live data, modify state, and chain operations across systems. Tool use is the single capability that distinguishes a reasoning model from an agent.
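The describe-decide-execute-return loop above can be sketched in a few lines of host code. This is a provider-agnostic sketch: the `get_weather` tool and its stubbed return value are invented for illustration, and a real host would receive the tool name and arguments from a provider SDK rather than hard-code them.

```python
import json

# Hypothetical host-side dispatch table mapping tool names to callables.
# get_weather is a stub standing in for a real API call.
TOOLS = {
    "get_weather": lambda args: {"temp_c": 18, "conditions": "cloudy"},
}

def handle_tool_call(name: str, arguments_json: str) -> dict:
    """Execute the tool the model named, or report an unknown-tool error."""
    if name not in TOOLS:
        # A hallucinated tool name: surface the error back to the model.
        return {"error": f"unknown tool: {name}"}
    args = json.loads(arguments_json)
    return TOOLS[name](args)

result = handle_tool_call("get_weather", '{"city": "Oslo"}')
```

The host returns `result` to the model in the next turn, and the loop repeats until the model answers in plain text.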
Two approaches
Structured function calling
The provider-API approach. The host application defines tools as JSON schemas; the model emits a structured response that contains the function name and the arguments as a JSON object; the host parses and executes. OpenAI introduced this in mid-2023; Anthropic and Google followed with their own implementations. The benefit is reliability: the provider constrains the model's output to valid JSON arguments matching the schema, and modern models produce well-formed calls with high accuracy.
Freeform ReAct-style tool use
The pre-API approach, still useful where the structured API does not fit. The model is prompted to emit a specific text format (“Action: search[query=...]”); the host parses the format with regex or constrained decoding. Less reliable than structured function calling but more flexible: any format the prompt describes can express a call, including call shapes the structured API does not support.
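A minimal parser for that format might look like this. The `Action: tool[raw args]` shape follows the example above; the exact regex is an assumption, since each prompt defines its own trace format.

```python
import re

# Hedged sketch: parse a ReAct-style "Action: tool[raw args]" line.
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def parse_action(model_output: str):
    """Return (tool_name, raw_args) from a freeform trace, or None if absent."""
    match = ACTION_RE.search(model_output)
    if match is None:
        return None  # the model did not emit a well-formed action line
    return match.group(1), match.group(2)

parse_action("Thought: need data.\nAction: search[query=MCP spec]")
# -> ('search', 'query=MCP spec')
```

The `None` branch is where freeform parsing earns its reputation for lower reliability: malformed traces must be detected and re-prompted by the host.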
In practice, most production agents in 2026 use structured function calling for the bulk of their tool inventory and freeform parsing for edge cases. The two are not mutually exclusive.
Tool definition anatomy
A tool exposed to a language model has four parts. Each contributes to the model’s ability to select and use the tool correctly.
- Name. A short identifier (snake_case by convention). The model uses the name to refer to the tool.
- Description. A one- or two-sentence natural-language explanation of what the tool does and when to use it. The description has outsized influence on tool selection accuracy; vague descriptions produce vague selections.
- Parameter schema. A JSON schema describing each argument: name, type, whether required, allowed values, and a per-argument description. Tight schemas reduce hallucinated arguments.
- Optional examples. Some APIs (and most prompt designs) benefit from one or two example calls demonstrating canonical usage.
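Assembled, a definition in the JSON-schema shape used by the major provider APIs might look like this. The `get_order_status` tool is invented for illustration, and the `input_schema` field name follows Anthropic's shape (OpenAI names the equivalent field `parameters`).

```python
# Hypothetical tool definition: name, description, and parameter schema.
get_order_status_tool = {
    "name": "get_order_status",
    "description": (
        "Look up the current status of a customer order by its order ID. "
        "Use this when the user asks where an order is or whether it shipped."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The alphanumeric order identifier, e.g. 'A-10293'.",
            },
            "include_history": {
                "type": "boolean",
                "description": "Whether to include past status transitions.",
            },
        },
        "required": ["order_id"],
    },
}
```

Note how the per-argument descriptions and the `required` list do double duty: they steer the model's argument choices and give the host an exact contract to validate against.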
Anthropic’s tool-use documentation gives a particularly clear treatment of the description-quality argument: clear, specific tool descriptions improve tool-selection accuracy substantially compared to terse or generic descriptions.
The reliability problem
The defining engineering concern with tool use is hallucination: the model invents a tool that does not exist, calls a real tool with malformed arguments, or calls a tool when no tool was needed. Reliability has improved substantially across model generations (published benchmark scores for well-formed function calls have risen with every major model family release), but the problem has not been solved.
Three mitigations recur in production designs.
- Strict schema validation at the host: reject any call whose arguments do not match the schema, and feed the validation error back to the model with a retry instruction.
- Tool-use unit evals: a fixed set of representative tasks where the correct tool selection is known, run automatically on every prompt or model change. See the evaluation reference.
- Constrained decoding: at inference time, restrict the model's output to valid JSON paths that match the schema, eliminating malformed-syntax errors at the source.
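The first mitigation, strict validation with error feedback, can be sketched with a hand-rolled checker. A production host would use a full JSON-schema validator; the schema and field names below are invented for illustration.

```python
import json

# Invented schema: one required string argument, one optional boolean.
SCHEMA = {"required": ["order_id"], "types": {"order_id": str, "include_history": bool}}

def validate_args(arguments_json: str, schema: dict) -> tuple[bool, str]:
    """Minimal stand-in for a real JSON-schema validator: returns (ok, error)."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return False, f"arguments are not valid JSON: {exc}"
    for field in schema["required"]:
        if field not in args:
            return False, f"missing required argument: {field}"
    for field, expected in schema["types"].items():
        if field in args and not isinstance(args[field], expected):
            return False, f"argument {field!r} must be of type {expected.__name__}"
    return True, ""

ok, err = validate_args('{"include_history": true}', SCHEMA)
# ok is False; err names the missing order_id, ready to feed back to the model.
```

The error string is the important output: returned to the model with a retry instruction, it usually recovers a well-formed call on the next turn.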
MCP (Model Context Protocol)
Anthropic introduced the Model Context Protocol in late 2024 as a standardised way to expose tools and data sources to language models. Rather than each provider defining its own function-calling shape and each tool author writing one integration per provider, MCP defines a common protocol: a tool implemented as an MCP server can be consumed by any MCP-aware client, across providers.
Adoption progressed through 2025 and 2026. The MCP specification is publicly maintained, and several major orchestration frameworks now support MCP servers as a first-class tool source. The conceptual contribution is decoupling: the model no longer needs to know how each tool is implemented, and the tool no longer needs to know which model is calling it.
Parallel tool calls
Modern frontier models routinely emit multiple tool calls in a single response. The host code executes them in parallel and returns all results in the next round. The architectural implication is meaningful: tasks that decompose into independent sub-queries (research a list of items, gather data from several APIs, run several validations concurrently) complete in roughly the time of the slowest call rather than the sum of all calls.
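The host-side fan-out can be done with ordinary thread pooling. The two tools below are stubs standing in for real network calls; the call list represents what the model emitted in a single response.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub tools standing in for real API calls.
def search_web(query: str) -> str:
    return f"results for {query}"

def fetch_price(symbol: str) -> str:
    return f"price for {symbol}"

# Suppose the model emitted two independent tool calls in one response.
calls = [(search_web, "MCP adoption"), (fetch_price, "ACME")]

# Run them concurrently: total latency tracks the slowest call, not the sum.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fn, arg) for fn, arg in calls]
    results = [f.result() for f in futures]
```

All results are then returned to the model together in the next round, preserving the order of the original calls.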
Parallel tool calling combines naturally with the multi-agent pattern: a supervisor agent can fan out work to several sub-agents in parallel, then aggregate. The parallelism is exposed at the tool-call level rather than requiring explicit threading in the host code.
Tool-use evaluation benchmarks
Evaluating tool-use reliability has become its own sub-discipline. The publicly tracked benchmarks include BFCL (Berkeley Function-Calling Leaderboard), tau-bench (multi-turn tool use, Sierra 2024), Nexus Function Calling Benchmark, and the tool-use subsets of AgentBench and SWE-bench. Each evaluates a different slice of the problem: simple tool selection, multi-turn tool use, parallel tool calling, error recovery. The companion site benchmarkingagents.com covers the benchmark landscape in depth.
Frequently asked questions
What is function calling in LLMs?
Function calling is the mechanism by which a language model invokes an external capability (an API, a database query, code execution, a file read) by emitting a structured JSON object that names the function and supplies its arguments. The host application parses the JSON, executes the function, and returns the result back to the model. OpenAI, Anthropic, and Google all expose this mechanism in their APIs with similar shapes and slightly different schemas.
What is the difference between structured function calling and ReAct-style tool use?
In structured function calling, the model emits a JSON object that the host code parses programmatically. Tool selection and argument formatting are constrained by the API; reliability is high. In ReAct-style tool use (the original Yao et al. 2022 paper), the model emits a natural-language Action trace; the host code parses it with regex or constrained decoding. Reliability is lower but the format is flexible enough to express tool calls the structured-calling API does not support.
What is MCP?
The Model Context Protocol, introduced by Anthropic in 2024, is a standardised protocol for exposing tools and data sources to language models across providers. Rather than each provider defining its own function-calling format, MCP defines a common shape that any compliant client can use. The aim is interoperability: a tool implemented as an MCP server can be used by any MCP-aware agent, regardless of which model is underneath.
Can a model call multiple tools at once?
Yes. Modern models support emitting multiple tool calls in a single response, enabling the host code to execute them in parallel. This dramatically reduces latency on parallelisable tasks (fan-out research, multi-source data gathering). The architectural implication is that an agent can issue, say, five concurrent search queries in one round-trip and aggregate the results in the next.
Related
- How AI Agents Work - tools as a core component
- ReAct - the original ReAct-style tool use
- Orchestration tooling - frameworks that wrap tool calls
- Limitations - tool-call hallucination as a security concern
Sources and Further Reading
- T. Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, arXiv:2302.04761 (2023).
- S. Yao et al., ReAct: Synergizing Reasoning and Acting, arXiv:2210.03629 (2022).
- OpenAI, Function calling guide.
- Anthropic, Tool use documentation.
- Google, Gemini function-calling guide.
- Anthropic, Model Context Protocol specification.
- Berkeley Function-Calling Leaderboard, BFCL.
- Sierra Research, tau-bench, arXiv:2406.12045 (2024).