Observability for AI Agents
A category reference. What observability for agents covers, how it differs from traditional APM, and the major tools available as of 2026.
What observability provides
For agents, observability is a stack of capabilities, not a single product. Six features recur across the major tools; a minimal instrumentation sketch follows the list.
- Distributed tracing. Every LLM call, tool call, and sub-agent hop becomes a span in a trace. The full trace tree shows what happened in what order, with timing.
- Prompt and completion logging. The full text of every model input and output, stored for inspection and replay.
- Cost tracking per trace. Tokens in and out, multiplied by per-model rates, attributed back to the user, agent role, or task type.
- Latency breakdown. How much time was spent in the model, in tool calls, in framework overhead.
- Error clustering. Failed traces grouped by error type, surfaced for triage.
- Regression detection. Comparison of trace quality before and after a prompt or model change, often via online evaluation.
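A minimal sketch of the tracing and token-accounting features above, using the OpenTelemetry Python SDK. The `gen_ai.*` attribute names follow the GenAI semantic conventions covered later on this page; `call_model` and `ModelResponse` are hypothetical stubs standing in for a real model client.

```python
from dataclasses import dataclass

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export spans to stdout for the sketch; production would use an OTLP exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

@dataclass
class ModelResponse:
    # Hypothetical stand-in for a real model client's response object.
    text: str
    input_tokens: int
    output_tokens: int

def call_model(prompt: str) -> ModelResponse:
    # Hypothetical stub; a real implementation calls your model provider.
    return ModelResponse(text="ok", input_tokens=len(prompt.split()), output_tokens=1)

def agent_step(prompt: str) -> str:
    # One span per model call. Tool calls and sub-agent hops get their own
    # child spans, so the resulting trace tree mirrors the agent's execution.
    with tracer.start_as_current_span("llm.call") as span:
        response = call_model(prompt)
        span.set_attribute("gen_ai.request.model", "model-large")  # placeholder name
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)
        return response.text

print(agent_step("Summarise the incident report."))
```

With timing captured automatically by the span and token counts attached as attributes, the latency and cost features above fall out of the same instrumentation.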
How agent observability differs from application observability
Traditional APM tools (Datadog, New Relic, Honeycomb) track function calls, HTTP requests, and database queries. Their data model assumes spans are short, structured, and primarily numeric. Agent observability adds spans that carry multi-kilobyte text bodies (the prompt, the completion), arguments and results from tool calls, and a reasoning trace that may span many seconds across many model calls.
The major APM vendors have begun adding LLM-specific support. Specialised tools still cover the agent-specific data model more thoroughly, particularly for prompt management, evaluation integration, and dataset collection from production traces.
Category vocabulary
- OpenTelemetry-compatible vs proprietary. Whether traces emit in the OTel format. OTel-compatible tools avoid lock-in on the storage backend.
- Dataset collection. The ability to sample traces from production for later use in evaluation, fine-tuning, or prompt iteration.
- Online-eval hooks. Running an automatic evaluator (LLM-as-judge, deterministic check) on a sampled subset of live traces, surfacing quality drift.
- Sampling. Recording all spans, a percentage, or only spans matching specific criteria. Volume management at scale.
- Data retention. How long full prompt and completion bodies are stored. Constrained by data-protection requirements as much as by storage cost.
- PII redaction. Automated or rules-based scrubbing of personal data from prompts and completions before storage (a sketch follows this list).
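A minimal sketch of the sampling and PII-redaction items above. The regexes are illustrative, not a complete PII ruleset; the 10% sample rate and the `prompt.*` attribute names are assumptions for the example, not a standard.

```python
import random
import re

# Illustrative patterns only; a real deployment needs a fuller PII ruleset.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

SAMPLE_RATE = 0.10  # assumed policy: keep full text bodies for 10% of traces

def record_prompt(span, prompt: str) -> None:
    # Cheap numeric attributes are always recorded; the expensive text body
    # is sampled, and scrubbed before it leaves the process. In a real system
    # the sampling decision should be made once per trace, not per span.
    span.set_attribute("prompt.length_chars", len(prompt))  # hypothetical attribute
    if random.random() < SAMPLE_RATE:
        span.set_attribute("prompt.text", redact(prompt))   # hypothetical attribute
```

Routing all text-attribute writes through one helper like this keeps the sampling and redaction policy in a single place instead of scattered across the agent code.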
The major tools
A reference list of tools that practitioners use as of 2026. Documentation links are collected under Sources and Further Reading; this page does not rank them.
- Langfuse. Open-source core with managed cloud option. Trace viewer, prompt management, dataset collection.
- LangSmith. LangChain ecosystem. Tight integration with LangGraph; commercial.
- Arize Phoenix. Phoenix is the open-source observability library for LLM apps; Arize is the commercial platform.
- Braintrust. Eval-first observability platform. Datasets, experiments, traces in one product.
- Humanloop. Prompt management, evaluation, and observability. Strong human-feedback collection.
- HoneyHive. Observability and evaluation platform. Strong on PII redaction and compliance features.
- Helicone. Open-source core with proxy-based capture. Easy bring-up by routing API traffic.
- Weights & Biases Weave. Part of the W&B platform. Suited to teams already using W&B for ML.
Self-host vs hosted
The decision turns on three factors. Data sensitivity: prompts and completions can contain PII, trade secrets, or regulated data that cannot leave your infrastructure. Operational overhead: running an observability stack adds engineering work that small teams may not want to absorb. Cost at scale: managed services price by trace volume, which becomes significant once traffic grows.
A common pattern is starting with a managed service for speed and migrating to a self-hosted instance of an open-source tool (Langfuse, Phoenix) once one of those factors becomes binding. Both Langfuse and Phoenix offer self-hostable open-source cores; both also offer managed cloud versions for teams that do not want to operate them.
OpenTelemetry convergence
The OpenTelemetry semantic conventions for GenAI tracing were drafted through 2024 and stabilised through 2025-2026. They define standard span attributes for LLM calls: model name, input tokens, output tokens, prompt content, response content, finish reason. Adoption across observability tools is uneven but converging; OpenTelemetry-compatible tools are the safest bet for avoiding lock-in on trace storage.
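To illustrate the portability argument: code instrumented against the OpenTelemetry API can switch trace storage backends with a configuration change rather than re-instrumentation. A sketch, assuming an OTLP-over-HTTP backend; the endpoint URL and authorization header are placeholders, not a real vendor's values.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Point the exporter at any OTLP-capable backend: a managed service,
# a self-hosted tool, or an OpenTelemetry Collector in between.
exporter = OTLPSpanExporter(
    endpoint="https://observability.example.com/v1/traces",  # placeholder
    headers={"Authorization": "Bearer <token>"},              # placeholder
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```

Because the agent code only references the OTel API, moving between backends, or to a self-hosted collector, means changing this exporter block and nothing else.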
Frequently asked questions
How is agent observability different from APM tools like Datadog?
Traditional application performance monitoring tracks HTTP calls, database queries, and infrastructure metrics. Agent observability adds prompts, completions, tool-call arguments and results, and the reasoning trace. The data shapes are fundamentally different: a span in APM is a function call; a span in agent observability often carries multi-kilobyte prompt and completion bodies. APM tools have begun adding LLM support, but specialised tools cover the agent-specific data model more thoroughly.
Should I self-host or use a managed observability service?
Three factors: data sensitivity (some prompts and completions contain PII or trade secrets that cannot leave your infrastructure), operational overhead (self-hosting an observability stack is its own engineering project), and cost at scale (managed services price by trace volume, which can become significant). Many teams start managed for speed and migrate to self-hosted Langfuse or Phoenix when one of these factors becomes binding.
What is OpenTelemetry GenAI?
OpenTelemetry is the open standard for distributed tracing across software systems. The OpenTelemetry GenAI semantic conventions, drafted through 2024 and stabilised through 2025-2026, define standard span attributes for LLM calls (model name, input tokens, output tokens, prompt content, response content, finish reason). Adoption is uneven across vendors but converging; tools that emit OpenTelemetry-compatible traces avoid vendor lock-in for the underlying trace storage.
What metrics are most useful in production?
Four categories cover most production debugging: cost per task (tokens used end-to-end, with attribution to user, agent role, or task type), latency breakdown (model time vs tool time vs framework overhead), error rates by type (tool-call hallucination, schema validation failure, timeout), and quality metrics on a sampled subset (LLM-as-judge or human review). The first two are easy to instrument; the last two require a feedback loop with the evaluation system.
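For the cost-per-task category, the arithmetic is simple enough to sketch. The per-million-token rates below are hypothetical placeholders, not any provider's actual pricing.

```python
# Back-of-envelope cost attribution per trace. Substitute your provider's
# current pricing for these assumed figures (USD per million tokens).
RATES_PER_MTOK = {
    "model-large": {"input": 3.00, "output": 15.00},
    "model-small": {"input": 0.15, "output": 0.60},
}

def trace_cost(spans: list[dict]) -> float:
    # Sum token usage across every LLM span in the trace, priced per model.
    total = 0.0
    for s in spans:
        rate = RATES_PER_MTOK[s["model"]]
        total += s["input_tokens"] / 1e6 * rate["input"]
        total += s["output_tokens"] / 1e6 * rate["output"]
    return total

# e.g. one planner call plus three worker calls:
spans = [
    {"model": "model-large", "input_tokens": 4_000, "output_tokens": 800},
    {"model": "model-small", "input_tokens": 1_500, "output_tokens": 300},
    {"model": "model-small", "input_tokens": 1_500, "output_tokens": 300},
    {"model": "model-small", "input_tokens": 1_500, "output_tokens": 300},
]
print(f"${trace_cost(spans):.4f}")  # ≈ $0.0252 with the rates above
```

Grouping the same sum by user, agent role, or task type gives the attribution described above.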
Related
- Orchestration tooling. Frameworks integrate observability hooks.
- Evaluation tooling. Paired with observability in production.
- Honest Limitations. Why production agents need observability.
- How to Build. Observability from day one.
Sources and Further Reading
- OpenTelemetry, GenAI semantic conventions.
- Langfuse, documentation.
- LangSmith, documentation.
- Arize Phoenix, documentation.
- Braintrust, documentation.
- Humanloop, documentation.
- HoneyHive, documentation.
- Helicone, documentation.
- Weights & Biases Weave, documentation.