AI Agent Observability: What It Takes to Debug Agents in Production
Most teams ship their first AI agent on optimism. It works in the demo, handles the happy path, and then meets real users. Three weeks later someone files a ticket: "The agent refunded a customer twice." Now you need to answer a deceptively hard question — why did it do that? — and your logs say 200 OK.
That gap is the problem AI agent observability exists to close. This guide covers what observability means specifically for agents (it's more than LLM logging), the signals worth capturing, and how to instrument a system you can actually debug under load.
Why agents are harder to observe than ordinary software
Traditional observability assumes deterministic code: same input, same path, same output. You trace a request, read the stack, and reproduce the bug. Agents break all three assumptions.
An agent run is a loop, not a function call. The model decides which tool to call, reads the result, and decides again — sometimes for dozens of steps. The control flow lives inside model outputs you didn't write. Two identical prompts can take different paths because of temperature, retrieved context, or a tool returning slightly different data. And the failure you care about is rarely a crash; it's a decision — a wrong tool, a misread result, a hallucinated argument.
So the unit of observability shifts. In a web service you trace a request. In an agent you have to trace a reasoning trajectory: the full sequence of model calls, tool invocations, inputs, outputs, and the state carried between them.
The signals that actually matter
Capturing "the prompt and the completion" is table stakes and not nearly enough. A useful agent trace records, for every step in the loop:
1. The decision context
What the model actually saw — system prompt, retrieved documents, conversation history, and the tool schemas available at that step. Most "the agent went rogue" incidents turn out to be "the agent saw something we didn't expect." You can't diagnose that without the exact input.
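One way to capture the decision context is to snapshot everything passed to the model at a step and attach a fingerprint, so two steps that saw identical input are easy to spot. The sketch below is illustrative only: the function name `snapshot_decision_context` and the field names are assumptions, not part of any particular framework.

```python
import hashlib
import json


def snapshot_decision_context(system_prompt, history, retrieved_docs, tool_schemas):
    """Capture what the model saw at one step, plus a stable fingerprint.

    All names here are hypothetical; adapt the fields to your own agent loop.
    Inputs are assumed to be JSON-serializable.
    """
    context = {
        "system_prompt": system_prompt,
        "history": list(history),
        "retrieved_docs": list(retrieved_docs),
        "tool_schemas": list(tool_schemas),
    }
    # Canonical JSON (sorted keys) so equal contexts produce equal hashes.
    canonical = json.dumps(context, sort_keys=True, ensure_ascii=False)
    context["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return context
```

Comparing fingerprints across runs gives a quick first check on whether a surprising decision came from surprising input or from the model itself.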
2. Tool calls and their results
Which tool was invoked, with what arguments, what it returned, and how long it took. Tool calls are where agents touch the real world — payments, emails, database writes — so they're where mistakes become expensive. Argument-level logging is non-negotiable here.
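Argument-level logging can be as small as a wrapper around each tool invocation. The following is a minimal sketch, assuming tools are plain Python callables that take keyword arguments; `call_tool_recorded` and the record fields are hypothetical names chosen for illustration.

```python
import time


def call_tool_recorded(tool_fn, args, log):
    """Run one tool call and append a record of its arguments, outcome, and duration.

    `log` is any list-like sink; a real system would write to durable storage.
    """
    record = {"tool": tool_fn.__name__, "args": dict(args)}
    started = time.perf_counter()
    try:
        record["result"] = tool_fn(**args)
        record["ok"] = True
        return record["result"]
    except Exception as exc:
        # Failed calls are recorded too, then re-raised for the caller to handle.
        record["error"] = repr(exc)
        record["ok"] = False
        raise
    finally:
        # Runs on both success and failure, so every call leaves a record.
        record["duration_s"] = time.perf_counter() - started
        log.append(record)
```

Recording failures as well as successes matters here: a tool that raised on the first attempt and succeeded on a retry is exactly the kind of sequence that can explain a duplicated side effect.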
3. Token and cost accounting per step
Agents are loops, and loops have a habit of running longer than you planned. A run that quietly makes forty model calls instead of four won't raise an error; it will just cost at least ten times as much, and often more, because each call typically re-sends a growing message history. Per-step token counts let you catch runaway loops, attribute spend to specific behaviors, and forecast unit economics before they surprise the finance team. (We go deeper on this in our piece on fast, cost-efficient inference.)
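Per-step accounting is mostly arithmetic once token counts are recorded. The sketch below assumes each step reports input and output token counts. The two price constants are placeholders, not any provider's real rates, and `StepUsage`, `step_cost_usd`, and `run_summary` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's published rates.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


@dataclass
class StepUsage:
    step: int
    input_tokens: int
    output_tokens: int


def step_cost_usd(usage):
    """Cost of a single model call, given its token counts."""
    return (
        usage.input_tokens * INPUT_USD_PER_MTOK
        + usage.output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def run_summary(usages):
    """Total spend for a run, plus the index of the step that cost the most."""
    return {
        "steps": len(usages),
        "total_usd": sum(step_cost_usd(u) for u in usages),
        "most_expensive_step": max(usages, key=step_cost_usd).step if usages else None,
    }
```

Keeping the cost at step granularity, rather than only per run, is what makes it possible to attribute spend to a specific behavior such as a retrieval step that keeps inflating the context.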
4. Latency at each hop
End-to-end latency hides where time goes. Step-level timing tells you whether you're waiting on the model, a slow tool, or your own orchestration code.
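Step-level timing can be sketched with a small context manager that accumulates wall-clock time per named hop. The hop labels ("model", "tool", "orchestration") and the helper names `timed` and `slowest_hop` are assumptions for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(timings, hop):
    """Add the elapsed time of the enclosed block to timings[hop]."""
    started = time.perf_counter()
    try:
        yield
    finally:
        # Accumulate, so repeated hops of the same kind add up within a step.
        timings[hop] = timings.get(hop, 0.0) + (time.perf_counter() - started)


def slowest_hop(timings):
    """Name of the hop that consumed the most time."""
    return max(timings, key=timings.get)
```

In an agent loop this would wrap each phase, for example `with timed(step_timings, "model"): ...` around the model call and `with timed(step_timings, "tool"): ...` around the tool call, with one `step_timings` dict stored per step.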
5. Replayability
The highest-value signal isn't a number — it's the ability to take a recorded run and replay it step by step. If you can reconstruct exactly what the agent saw and chose at each point, debugging goes from archaeology to a code review.
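One simple use of a recorded run is to feed each recorded input back through the decision logic and find the first step where the choice differs from what was recorded. This is a sketch under assumptions: each recorded step is a dict with hypothetical `input` and `decision` keys, and `decide` stands in for whatever produces a decision from an input (a model call, ideally configured to be as deterministic as your provider allows, or a stub in tests).

```python
def first_divergence(recorded_steps, decide):
    """Replay recorded inputs through `decide`.

    Returns the index of the first step whose fresh decision differs from the
    recorded one, or None if every step matches.
    """
    for index, step in enumerate(recorded_steps):
        if decide(step["input"]) != step["decision"]:
            return index
    return None


# Illustrative recorded run; the contents are made up for the example.
recorded = [
    {"input": "refund order A1", "decision": "issue_refund"},
    {"input": "refund issued", "decision": "reply"},
]
```

Because tool results are read from the recording rather than re-executed, a replay like this never touches the real world, which is what makes it safe to run against a trace that includes payments or emails.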
A simple instrumentation pattern
You don't need a heavyweight platform to start. The core idea is a structured trace object that travels with the run and records one entry per loop iteration. In pseudocode:
trace = Trace(run_id=uuid4())
done = False
while not done:
    step = trace.start_step()
    step.record_input(messages, available_tools)
    response = model.complete(messages, tools=available_tools)
    step.record_model_output(response, tokens=response.usage)
    if response.tool_call:
        result = run_tool(response.tool_call)  # capture args + result
        step.record_tool_call(response.tool_call, result)
        messages.append(result)  # feed the tool result back to the model
    else:
        done = True  # no tool requested: the model gave its final answer
    step.end()  # timing closes here
trace.finalize()  # persist the whole trajectory, not just the last turn
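For readers who want something runnable behind the `Trace` and `step` names in the pseudocode, here is one minimal way they could be implemented. It is a sketch, not a reference implementation: the stored fields, the use of wall-clock timestamps, and returning JSON from `finalize` are all assumptions.

```python
import copy
import json
import time
import uuid


class Step:
    """One loop iteration: what the model saw, what it returned, what the tool did."""

    def __init__(self, index):
        self.data = {"index": index, "started_at": time.time()}

    def record_input(self, messages, available_tools):
        # Deep copy so later mutation of the live message list cannot rewrite history.
        self.data["messages"] = copy.deepcopy(messages)
        self.data["available_tools"] = copy.deepcopy(available_tools)

    def record_model_output(self, output, tokens):
        self.data["model_output"] = output
        self.data["tokens"] = tokens

    def record_tool_call(self, tool_call, result):
        self.data["tool_call"] = tool_call
        self.data["tool_result"] = result

    def end(self):
        self.data["ended_at"] = time.time()


class Trace:
    """Collects every step of a run under a single run_id."""

    def __init__(self, run_id=None):
        self.run_id = str(run_id or uuid.uuid4())
        self.steps = []

    def start_step(self):
        step = Step(index=len(self.steps))
        self.steps.append(step)
        return step

    def finalize(self):
        """Serialize the whole trajectory as one JSON document.

        default=str is a fallback for values that are not JSON-serializable.
        """
        payload = {"run_id": self.run_id, "steps": [s.data for s in self.steps]}
        return json.dumps(payload, default=str)
```

A production version would write the result of `finalize` to durable storage and would probably emit steps incrementally, so a crashed run still leaves a partial trace.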
Two rules make this pay off. First, persist the whole trajectory, not just the final answer — the last turn is almost never where the bug was introduced. Second, make traces queryable: "show me every run that called issue_refund more than once" should be a filter, not a grep through log files.
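The `issue_refund` query can be sketched as a plain filter over stored traces. The example assumes each run is a dict with a `run_id` and a list of `steps`, and that a step's `tool_call` (if any) carries a `name`. That schema and the function name `runs_with_repeated_tool` are illustrative assumptions; in practice the same filter would more likely be a database query.

```python
from collections import Counter


def runs_with_repeated_tool(runs, tool_name, min_calls=2):
    """Return the run_ids whose trace has at least `min_calls` calls to `tool_name`."""
    matches = []
    for run in runs:
        calls = Counter(
            step["tool_call"]["name"]
            for step in run["steps"]
            if step.get("tool_call")  # skip steps where no tool was invoked
        )
        if calls[tool_name] >= min_calls:
            matches.append(run["run_id"])
    return matches
```

Calling `runs_with_repeated_tool(runs, "issue_refund")` then answers the question from the paragraph above directly, without searching raw log text.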
Build vs. adopt
You can assemble this from open-source tracing libraries and a data store, and many teams do. The trade-off is maintenance: schema changes, retention, access control, and correlating traces with spend all become your problem over time. The alternative is infrastructure that emits this trace by default.
This is the bet behind OBTO's Glass Receipt. Every agent run on the platform produces a complete, structured record — model calls, tool invocations, arguments, token counts, and cost — without extra instrumentation, and it's queryable and replayable out of the box. Because OBTO is open and self-hostable, the trace data stays in infrastructure you control; observability shouldn't require shipping every reasoning step to a vendor you can't audit. And because pricing is published rather than negotiated, the cost numbers in your traces line up with the numbers on our pricing page — no surprise multipliers.
Getting started
If you're standing up observability for the first time, a sane order of operations:
- Instrument the loop, not just the call. One trace entry per iteration.
- Log tool arguments and results — that's where real-world mistakes live.
- Track tokens and cost per step before you optimize anything.
- Make traces replayable so debugging is reproduction, not guesswork.
- Add alerts on behavioral signals — loop length, repeated sensitive tool calls, cost per run.
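The behavioral alerts in the last item can start as a few threshold checks over a finished trace. This is a sketch with made-up defaults: the thresholds, the `cost_usd` and `tool_call` fields on each step, and the function name `behavioral_alerts` are assumptions to adapt to your own trace schema.

```python
from collections import Counter


def behavioral_alerts(run, max_steps=10, max_cost_usd=0.50, sensitive_tools=("issue_refund",)):
    """Return human-readable alerts for one run; an empty list means nothing tripped."""
    alerts = []
    steps = run["steps"]

    # Loop length: a long run is often a stuck or wandering agent.
    if len(steps) > max_steps:
        alerts.append(f"loop length {len(steps)} exceeds {max_steps}")

    # Repeated sensitive tool calls: more than one call per run is suspicious.
    calls = Counter(s["tool_call"]["name"] for s in steps if s.get("tool_call"))
    for tool in sensitive_tools:
        if calls[tool] > 1:
            alerts.append(f"{tool} called {calls[tool]} times")

    # Cost per run, summed from per-step costs.
    total = sum(s.get("cost_usd", 0.0) for s in steps)
    if total > max_cost_usd:
        alerts.append(f"run cost ${total:.2f} exceeds ${max_cost_usd:.2f}")

    return alerts
```

Checks like these run after the fact. Stopping a double refund before it happens would need the same rule enforced inside the loop, before the tool call executes.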
You can wire this up by hand, or start from a platform that captures it natively — OBTO's getting-started guide walks through the Glass Receipt on your first agent run, and the AI workforce overview shows how the same trace data supports auditing fleets of agents in production.
The agents you ship this year will make thousands of autonomous decisions on your behalf. The only way to trust them is to be able to see what they did — and the only time to build that visibility is before the double refund, not after.