AI Agent Cost Tracking: Know What Every Run Costs

OBTO Team · Insights from the Glass Box

Ask most teams what their AI agents cost and you'll get a monthly invoice total. Ask what a single agent task costs — one resolved ticket, one cleaned spreadsheet, one research report — and you'll usually get silence. That gap is the difference between running agents as an experiment and running them as a business function.

This guide covers how agent costs actually accumulate, why per-task attribution is harder than it looks, and how to build cost tracking that lets you answer "should we automate this?" with a number.

Why agent costs are different from chatbot costs

A chatbot conversation is roughly linear: one user message, one model response. An agent task is a loop. The agent reasons, calls a tool, reads the result, reasons again — often ten to fifty times per task. Three properties make this expensive in non-obvious ways.

Context accumulates. Every loop iteration re-sends the growing conversation history. A task that ends with 40,000 tokens of context didn't cost 40,000 input tokens — it cost the sum of every intermediate context window, which can be five to ten times larger. Input tokens, not output tokens, dominate most agent bills.
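To make that arithmetic concrete, here is a minimal sketch. It assumes a hypothetical 20-step task that starts with a 2,000-token prompt and adds 1,900 tokens of model output and tool results per step, with no prompt caching. The function name and every number are illustrative, not measurements.

```python
def cumulative_input_tokens(base_tokens: int, added_per_step: int, steps: int) -> tuple[int, int]:
    """Return (final_context_size, total_input_tokens_billed) for an agent loop
    that re-sends the full history on every iteration, with no caching."""
    context = base_tokens
    billed = 0
    for _ in range(steps):
        billed += context          # the whole history is sent as input again
        context += added_per_step  # model output and tool result join the history
    return context, billed


final_context, billed = cumulative_input_tokens(base_tokens=2_000, added_per_step=1_900, steps=20)
print(final_context, billed, round(billed / final_context, 1))
```

With these made-up numbers the task ends at 40,000 tokens of context but is billed for 401,000 input tokens, roughly ten times as many.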

Tool results are token-heavy. A single database query returning 200 rows can inject more tokens into context than the entire rest of the conversation. Verbose tool outputs are the most common silent cost driver — one reason tool design matters, as covered in our guide to building MCP tools.
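One way to keep tool results small is to project and cap them before they reach the model. The sketch below assumes a tool that returns a list of dict rows. The helper name, field names, and the 20-row cap are hypothetical, and character counts stand in only as a rough proxy for tokens.

```python
import json


def trim_tool_result(rows, keep_fields, max_rows=20):
    """Keep only the fields the agent needs and cap the number of rows returned."""
    trimmed = [
        {field: row[field] for field in keep_fields if field in row}
        for row in rows[:max_rows]
    ]
    return {
        "rows": trimmed,
        "total_rows": len(rows),            # tell the agent how much was omitted
        "truncated": len(rows) > max_rows,
    }


# Hypothetical 200-row query result with wide records.
rows = [
    {"id": i, "name": f"host-{i}", "owner": "ops", "notes": "x" * 200, "raw": "y" * 500}
    for i in range(200)
]
full = json.dumps(rows)
small = json.dumps(trim_tool_result(rows, keep_fields=["id", "name"]))
print(len(full), len(small))  # serialized size before and after trimming
```

Returning the total row count alongside the truncated rows lets the agent ask for more only when it actually needs it.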

Failure costs the same as success. An agent that retries a broken tool five times and then gives up has still consumed real tokens. Without per-run tracking, failed runs never show up as a line item on the invoice, yet their cost is fully included in the total.

The unit that matters: cost per completed task

Token prices per million are public; they're also nearly useless for budgeting on their own. The number that drives decisions is cost per completed task, which combines several streams: LLM tokens (input, output, and cache reads — priced differently), tool execution compute, downstream API fees incurred by tools, and retries or fallback runs.
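As a rough sketch of how those streams combine into one run cost: the per-million rates below are hypothetical placeholders, not any provider's actual prices, and the function and field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    # Hypothetical USD prices per million tokens; substitute your provider's published rates.
    input_per_m: float = 3.00
    cached_per_m: float = 0.30
    output_per_m: float = 15.00


def run_cost(uncached_input, cached_input, output,
             tool_compute_usd=0.0, api_fees_usd=0.0, rates=Rates()):
    """Sum the LLM token cost and tool-side costs for one run (retries included by the caller)."""
    llm = (
        uncached_input * rates.input_per_m
        + cached_input * rates.cached_per_m
        + output * rates.output_per_m
    ) / 1_000_000
    tools = tool_compute_usd + api_fees_usd
    return {"llm": llm, "tools": tools, "total": llm + tools}


cost = run_cost(uncached_input=10_000, cached_input=40_000, output=2_000, tool_compute_usd=0.002)
print(cost)
```

Keeping cached and uncached input as separate arguments matters because they are priced differently, so lumping them together overstates or understates the bill.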

Divide by successful completions, not total runs. A workflow that costs $0.04 per run but succeeds 60% of the time has a real unit cost of about $0.067 per completed task ($0.04 / 0.60), and the gap between those two numbers is where budgets quietly die.
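The same calculation as a small helper, assuming failed runs cost about as much on average as successful ones. The function name is ours, not part of any library.

```python
def effective_unit_cost(cost_per_run: float, success_rate: float) -> float:
    """Cost per completed task when every run, failed or not, costs cost_per_run on average."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


print(round(effective_unit_cost(0.04, 0.60), 3))
```

If failed runs are systematically cheaper or more expensive than successful ones, divide total spend by the count of completed runs instead of using an average.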

Once you have cost per task, comparisons become possible: against the human-hours equivalent, against a simpler non-agentic script, or against the same workflow on a cheaper model. Fast, low-cost inference changes this math materially — we saw it firsthand when we added Groq support, where shifting routine steps to high-speed open-weight models cut per-task costs without hurting quality on most workflows.

What cost attribution actually requires

To attribute spend, every model call and tool call needs to carry metadata — which agent, which workflow, which run, which step — and you need to aggregate it at each level.

Per-call metering

Capture token counts and latency for every LLM call as it happens, not reconstructed from the provider invoice three weeks later. Provider invoices aggregate at the API-key level, which tells you nothing about which agent spent the money.
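A minimal sketch of metering at call time. It assumes a hypothetical `call_model` callable that returns the response text along with input and output token counts; real provider clients expose usage differently, so treat the signature as a placeholder.

```python
import time
from dataclasses import dataclass


@dataclass
class CallRecord:
    agent: str
    workflow: str
    run_id: str
    step: int
    input_tokens: int
    output_tokens: int
    latency_ms: float


def metered_call(call_model, prompt, *, agent, workflow, run_id, step, sink):
    """Invoke call_model(prompt) and append a CallRecord to sink.

    call_model is a hypothetical callable returning (text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    text, input_tokens, output_tokens = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    sink.append(CallRecord(agent, workflow, run_id, step, input_tokens, output_tokens, latency_ms))
    return text


def fake_model(prompt):
    # Stand-in for a real provider client; the token counts here are made up.
    return "ok", len(prompt.split()), 1


records = []
metered_call(fake_model, "classify this ticket", agent="triage-bot",
             workflow="ticket-triage", run_id="run_8f3a", step=1, sink=records)
print(records[0])
```

The attribution metadata travels with the measurement itself, so nothing has to be joined back together from an invoice later.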

Run-level tracing

Group those calls into runs, so you can see one task end-to-end: every reasoning step, every tool invocation, and its cumulative cost. Cost tracking and observability are the same plumbing viewed from different angles — the trace that explains why an agent did something is the same trace that explains what it cost, as we argued in our agent observability guide.
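Grouping metered calls into runs can be sketched as follows, assuming each call has already been priced and tagged with a `run_id` and `step`. The dict shape and the function name are illustrative.

```python
from collections import defaultdict


def trace_runs(calls):
    """Group priced calls by run and return {run_id: [(step, kind, cumulative_cost), ...]}."""
    by_run = defaultdict(list)
    for call in calls:
        by_run[call["run_id"]].append(call)

    traces = {}
    for run_id, run_calls in by_run.items():
        running = 0.0
        trace = []
        for call in sorted(run_calls, key=lambda c: c["step"]):
            running += call["cost"]
            trace.append((call["step"], call["kind"], round(running, 6)))
        traces[run_id] = trace
    return traces


# Made-up calls, deliberately out of order.
calls = [
    {"run_id": "run_a", "step": 2, "kind": "tool:cmdb_lookup", "cost": 0.0005},
    {"run_id": "run_a", "step": 1, "kind": "llm", "cost": 0.002},
    {"run_id": "run_a", "step": 3, "kind": "llm", "cost": 0.001},
    {"run_id": "run_b", "step": 1, "kind": "llm", "cost": 0.004},
]
for run_id, trace in trace_runs(calls).items():
    print(run_id, trace)
```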

Workflow-level rollups

Aggregate runs by workflow and time period to spot trends: a prompt change that doubled context size, a new tool that returns bloated payloads, a model version that reasons longer.
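A rollup can be as simple as grouping run totals by workflow and day. This sketch assumes each run record carries a `workflow`, an ISO date string in `day`, and a `cost`; the workflow names and dates are made up.

```python
from collections import defaultdict


def rollup(runs):
    """Return {(workflow, day): {"runs": n, "avg_cost": mean cost per run}}."""
    totals = defaultdict(lambda: {"runs": 0, "cost": 0.0})
    for run in runs:
        bucket = totals[(run["workflow"], run["day"])]
        bucket["runs"] += 1
        bucket["cost"] += run["cost"]
    return {
        key: {"runs": t["runs"], "avg_cost": t["cost"] / t["runs"]}
        for key, t in totals.items()
    }


runs = [
    {"workflow": "ticket-triage", "day": "2025-01-06", "cost": 0.03},
    {"workflow": "ticket-triage", "day": "2025-01-06", "cost": 0.04},
    {"workflow": "ticket-triage", "day": "2025-01-07", "cost": 0.07},
    {"workflow": "report-draft", "day": "2025-01-06", "cost": 0.20},
]
for key, stats in sorted(rollup(runs).items()):
    print(key, stats)
```

In this toy data the average cost per ticket-triage run doubles from one day to the next, which is the kind of jump worth tracing back to a prompt, tool, or model change.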

Anomaly visibility

The most expensive agent runs are usually pathological — a loop that didn't terminate, a retry storm. You want these surfaced, not averaged away.
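One simple way to surface them is to compare each run against the median rather than the mean, since a single runaway run drags the mean toward itself. The 5x multiplier below is an arbitrary assumption to tune per workflow, and the costs are made up.

```python
from statistics import median


def flag_outliers(run_costs, multiplier=5.0):
    """Return the runs whose cost exceeds multiplier times the median run cost."""
    typical = median(run_costs.values())
    return {run_id: cost for run_id, cost in run_costs.items() if cost > multiplier * typical}


run_costs = {"run_a": 0.03, "run_b": 0.04, "run_c": 0.035, "run_d": 0.90}
print(flag_outliers(run_costs))
```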

A useful per-run ledger looks something like this:

{
  "run_id": "run_8f3a",
  "workflow": "ticket-triage",
  "status": "completed",
  "steps": 14,
  "tokens": { "input": 182400, "cached": 121000, "output": 9300 },
  "tool_calls": [
    { "tool": "cmdb_lookup", "ms": 410, "result_tokens": 2900 },
    { "tool": "update_ticket", "ms": 220, "result_tokens": 140 }
  ],
  "cost": { "llm": 0.0291, "tools": 0.0040, "total": 0.0331 }
}

This is precisely what OBTO's Glass Receipt provides: a per-run, per-step ledger of every model call, tool call, and token, queryable in real time. It exists because we couldn't run our own agents responsibly without it.

Five cost levers, in order of impact

1. Trim tool outputs. Return only the fields the agent needs. This is routinely a 30–60% context reduction for data-heavy workflows.
2. Use prompt caching. Stable system prompts and tool definitions should hit cache rates of 80%+ in agent loops. Cached input tokens typically cost an order of magnitude less.
3. Route steps to the right model. Use a frontier model for planning and a fast, cheap model for extraction, formatting, and summarization. Multi-model routing is a first-class pattern on OBTO, not a hack.
4. Cap the loop. Set hard limits on iterations and context size per run. Pathological runs should fail fast and cheap.
5. Fix the failure rate. Improving a workflow's success rate from 70% to 90% cuts effective unit cost by 22% (runs needed per completed task fall from 1/0.70 ≈ 1.43 to 1/0.90 ≈ 1.11) without touching a single token price.
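Capping the loop is the easiest of these to sketch in code. The example below assumes a hypothetical `step_fn` that runs one agent iteration and reports whether the task is done and how large the context has grown. The class names and limits are illustrative, not an OBTO API.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run crosses a hard limit."""


class RunBudget:
    def __init__(self, max_steps: int, max_context_tokens: int):
        self.max_steps = max_steps
        self.max_context_tokens = max_context_tokens

    def check(self, step: int, context_tokens: int) -> None:
        if step > self.max_steps:
            raise BudgetExceeded(f"step {step} exceeds limit of {self.max_steps}")
        if context_tokens > self.max_context_tokens:
            raise BudgetExceeded(
                f"context of {context_tokens} tokens exceeds limit of {self.max_context_tokens}"
            )


def run_agent(step_fn, budget):
    """Run step_fn until it reports done, checking the budget before every step.

    step_fn(step) is a hypothetical callable returning (done, context_tokens).
    """
    step = 0
    context_tokens = 0
    while True:
        step += 1
        budget.check(step, context_tokens)  # fail before paying for another call
        done, context_tokens = step_fn(step)
        if done:
            return step


# A step function that never finishes, standing in for a pathological run.
try:
    run_agent(lambda step: (False, step * 1_000), RunBudget(max_steps=10, max_context_tokens=50_000))
except BudgetExceeded as exc:
    print("stopped:", exc)
```

Checking the budget before each step, instead of after, means the run stops without paying for the call that would have crossed the limit.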

Pricing models: structure matters as much as rate

When you evaluate platforms, look past the per-token rate to the structure. Seat-based pricing punishes you for adding teammates to a system whose whole point is reducing labor. Opaque "credits" make per-task math impossible by design. OBTO's pricing is deliberately structured for auditability — flat platform tiers plus metered tokens at published rates, with the Glass Receipt showing exactly where every token went. You can disagree with our rates; you'll never have to guess at them.

Getting started

Instrumenting cost tracking retroactively is painful; getting it from day one is nearly free if your platform does it natively. If you want to see per-run cost tracing on a real workflow, the getting-started guide takes about ten minutes, and the free Builder tier includes Glass Box tracing — enough to put a real number on your first automated task.

Agents earn their place in production the same way any system does: when the unit economics are visible and they work. Make the costs visible first. The rest of the decision gets much easier.