How we built the Glass Receipt

OBTO Team · Insights from the Glass Box

The first time one of our own agents ran up a bill we couldn't explain, we had exactly two things to go on: the prompt that went in and the invoice that landed three weeks later. Everything in between was dark. Somewhere in that dark, an agent had retried a broken tool a few hundred times before giving up. We noticed only because the total looked wrong.

That stung, because we were busy building a platform whose entire pitch is that you should own your intelligence instead of renting a black box. And there we were, squinting at our own. The Glass Receipt began as the fix for that embarrassment.

A receipt, not a log

Logs already existed. The trouble with a log is that it's written for an engineer at 2 a.m. who already knows what they're hunting for. We wanted something a finance lead, a support manager, or a brand-new user could read without a decoder ring.

So we borrowed the most boring, most trusted object in commerce. When you buy coffee you don't get a "trace." You get an itemized slip: each line, the tax, the total. It's small, complete, and yours. Every agent run on OBTO should hand back the same thing: what it did, which models it called, how many tokens each step burned, what that cost, and what data it touched.

What every run records

A receipt has to capture four things, and capture them as the run happens rather than reconstruct them afterward:

Per-call metering. Tokens in, out, and cached; the model used; latency. Numbers reconstructed from an invoice arrive too late to change a decision.
The step trace. Every reasoning step and tool call, in order, so a single run reads top to bottom like a short story.
Cost, priced at the moment of the call, from published model rates. An API-key-level statement weeks later can't tell you which agent spent the money.
Status, including failure. Retries, timeouts, and dead ends are line items, not footnotes.
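
As a rough illustration of those four items, here is a minimal sketch of a collector that stamps each call as it is recorded. The `Receipt` class, its method names, and its field names are assumptions made for this example, not OBTO's actual schema.

```javascript
// Hypothetical receipt collector; field names are illustrative, not OBTO's schema.
class Receipt {
  constructor(runId) {
    this.runId = runId;
    this.items = []; // kept in the order the calls were recorded
  }

  // Record one model or tool call at the moment it finishes.
  record({ kind, name, tokens = null, ms, cost = 0, status = "ok" }) {
    const item = { step: this.items.length + 1, kind, name, tokens, ms, cost, status };
    this.items.push(item);
    return item;
  }

  // Failed or timed-out calls stay in the trace as ordinary line items.
  failures() {
    return this.items.filter((item) => item.status !== "ok");
  }

  totalCost() {
    return this.items.reduce((sum, item) => sum + item.cost, 0);
  }
}

const receipt = new Receipt("run_demo");
receipt.record({
  kind: "model",
  name: "model-a",
  tokens: { input: 1200, cached: 800, output: 90 },
  ms: 640,
  cost: 0.004,
});
receipt.record({ kind: "tool", name: "lookup_account", ms: 5000, status: "timeout" });

// Reads top to bottom, one line per step.
console.log(receipt.items.map((i) => `${i.step}. ${i.kind} ${i.name} [${i.status}]`).join("\n"));
```

The step number comes from the order of recording rather than from the caller, so the trace cannot be reordered by accident.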

Capture at the boundary

One decision made the rest easy: instrument the boundary, not the business logic. Every model call and every tool call on OBTO already passes through a thin wrapper, and the wrapper writes the receipt, not the person who wrote the agent. Roughly:

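async function meteredCall(model, messages, ctx) {
  const t0 = Date.now();
  const entry = { model: model.id, status: "ok" };
  try {
    const res = await model.run(messages);
    entry.tokens = res.usage;                   // input, cached, output
    entry.cost   = price(model.id, res.usage);  // computed now, not from a bill
    return res;
  } catch (err) {
    entry.status = "failed";                    // a failed call is still a line item
    entry.error  = String(err);
    throw err;                                  // the caller still sees the failure
  } finally {
    // Runs on success and on failure, so no call escapes the receipt.
    ctx.receipt.record({ step: ctx.step++, ...entry, ms: Date.now() - t0 });
  }
}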

Because the metering lives at the boundary, you can't forget to add it and you can't opt a workflow out of being measured. A receipt isn't a feature you remember to switch on. It's part of the cost of making a call at all.
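
The wrapper above leans on a `price` helper that the snippet doesn't show. A minimal sketch of what such a helper could look like follows. The rate table, the model name `model-a`, and the per-million-token figures are placeholders invented for this example, and it assumes the three usage counts are disjoint.

```javascript
// Hypothetical rate table in USD per million tokens.
// Placeholder numbers for illustration only, not real published prices.
const RATES_PER_MILLION = new Map([
  ["model-a", { input: 3.0, cached: 0.3, output: 15.0 }],
]);

// Assumes usage.input, usage.cached and usage.output are disjoint token counts.
function price(modelId, usage) {
  const rate = RATES_PER_MILLION.get(modelId);
  if (!rate) {
    // Refuse to guess: an unpriced call should be loud, not free.
    throw new Error(`no rate configured for model "${modelId}"`);
  }
  const weighted =
    usage.input * rate.input +
    usage.cached * rate.cached +
    usage.output * rate.output;
  return weighted / 1_000_000; // USD for this single call
}

console.log(price("model-a", { input: 1000, cached: 500, output: 200 }));
```

Throwing on an unknown model is a deliberate choice in this sketch: a call silently priced at zero would quietly undercount the total.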

The receipt is data, not a PDF

Here is where OBTO's house style earned its keep. On this platform an app is data and tools are data, so a receipt is data too: a record you can query, not a document you screenshot. A finished run lands as something close to this:

{
  "run_id": "run_8f3a",
  "workflow": "invoice-triage",
  "status": "completed",
  "steps": 14,
  "tokens": { "input": 182400, "cached": 121000, "output": 9300 },
  "tool_calls": [
    { "tool": "lookup_account", "ms": 410, "result_tokens": 2900 },
    { "tool": "post_update",    "ms": 220, "result_tokens": 140 }
  ],
  "cost": { "llm": 0.0291, "tools": 0.0040, "total": 0.0331 }
}

Once a run is a record, the hard questions get short answers. What did invoice-triage cost yesterday? Which step ate the tokens? Which tool keeps returning a 3,000-token payload the model never reads? Those stop being archaeology and become a query you run while the work is still warm. Cost tracking and observability turned out to be the same plumbing read from two directions.
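
To make "a query you run" concrete, here is a small sketch over an in-memory array of receipts. The first sample record mirrors the one above; the second record and both function names are made up for this example. Filtering by day would need a timestamp field, which the sample record doesn't show, so this version filters by workflow only.

```javascript
// Sample records: the first mirrors the receipt above, the second is invented.
const sampleReceipts = [
  {
    run_id: "run_8f3a",
    workflow: "invoice-triage",
    tool_calls: [
      { tool: "lookup_account", ms: 410, result_tokens: 2900 },
      { tool: "post_update", ms: 220, result_tokens: 140 },
    ],
    cost: { llm: 0.0291, tools: 0.004, total: 0.0331 },
  },
  {
    run_id: "run_demo",
    workflow: "invoice-triage",
    tool_calls: [{ tool: "lookup_account", ms: 395, result_tokens: 3100 }],
    cost: { llm: 0.018, tools: 0.002, total: 0.02 },
  },
];

// "What did this workflow cost?"
function totalCost(receipts, workflow) {
  return receipts
    .filter((r) => r.workflow === workflow)
    .reduce((sum, r) => sum + r.cost.total, 0);
}

// "Which tool returns the most tokens?" Returns [tool, tokens] pairs, heaviest first.
function toolsByResultTokens(receipts) {
  const totals = new Map();
  for (const r of receipts) {
    for (const call of r.tool_calls) {
      totals.set(call.tool, (totals.get(call.tool) ?? 0) + call.result_tokens);
    }
  }
  return [...totals.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(totalCost(sampleReceipts, "invoice-triage").toFixed(4));
console.log(toolsByResultTokens(sampleReceipts));
```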

Failures get receipts too

The runs you most want a receipt for are the ugly ones. An agent that loops on a flaky API and burns real tokens before failing produces the most expensive minute of your week, and it's exactly the run that an average quietly buries. So failed runs print a full receipt: same itemization, plus the retry count and the wall they hit. The slip that explains a disaster is usually the one that prevents the next.
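
One way to picture a failed-run receipt is a retry helper that itemizes every attempt and, when it runs out of attempts, still hands back a receipt. The sketch below is an assumption about shape, not OBTO's implementation: `runWithReceipt`, the `retries` and `wall` fields, and the sample tool are all hypothetical. It retries immediately with no backoff and expects `maxAttempts` to be at least 1.

```javascript
// Hypothetical helper: run a tool with retries and itemize every attempt.
async function runWithReceipt(tool, args, maxAttempts = 3) {
  const receipt = { tool: tool.name, status: "running", attempts: [], retries: 0, wall: null };
  let lastError = null;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const t0 = Date.now();
    try {
      const result = await tool.run(args);
      receipt.attempts.push({ attempt, status: "ok", ms: Date.now() - t0 });
      receipt.status = "completed";
      receipt.retries = attempt - 1;
      return { result, receipt };
    } catch (err) {
      lastError = err;
      receipt.attempts.push({
        attempt,
        status: "failed",
        ms: Date.now() - t0,
        error: String(err?.message ?? err),
      });
    }
  }

  // Out of attempts: the receipt still comes back, now naming the wall it hit.
  receipt.status = "failed";
  receipt.retries = maxAttempts - 1;
  receipt.wall = String(lastError?.message ?? lastError);
  return { result: null, receipt };
}

// A tool that never recovers, to show what the failure slip looks like.
const alwaysDown = {
  name: "flaky_api",
  run: async () => {
    throw new Error("503 from upstream");
  },
};

runWithReceipt(alwaysDown, {}).then(({ receipt }) => {
  console.log(JSON.stringify(receipt, null, 2));
});
```

Returning the receipt instead of throwing is a choice made for this sketch. A real wrapper might rethrow after recording, as long as the record is written first.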

What it cost us to build

None of this is free, and pretending otherwise would be off-brand. A receipt for every run means the metering itself has to be cheap enough to leave on forever, so the hot path does append-only writes and defers the heavy aggregation. Verbose tools were the real headache: a query that returns a few hundred rows can outweigh the rest of a run, so we record payload sizes everywhere and keep the full blobs only where they earn it. The honest tension was never whether to trace. It was how granular to go, and we kept erring toward more.
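
A minimal sketch of that split between a cheap hot path and deferred aggregation might look like the following. The `ReceiptLog` class, the in-memory array standing in for durable storage, and the 2048-byte retention threshold are all assumptions for illustration. In particular, "keep small blobs, drop large ones" is only one possible reading of keeping blobs "where they earn it".

```javascript
// Hypothetical retention rule: keep the full payload only when it is small.
const KEEP_BLOB_UNDER_BYTES = 2048;

class ReceiptLog {
  constructor() {
    this.entries = []; // stands in for an append-only store
  }

  // Hot path: measure, push, return. No sorting, summing, or rewriting.
  append(runId, tool, payload) {
    const text = JSON.stringify(payload);
    const bytes = new TextEncoder().encode(text).length; // UTF-8 size
    this.entries.push({
      runId,
      tool,
      bytes, // the size is always recorded
      blob: bytes < KEEP_BLOB_UNDER_BYTES ? text : null, // the body only sometimes
    });
  }

  // Deferred aggregation: run later, off the hot path.
  bytesByTool() {
    const totals = new Map();
    for (const entry of this.entries) {
      totals.set(entry.tool, (totals.get(entry.tool) ?? 0) + entry.bytes);
    }
    return totals;
  }
}

const log = new ReceiptLog();
log.append("run_demo", "post_update", { ok: true });
log.append("run_demo", "lookup_account", { rows: "x".repeat(5000) });
console.log([...log.bytesByTool().entries()]);
```

Because the size is recorded even when the blob is dropped, a verbose tool still shows up in the totals.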

Why it ended up at the center

The surprise was that three problems we had been treating separately are one problem wearing three hats. Finance asks what a run cost. Engineering asks why it did that. Compliance asks what it touched. A single receipt answers all three, which is also why Glass Box tracing ships on the free Builder tier. Transparency you have to upgrade to unlock isn't really transparency.

If you've ever read your AI spend off an invoice and felt your stomach drop, that drop is the thing we built this to remove. You can put a real number on your first automated task in about ten minutes from the getting-started guide, and the receipt is there from run one. Describe it, ship it, and finally see exactly what it did.