
The Harness and the Substrate

How we built an agent runtime on a data-native platform — and why the platform is the point.

OBTO Engineering · Insights from the Glass Box

Most "AI agent" architectures are a model loop bolted onto infrastructure that was never designed for agents. You bring a framework for the reasoning loop, then assemble the rest yourself: somewhere to persist conversations, a way to stream partial output to a UI, a multi-tenancy story, a governance story for which model runs, a deployment target for whatever the agent produces. Five systems, five seams, five places for state to drift.

We took a different bet. At OBTO, the platform an agent runs on and the platform an agent builds on are the same substrate, and that substrate is data. This post walks through the architecture: the agent harness (especially its core), the platform underneath it, and where the two meet. It also separates what we've proven from what we've only designed for, because for this audience that line is the whole point.

The substrate: an app is data, not a build artifact

Start with the platform, because it's the part that makes the rest unusual.

On OBTO, the units of an application — server scripts, HTTP routes, pages, client components, stylesheets — are not files in a repository that get compiled, bundled, and shipped to a server. They are records in a database. A server module is a row. A route is a row. A 6,000-line frontend component is a row.

That sounds like a storage detail. It's actually the whole thesis, and it has three consequences that matter enormously when your primary developer is an agent.

1. Hot-reload is the default, not a dev-mode convenience. Application artifacts — server scripts, routes, the frontend itself — are resolved per request from the record store, so an edit takes effect on the next request: no rebuild, no bundle, no deploy pipeline. The change-to-live latency is a database write. (Platform-level modules that load once at process start — the MCP server itself, for one — are the exception; those still load on restart.) For a human team that's a convenience. For an agent iterating in a tight observe-edit-verify loop, it removes the single biggest source of cycle latency. We edited a 1,100-line model engine and a 6,000-line UI component this way, on the running system, while people were using it.

Editing live code with no build step sounds reckless, so here's what stands in for the compiler. Writes are record-level atomic — a request resolves either the old artifact or the new one, never a half-applied patch. Frontend components are syntax-validated on save, so a malformed edit is rejected before it persists. Server modules instantiate fresh per call, so a bad edit's blast radius is the requests in flight, and a revert is live on the very next one. Layered on top are the controls every change inherits: tenant- and owner-scoped writes, dual-mode rollout behind kill-switches, and one-flag rollback. A stronger pre-deploy correctness gate is defense-in-depth we're still hardening — additive, not the only thing between an edit and production.

2. Edits are surgical. Because an artifact is addressable, an agent patches it by line range instead of re-emitting the whole file — which also sidesteps a real failure mode of LLM-driven editing, where a model "rewrites" a large file and silently drops a function. You change the four lines you meant to change.

3. The application is queryable. The same property that makes code a record makes the shape of an app a graph you can query — dependencies, entry points, orphans, semantic search across the corpus. The agent doesn't guess at structure; it asks.
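The three consequences above can be sketched in a few lines. This is a toy, in-memory illustration and not OBTO's implementation: `ArtifactStore`, `handle`, and `orphans` are hypothetical names, Python's built-in `compile` stands in for whatever syntax validation the platform runs on save, and a plain dict stands in for the record store.

```python
class ArtifactStore:
    """Toy record store: each application artifact is one row keyed by name."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, name: str, body: str) -> None:
        # Stand-in for validate-on-save: malformed source raises SyntaxError
        # before anything is persisted.
        compile(body, name, "exec")
        # One assignment swaps the whole row, so a reader sees the old body
        # or the new one, never a half-applied patch.
        self._rows[name] = body

    def resolve(self, name: str) -> str:
        return self._rows[name]

    def patch(self, name: str, start: int, end: int, replacement: str) -> None:
        """Replace 1-indexed inclusive lines start..end; other lines are untouched."""
        lines = self._rows[name].split("\n")
        if not 1 <= start <= end <= len(lines):
            raise ValueError("line range out of bounds")
        patched = lines[: start - 1] + replacement.split("\n") + lines[end:]
        self.save(name, "\n".join(patched))


def handle(store: ArtifactStore, name: str, request: str) -> str:
    """Resolve the artifact for this request and instantiate it fresh."""
    namespace: dict = {}
    exec(store.resolve(name), namespace)
    return namespace["handler"](request)


def orphans(deps: dict[str, set[str]], entry_points: set[str]) -> set[str]:
    """Artifacts that no entry point reaches through the dependency graph."""
    reachable: set[str] = set()
    stack = list(entry_points)
    while stack:
        node = stack.pop()
        if node not in reachable:
            reachable.add(node)
            stack.extend(deps.get(node, ()))
    return set(deps) - reachable


store = ArtifactStore()
store.save("greet", "def handler(req):\n    prefix = 'v1'\n    return prefix + ':' + req")
before = handle(store, "greet", "x")       # 'v1:x'
store.patch("greet", 2, 2, "    prefix = 'v2'")
after = handle(store, "greet", "x")        # 'v2:x', live on the next request
```

The patch touches one line and goes through the same validated save path as a full write, so a malformed patch is rejected and the previous row keeps serving.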

MCP as the front door, and a deliberately stateless contract

The platform exposes itself to agents through MCP (the Model Context Protocol). The decision worth calling out is that the contract is stateless: there is no server-side "currently active app." Every operation carries its own tenant identity — app and domain — explicitly, on every call.

This is a deliberate concurrency choice. With no per-conversation mutable session on the server, two conversations hitting the same server share no state that one could flip out from under the other. That removes an entire class of cross-conversation drift bug, one we did hit on an older stateful path, where a "set active app" in one conversation could bleed into another. To be precise: this is the design that removes that bug class, not a claim that every concurrent-access corner is now impossible. The stress testing that would prove the edges is on the list at the end. A side benefit: it tends to keep smaller, cheaper models on the rails, because correctness never depends on the model recalling hidden state; the target is always named in the call. (An impression from building, not a benchmarked result.)
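A minimal sketch of what a stateless contract looks like from the caller's side. The `app` and `domain` fields come from the description above; the `Target` and `StatelessServer` classes, the method names, and the example values are hypothetical and are not the platform's MCP tool signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Tenant identity carried explicitly on every call."""
    app: str
    domain: str


class StatelessServer:
    """Hypothetical server with no 'currently active app' field at all."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, str, str], str] = {}

    def write(self, target: Target, name: str, body: str) -> None:
        self._records[(target.domain, target.app, name)] = body

    def read(self, target: Target, name: str) -> str:
        return self._records[(target.domain, target.app, name)]


server = StatelessServer()
conv_a = Target(app="billing", domain="acme.example")
conv_b = Target(app="support", domain="acme.example")

# Interleaved calls from two conversations. There is no server-side selection
# for either one to change, so neither can redirect the other's writes.
server.write(conv_a, "config", "billing settings")
server.write(conv_b, "config", "support settings")
```

Because the key is built from the call's own arguments, the older failure mode (one conversation's "set active app" redirecting another's write) has nothing to act on in this shape.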

Configuration follows the same philosophy. Models, routing policy, feature flags, and kill-switches live in tenant-scoped properties that hot-reload — change the default model, or disable a capability platform-wide, without shipping code.
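As an illustration only, configuration read per turn might look like the following. The `Properties` class, the property key names, and the model names are all made up for this sketch; the point is just that nothing is cached, so a property write changes behaviour on the next call.

```python
from typing import Any


class Properties:
    """Toy tenant-scoped property store (hypothetical; keys are invented)."""

    def __init__(self) -> None:
        self._values: dict[tuple[str, str], Any] = {}

    def set(self, tenant: str, key: str, value: Any) -> None:
        self._values[(tenant, key)] = value

    def get(self, tenant: str, key: str, default: Any) -> Any:
        return self._values.get((tenant, key), default)


def run_turn(props: Properties, tenant: str, capability: str) -> str:
    # Properties are read on every turn rather than at startup.
    if not props.get(tenant, f"capability.{capability}.enabled", True):
        return "refused"
    return props.get(tenant, "model.default", "model-a")


props = Properties()
first = run_turn(props, "acme", "artifacts")            # default model
props.set("acme", "model.default", "model-b")
second = run_turn(props, "acme", "artifacts")           # new default, no deploy
props.set("acme", "capability.artifacts.enabled", False)
third = run_turn(props, "acme", "artifacts")            # kill-switch wins
```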

The harness core: wrap and tee

Now the agent runtime — and the piece that carries the most weight here, because it's where naive designs rot.

The obvious way to add persistence, structured events, and governance to a model loop is to thread them through the loop: one function that runs the model and writes the database and formats the UI stream and enforces policy. Now a bug in persistence can stall the model, and a bug in the event formatter can take down the user's chat.

We don't thread concerns through the loop; the harness wraps the loop and composes them around it. The difference is one of ordering: the naive loop writes the database in the same step that yields a token to the user; ours yields to the user first, then copies the frame to everything else.

[Figure: the wrap-and-tee agent harness. A user request enters the agent harness (wrap + tee), which wraps the model engine (LangGraph). The tee sends the output to the user first as the user stream (verbatim, unbuffered, on the critical path). A copy feeds the failure-isolated side-channel: event adapter (frames to events), event envelope (versioned, typed), persistence (append-only, scoped), and UI events (history, tokens, replay). The side-channel may throw, stall, or fail without touching the user stream.]
The wrap-and-tee harness: the user stream is verbatim and on the critical path; a copy feeds a failure-isolated side-channel.

One principle holds it together: observation is a side-channel, never in the critical path. It's how we added history, token accounting, and replay to a live chat without putting any of them between the model and the user. It's also how we shipped it: the harness runs in raw passthrough (bytes verbatim) or structured mode, switchable per request and killable globally — so the new path proved itself in production beside the old one before becoming the default.
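The ordering described above (user first, copy second, observer failures swallowed) can be sketched as a generator. This is a simplified, synchronous illustration under stated assumptions: `tee_stream` and the observer callables are hypothetical names, and it only isolates observers that throw. Isolating an observer that stalls would need the copy handed to a queue or a separate task, which is omitted here.

```python
from typing import Callable, Iterable, Iterator


def tee_stream(
    frames: Iterable[str],
    observers: list[Callable[[str], None]],
) -> Iterator[str]:
    """Yield each frame to the user verbatim, then copy it to the observers."""
    for frame in frames:
        # Critical path: the user gets the frame before anything else happens.
        yield frame
        # Side-channel: runs after the yield, and an observer that raises is
        # contained here instead of propagating into the user stream.
        for observe in observers:
            try:
                observe(frame)
            except Exception:
                pass  # a real harness would log or count this, off the critical path


def broken_persistence(frame: str) -> None:
    raise RuntimeError("database unavailable")


history: list[str] = []
user_output = list(tee_stream(["Hel", "lo"], [broken_persistence, history.append]))
```

Even with one observer failing on every frame, the user stream is complete and the healthy observer still records it. One subtlety of the generator form: the copy for a frame runs when the consumer asks for the next one, so a consumer that stops early leaves the last frame unobserved.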

A stable event contract, so the UI doesn't marry the model runtime

Raw output from a modern agent runtime (we build on LangGraph) is a stream of framework-internal channels — message deltas, node updates, tool-call fragments, interrupts — and it shifts as the framework evolves. If your frontend parses that directly, every framework upgrade can break your UI.

So the harness's adapter translates raw runtime frames into a small, versioned event envelope: a typed vocabulary of run / message / reasoning / tool / diff / approval / usage events, each with a stable id, sequence number, and payload. The UI consumes that. Swap frameworks, add a provider, change runtimes — as long as the adapter still emits the contract, the frontend never notices. It's the discipline of an API boundary, applied between the agent runtime and everything downstream of it.
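A hedged sketch of the adapter idea follows. The event vocabulary and the id / sequence / payload fields are from the text; everything else is assumed for illustration. In particular, the raw frame shape (`channel`, `data`) and the channel names are invented and are not LangGraph's actual stream format.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator

SCHEMA_VERSION = 1

# Hypothetical raw channel names mapped onto the contract's event types.
CHANNEL_TO_TYPE = {
    "message_delta": "message",
    "tool_call": "tool",
    "token_usage": "usage",
}


@dataclass(frozen=True)
class Envelope:
    version: int
    id: str
    seq: int
    type: str
    payload: dict[str, Any]


def adapt(run_id: str, raw_frames: Iterable[dict[str, Any]]) -> Iterator[Envelope]:
    """Translate runtime-internal frames into versioned, typed envelopes."""
    seq = 0
    for frame in raw_frames:
        event_type = CHANNEL_TO_TYPE.get(frame.get("channel"))
        if event_type is None:
            continue  # unknown internal channels never leak to the UI
        seq += 1
        yield Envelope(
            version=SCHEMA_VERSION,
            id=f"{run_id}:{seq}",  # derived id, so a replay yields the same ids
            seq=seq,
            type=event_type,
            payload=frame.get("data", {}),
        )


raw = [
    {"channel": "message_delta", "data": {"text": "hi"}},
    {"channel": "framework_internal"},
    {"channel": "token_usage", "data": {"total": 12}},
]
events = list(adapt("run1", raw))
```

If the runtime renames or adds channels, only `CHANNEL_TO_TYPE` and the adapter change; consumers of `Envelope` see the same shape.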

The contract in practice: agent-generated artifacts

A contract earns its keep the first time you extend the system without touching its core. Here's a recent example.

We wanted agents to produce real deliverables — a spreadsheet, a Word document, a slide deck, a PDF — that a user can both download and see in the workbench. The naive version returns the file's bytes as tool output, back through the model loop. That's the exact mistake the harness exists to prevent: a multi-megabyte base64 blob re-entering the model's context on the next turn, paid for in tokens, for a file the model already finished producing.

So the artifact never travels the reasoning path. The generation tool builds the file server-side and hands it to the harness's side-channel — the same tee that carries observation. The model gets back only a one-line receipt ("created sales.xlsx, 3 rows — delivered"); the bytes ride a frontend-only event the model never sees. Adding the whole capability was, almost entirely, adding one event type: the adapter maps a new artifact frame to a versioned artifact.ready envelope, and the UI renders it. The model loop, the user stream, and persistence were untouched. A new capability became a new event, not a new seam — which is the entire promise of the contract, collected in one feature.
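The receipt-versus-bytes split can be illustrated with a small tool function. This is a sketch, not the platform's tool: `create_spreadsheet` and `emit_ui_event` are hypothetical names, CSV stands in for a real spreadsheet format to keep the example dependency-free, and the event dict only borrows the `artifact.ready` type name from the text.

```python
import csv
import io
from typing import Any, Callable


def create_spreadsheet(
    name: str,
    rows: list[list[Any]],
    emit_ui_event: Callable[[dict[str, Any]], None],
) -> str:
    """Build the file, send the bytes down the side-channel, return a receipt."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    data = buffer.getvalue().encode("utf-8")

    # Frontend-only event: the bytes go to the UI and never re-enter the model loop.
    emit_ui_event({"type": "artifact.ready", "name": name, "bytes": data})

    # The only thing the model sees as tool output is this one-line receipt.
    return f"created {name}, {len(rows)} rows, delivered"


ui_events: list[dict[str, Any]] = []
receipt = create_spreadsheet(
    "sales.csv",
    [["region", "total"], ["east", 10], ["west", 7]],
    ui_events.append,
)
```

The receipt stays a fixed, short string however large the file grows, which is what keeps the artifact's size out of the next turn's token cost.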

The rendering side made the substrate point a second time. The chat panel and the code editor in our workbench are independent components with no shared store. Rather than wire one into the other's internals, the chat asks the editor to open a viewer tab by emitting a decoupled browser event; the editor, which already owned a tab strip, listens for it. Same discipline as the harness — talk over a contract, don't marry internals — one layer up, in the UI.

PDF was the cleanest illustration. We didn't bundle a PDF engine into the agent tool. The platform already had a hardened HTML-to-PDF service — a server script wrapping a managed headless-Chrome lifecycle, in production since 2020 — so the new tool simply delegates to it. On a data-native platform, a capability some other part of the system already has is one cross-script call away, not a dependency to re-import and re-harden.

And one honest limit, because it's the instructive part. A browser renders PDF natively, so that preview is the real, pixel-accurate file. It cannot natively render .docx or .pptx — there is no built-in viewer, and we don't run a server-side Office converter — so those previews are faithful structured outlines, with the true file one click away on the download button. We render what the medium renders honestly, and we don't fake the rest.

Governed model routing

Which model runs is a governed profile, not a hardcoded string. A registry (backed by a hot-reloading property) defines the available models and their declared capabilities, and the router enforces a couple of invariants, fail-closed:

Declared capability is still a promise a provider can break. We hit one that advertised tool-calling and then returned empty tool calls — capability gating trusted the declaration and couldn't catch it. What caught it was a different lever: provider routing, which pins a logical model to specific upstream providers. We rerouted to a known-good provider declaratively, with no code change. (Token usage is reported differently by each provider; the router normalizes it to one internal shape, with explicit estimated / exact / not-available attribution instead of invented numbers.)

The point of the layer: you cannot accidentally run an ungoverned model, one missing a required capability, or one you have no credentials for. The router refuses rather than degrades.
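A minimal sketch of "refuses rather than degrades", covering the three refusals named above. The registry contents, model names, provider names, and the `route` signature are hypothetical; the real registry is described only as a hot-reloading property.

```python
class RoutingRefused(Exception):
    """Raised instead of falling back to some other model."""


# Hypothetical registry: logical model -> declared capabilities and pinned provider.
REGISTRY = {
    "fast": {"capabilities": {"chat"}, "provider": "provider-a"},
    "agentic": {"capabilities": {"chat", "tools"}, "provider": "provider-b"},
}


def route(model: str, required: set[str], credentials: set[str]) -> str:
    """Return the provider to call, or refuse. There is no silent fallback."""
    profile = REGISTRY.get(model)
    if profile is None:
        raise RoutingRefused(f"{model!r} is not a governed model")
    missing = required - profile["capabilities"]
    if missing:
        raise RoutingRefused(f"{model!r} lacks required capabilities: {sorted(missing)}")
    if profile["provider"] not in credentials:
        raise RoutingRefused(f"no credentials for {profile['provider']!r}")
    return profile["provider"]


provider = route("agentic", {"tools"}, credentials={"provider-b"})
```

Note what this cannot catch: a provider whose declared `tools` capability is wrong still passes the check, which is the gap the provider-pinning lever described above covers.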

Correctness under concurrency

Persistence is where "works in the demo" quietly becomes "corrupts under load," so we built for concurrency up front rather than retrofitting it.

These are correct by design and covered by deterministic tests. The caveat we'll state plainly: those tests are single-threaded. We have a concurrency stress harness built but not yet run — so "safe under concurrent append" is a property we've reasoned about and indexed for, not one we've exercised under real contention. That's on the list below.

Parallel agents — what it is, and what it isn't

The headline people want from an agent platform is "run many agents at once." We built it, and we want to be exact about its shape.

The main agent can delegate to parallel subagents through a single tool call. They fan out, do read-only investigation concurrently, and report structured results back, which surface in the UI as nested cards under the delegating step. Each subagent's work is receipted against the parent run.

The useful lesson came from getting it wrong first. Our initial version handed every child a handle to the parent's single platform connection. Under real parallelism it deadlocked — the children just hung, no error. Sharing one stateful client across concurrent consumers is a classic way to deadlock; the fix was to give each child its own isolated connection. It's the kind of bug a single-threaded test never surfaces and concurrency reveals on the first try.

Two boundaries, stated plainly. This is fan-out read, not parallel write: we don't yet let multiple children mutate shared state at once, which needs isolation we haven't built. And it's verified only at small fan-out, in a live test rather than a load test. The architecture is built to widen (isolated connections, per-child timeouts, independent receipts), but "built to widen" is a design claim until the load tests say otherwise.
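The shape of the fix (one connection per child, a timeout per child, a structured result per child) can be sketched with `asyncio`. The `Connection` class is a hypothetical stand-in for a platform client, and the result dict is an invented receipt shape; the sketch is read-only, matching the boundary stated above.

```python
import asyncio
from typing import Any


class Connection:
    """Hypothetical platform client. Each child constructs its own instance."""

    def __init__(self, owner: str, delay: float = 0.0) -> None:
        self.owner = owner
        self.delay = delay

    async def read(self, query: str) -> str:
        await asyncio.sleep(self.delay)  # simulated I/O
        return f"{self.owner}:{query}"


async def run_child(name: str, query: str, timeout: float, delay: float = 0.0) -> dict[str, Any]:
    conn = Connection(name, delay)  # isolated: never the parent's connection
    try:
        result = await asyncio.wait_for(conn.read(query), timeout)
        return {"child": name, "ok": True, "result": result}
    except asyncio.TimeoutError:
        # A slow child becomes a failed receipt; it does not hang its siblings.
        return {"child": name, "ok": False, "result": None}


async def fan_out(queries: dict[str, str], timeout: float = 1.0) -> list[dict[str, Any]]:
    """Run all children concurrently; results come back in submission order."""
    return await asyncio.gather(
        *(run_child(name, query, timeout) for name, query in queries.items())
    )


receipts = asyncio.run(fan_out({"routes": "list routes", "scripts": "list scripts"}))
```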

Skills: authority you can't grant by accident

Like the model router, skills are fail-closed: they subtract authority, never add it. A skill is a packaged behavior contract — a prompt fragment plus a tool policy (deny / allow / prefer) — and it reshapes a turn by filtering the already-authorized tool set before the tools are ever offered to the model. The base set is the security boundary; a skill can narrow and reorder within it, but it can't conjure a tool the agent wasn't already permitted. A "read-only" skill strips the write tools; it cannot grant one. That keeps the authorization story in one place and makes skills safe to compose.
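One way to read "subtract, never add" as code. The deny / allow / prefer fields are from the text, but their exact semantics here (deny removes, allow intersects, prefer reorders) are an assumption, and `apply_skill` is a hypothetical name.

```python
from typing import Iterable, Optional


def apply_skill(
    authorized: list[str],
    deny: Iterable[str] = (),
    allow: Optional[Iterable[str]] = None,
    prefer: Iterable[str] = (),
) -> list[str]:
    """Filter and reorder the already-authorized tools; never add to them."""
    denied = set(deny)
    tools = [t for t in authorized if t not in denied]
    if allow is not None:
        allowed = set(allow)
        # Intersection only: naming an unauthorized tool in 'allow' grants nothing.
        tools = [t for t in tools if t in allowed]
    preferred = [t for t in prefer if t in tools]
    rest = [t for t in tools if t not in preferred]
    return preferred + rest


base = ["read_record", "search", "write_record"]
read_only = apply_skill(base, deny=["write_record"], prefer=["search"])
attempted_grant = apply_skill(base, allow=["read_record", "delete_app"])
```

Every returned tool is drawn from `authorized`, so composing two skills (applying one to the other's output) can only narrow the set further.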

Why this is positioned differently

Stack the layers up and the differentiation is structural, not a feature list:

The payoff is a single typed interface an agent uses to think, edit, and deploy. Because application artifacts resolve per request, the loop from "decide" to "live" is one write. The runtime, the build target, and the data are the same thing. That's the part we haven't found a shortcut to imitating, because it isn't a feature; it's the substrate.

What's proven, and what's next

We'll close by separating what we've earned from what we've only engineered toward.

Proven, in production:

Designed for, but not yet measured:

Being specific about that second list is what earns the first. The architecture is real, it's running, and it's shaped for where we're taking it. We'll publish the load numbers when we have them.

— The OBTO engineering team
