The Harness and the Substrate

How we built an agent runtime on a data-native platform — and why the platform is the point.

OBTO Engineering · Insights from the Glass Box

Most "AI agent" architectures are a model loop bolted onto infrastructure that was never designed for agents. You bring a framework for the reasoning loop, then assemble the rest yourself: somewhere to persist conversations, a way to stream partial output to a UI, a multi-tenancy story, a governance story for which model runs, a deployment target for whatever the agent produces. Five systems, five seams, five places for state to drift.

We took a different bet. At OBTO, the platform an agent runs on and the platform an agent builds on are the same substrate — and that substrate is data. This post walks the architecture: the agent harness (especially its core), the platform underneath it, and where the two meet. It also separates what we've proven from what we've only designed for, because for this audience that line is the whole point.

The substrate: an app is data, not a build artifact

Start with the platform, because it's the part that makes the rest unusual.

On OBTO, the units of an application — server scripts, HTTP routes, pages, client components, stylesheets — are not files in a repository that get compiled, bundled, and shipped to a server. They are records in a database. A server module is a row. A route is a row. A 6,000-line frontend component is a row.

That sounds like a storage detail. It's actually the whole thesis, and it has three consequences that matter enormously when your primary developer is an agent.

1. Hot-reload is the default, not a dev-mode convenience. Application artifacts (server scripts, routes, the frontend itself) are resolved per request from the record store, so an edit takes effect on the next request: no rebuild, no bundle, no deploy pipeline. The change-to-live latency is a database write. (Platform-level modules that load once at process start, the MCP server itself for one, are the exception; changes to those take effect only on restart.) For a human team that's a convenience. For an agent iterating in a tight observe-edit-verify loop, it removes the single biggest source of cycle latency. We edited a 1,100-line model engine and a 6,000-line UI component this way, on the running system, while people were using it.

Editing live code with no build step sounds reckless, so here's what stands in for the compiler. Writes are record-level atomic — a request resolves either the old artifact or the new one, never a half-applied patch. Frontend components are syntax-validated on save, so a malformed edit is rejected before it persists. Server modules instantiate fresh per call, so a bad edit's blast radius is the requests in flight, and a revert is live on the very next one. Layered on top are the controls every change inherits: tenant- and owner-scoped writes, dual-mode rollout behind kill-switches, and one-flag rollback. A stronger pre-deploy correctness gate is defense-in-depth we're still hardening — additive, not the only thing between an edit and production.

2. Edits are surgical. Because an artifact is addressable, an agent patches it by line range instead of re-emitting the whole file — which also sidesteps a real failure mode of LLM-driven editing, where a model "rewrites" a large file and silently drops a function. You change the four lines you meant to change.

3. The application is queryable. The same property that makes code a record makes the shape of an app a graph you can query — dependencies, entry points, orphans, semantic search across the corpus. The agent doesn't guess at structure; it asks.
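To make the first two consequences concrete, here is a minimal sketch, assuming a plain in-memory dictionary as the record store. `RecordStore`, `patch_lines`, and `handle_request` are hypothetical names for illustration, not the platform's API. It shows an artifact being resolved fresh on each request and patched by line range.

```python
from dataclasses import dataclass, field


@dataclass
class RecordStore:
    """Hypothetical in-memory stand-in for the platform's record store."""
    rows: dict[str, str] = field(default_factory=dict)

    def write(self, artifact_id: str, source: str) -> None:
        # The whole row is swapped in one assignment, so a reader sees
        # either the old artifact or the new one.
        self.rows[artifact_id] = source

    def patch_lines(self, artifact_id: str, start: int, end: int,
                    new_lines: list[str]) -> None:
        """Replace lines start..end (1-indexed, inclusive); leave the rest alone."""
        lines = self.rows[artifact_id].split("\n")
        if not (1 <= start <= end <= len(lines)):
            raise ValueError("line range out of bounds")
        lines[start - 1:end] = new_lines
        self.write(artifact_id, "\n".join(lines))


def handle_request(store: RecordStore, artifact_id: str, arg: int) -> int:
    """Resolve the artifact from the store on every call, then run it."""
    namespace: dict = {}
    exec(store.rows[artifact_id], namespace)  # fresh instantiation per request
    return namespace["handler"](arg)


store = RecordStore()
store.write("scale", "def handler(x):\n    factor = 2\n    return x * factor")

before = handle_request(store, "scale", 5)
store.patch_lines("scale", 2, 2, ["    factor = 3"])  # edit one line, live
after = handle_request(store, "scale", 5)
```

The second request picks up the patched line with no rebuild step in between, and the lines outside the patched range are byte-for-byte what they were.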

MCP as the front door, and a deliberately stateless contract

The platform exposes itself to agents through MCP (the Model Context Protocol). The decision worth calling out is that the contract is stateless: there is no server-side "currently active app." Every operation carries its own tenant identity — app and domain — explicitly, on every call.

This is a deliberate concurrency choice. With no per-conversation mutable session on the server, two conversations hitting the same server share no state that one could flip out from under the other. That removes an entire class of cross-conversation drift bug, one we did hit on an older stateful path, where a "set active app" in one conversation could bleed into another. To be precise: the design removes that bug class; it is not a claim that every concurrent-access corner is now impossible, and the stress testing that would prove the edges is on the list at the end. A side benefit: it tends to keep smaller, cheaper models on the rails, because correctness never depends on the model recalling hidden state; the target is always named in the call. (An impression from building, not a benchmarked result.)
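A stateless contract of this shape can be sketched in a few lines. The names here (`Target`, `StatelessServer`, `write_artifact`, `read_artifact`) are assumptions for illustration and not the actual MCP tool surface. The point is only that the server holds no "active app" field, so every call names its tenant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Tenant identity carried explicitly on every call."""
    domain: str
    app: str


class StatelessServer:
    """Hypothetical server: no per-conversation session, no 'active app'."""

    def __init__(self) -> None:
        self._artifacts: dict[tuple[str, str, str], str] = {}

    def write_artifact(self, target: Target, name: str, source: str) -> None:
        self._artifacts[(target.domain, target.app, name)] = source

    def read_artifact(self, target: Target, name: str) -> str:
        return self._artifacts[(target.domain, target.app, name)]


server = StatelessServer()
billing = Target(domain="acme", app="billing")
support = Target(domain="acme", app="support")

# Two conversations interleave calls against the same server.
server.write_artifact(billing, "routes", "GET /invoices")
server.write_artifact(support, "routes", "GET /tickets")
billing_routes = server.read_artifact(billing, "routes")
support_routes = server.read_artifact(support, "routes")
```

Because neither conversation can change what the other's calls resolve to, interleaving order does not matter in this sketch.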

Configuration follows the same philosophy. Models, routing policy, feature flags, and kill-switches live in tenant-scoped properties that hot-reload — change the default model, or disable a capability platform-wide, without shipping code.

The harness core: wrap and tee

Now the agent runtime — and the piece that carries the most weight here, because it's where naive designs rot.

The obvious way to add persistence, structured events, and governance to a model loop is to thread them through the loop: one function that runs the model and writes the database and formats the UI stream and enforces policy. Now a bug in persistence can stall the model, and a bug in the event formatter can take down the user's chat.

We don't thread concerns through the loop; the harness composes it. The difference is one of ordering — the naive loop writes the database on the same line it yields a token to the user; ours yields to the user first, then copies the frame to everything else.

Figure: The wrap-and-tee harness. The user stream is verbatim and on the critical path; a copy feeds a failure-isolated side-channel.

The harness wraps the existing model engine, injects a governed model, and returns a tee over its output stream. The user's bytes flow straight through, untouched.
A copy of that stream feeds the side-channel: frames reassembled, mapped by an adapter into structured events, persisted.
The observation path is failure-isolated by construction — it sits downstream of the tee, so if the adapter throws or persistence is down, observability degrades and the product does not.

One principle holds it together: observation is a side-channel, never in the critical path. It's how we added history, token accounting, and replay to a live chat without putting any of them between the model and the user. It's also how we shipped it: the harness runs in raw passthrough (bytes verbatim) or structured mode, switchable per request and killable globally — so the new path proved itself in production beside the old one before becoming the default.
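The ordering described above can be sketched with a plain generator. This is a minimal illustration under assumptions, not the harness itself: `wrap_and_tee` and `flaky_observer` are hypothetical names, and the observer stands in for the adapter and persistence path.

```python
from typing import Callable, Iterable, Iterator


def wrap_and_tee(
    model_stream: Iterable[bytes],
    observe: Callable[[bytes], None],
) -> Iterator[bytes]:
    """Yield every frame to the user verbatim, then copy it to an observer.

    Any exception raised by the observer is swallowed, so a failure on the
    side-channel never interrupts the user stream.
    """
    for frame in model_stream:
        yield frame  # critical path: the user's bytes go first, untouched
        try:
            # Side-channel: runs when the consumer asks for the next frame.
            observe(frame)
        except Exception:
            # Observability degrades; a real system would count or log this.
            continue


persisted: list[bytes] = []


def flaky_observer(frame: bytes) -> None:
    """Simulates persistence failing on one frame."""
    if frame == b"llo":
        raise RuntimeError("persistence is down")
    persisted.append(frame)


user_bytes = b"".join(wrap_and_tee([b"he", b"llo", b"!"], flaky_observer))
```

The user receives every byte even though the observer failed on one frame; only the persisted copy has a gap. One limitation of this generator form: if the consumer stops reading early, the last frame it received is never observed, so a production version would need to handle stream close explicitly.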

A stable event contract, so the UI doesn't marry the model runtime

Raw output from a modern agent runtime (we build on LangGraph) is a stream of framework-internal channels — message deltas, node updates, tool-call fragments, interrupts — and it shifts as the framework evolves. If your frontend parses that directly, every framework upgrade can break your UI.

So the harness's adapter translates raw runtime frames into a small, versioned event envelope: a typed vocabulary of run / message / reasoning / tool / diff / approval / usage events, each with a stable id, sequence number, and payload. The UI consumes that. Swap frameworks, add a provider, change runtimes — as long as the adapter still emits the contract, the frontend never notices. It's the discipline of an API boundary, applied between the agent runtime and everything downstream of it.
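As a rough sketch of what such an envelope and adapter might look like: the field names, the version number, and the raw channel names below are all assumptions made for illustration. They are not LangGraph's actual stream format and not the real contract. Only the event vocabulary (run, message, reasoning, tool, diff, approval, usage) comes from the description above.

```python
from dataclasses import dataclass
from itertools import count
from typing import Any, Iterable, Iterator

CONTRACT_VERSION = 1  # hypothetical version number

# Hypothetical mapping from raw runtime channels to contract event types.
CHANNEL_TO_TYPE = {
    "message_delta": "message",
    "node_update": "run",
    "tool_call_fragment": "tool",
    "interrupt": "approval",
}


@dataclass(frozen=True)
class Event:
    version: int
    id: str        # stable id
    seq: int       # sequence number within the run
    type: str      # one of the contract's typed vocabulary
    payload: dict[str, Any]


def adapt(run_id: str, frames: Iterable[tuple[str, dict]]) -> Iterator[Event]:
    """Translate raw (channel, data) frames into versioned envelopes."""
    seq = count(1)
    for channel, data in frames:
        event_type = CHANNEL_TO_TYPE.get(channel)
        if event_type is None:
            continue  # unknown framework channel: never leaked to the UI
        n = next(seq)
        yield Event(CONTRACT_VERSION, f"{run_id}:{n}", n, event_type, data)


raw_frames = [
    ("message_delta", {"text": "Hel"}),
    ("framework_internal_debug", {"noise": True}),
    ("tool_call_fragment", {"name": "search"}),
]
events = list(adapt("run-42", raw_frames))
```

A framework upgrade that adds or renames channels changes only `CHANNEL_TO_TYPE` in this sketch; the `Event` shape the UI depends on stays fixed.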

The contract in practice: agent-generated artifacts

A contract earns its keep the first time you extend the system without touching its core. Here's a recent example.

We wanted agents to produce real deliverables — a spreadsheet, a Word document, a slide deck, a PDF — that a user can both download and see in the workbench. The naive version returns the file's bytes as tool output, back through the model loop. That's the exact mistake the harness exists to prevent: a multi-megabyte base64 blob re-entering the model's context on the next turn, paid for in tokens, for a file the model already finished producing.

So the artifact never travels the reasoning path. The generation tool builds the file server-side and hands it to the harness's side-channel — the same tee that carries observation. The model gets back only a one-line receipt ("created sales.xlsx, 3 rows — delivered"); the bytes ride a frontend-only event the model never sees. Adding the whole capability was, almost entirely, adding one event type: the adapter maps a new artifact frame to a versioned artifact.ready envelope, and the UI renders it. The model loop, the user stream, and persistence were untouched. A new capability became a new event, not a new seam — which is the entire promise of the contract, collected in one feature.
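The split between receipt and bytes can be illustrated with a small sketch. It assumes a CSV file as a standard-library stand-in for a real spreadsheet, and `create_spreadsheet` and `emit_to_frontend` are hypothetical names; only the `artifact.ready` event type and the one-line receipt come from the description above.

```python
import csv
import io
from typing import Callable


def create_spreadsheet(
    name: str,
    rows: list[list],
    emit_to_frontend: Callable[[dict], None],
) -> str:
    """Build the file server-side, send bytes sideways, return a receipt."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    payload = buffer.getvalue().encode("utf-8")

    # The bytes ride a frontend-only event the model never sees.
    emit_to_frontend({"type": "artifact.ready", "name": name, "bytes": payload})

    # Only this short string re-enters the model's context.
    return f"created {name}, {len(rows)} rows, delivered"


frontend_events: list[dict] = []
receipt = create_spreadsheet(
    "sales.csv",
    [["region", "total"], ["east", 10], ["west", 7]],
    frontend_events.append,
)
```

However large the file grows, the tool output the model pays tokens for stays one line long.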

The rendering side made the substrate point a second time. The chat panel and the code editor in our workbench are independent components with no shared store. Rather than wire one into the other's internals, the chat asks the editor to open a viewer tab by emitting a decoupled browser event; the editor, which already owned a tab strip, listens for it. Same discipline as the harness — talk over a contract, don't marry internals — one layer up, in the UI.

PDF was the cleanest illustration. We didn't bundle a PDF engine into the agent tool. The platform already had a hardened HTML-to-PDF service — a server script wrapping a managed headless-Chrome lifecycle, in production since 2020 — so the new tool simply delegates to it. On a data-native platform, a capability some other part of the system already has is one cross-script call away, not a dependency to re-import and re-harden.

And one honest limit, because it's the instructive part. A browser renders PDF natively, so that preview is the real, pixel-accurate file. It cannot natively render .docx or .pptx — there is no built-in viewer, and we don't run a server-side Office converter — so those previews are faithful structured outlines, with the true file one click away on the download button. We render what the medium renders honestly, and we don't fake the rest.

Governed model routing

Which model runs is a governed profile, not a hardcoded string. A registry (backed by a hot-reloading property) defines the available models and their declared capabilities, and the router enforces a couple of invariants, fail-closed:

Capability gating. If a turn needs tool-calling, the router resolves only a model whose profile declares tool-calling — and refuses otherwise. It does not try-and-see.
Credentials. A model with no usable key is unroutable by construction, not by runtime surprise.

Declared capability is still a promise a provider can break. We hit one that advertised tool-calling and then returned empty tool calls — capability gating trusted the declaration and couldn't catch it. What caught it was a different lever: provider routing, which pins a logical model to specific upstream providers. We rerouted to a known-good provider declaratively, with no code change. (Token usage is reported differently by each provider; the router normalizes it to one internal shape, with explicit estimated / exact / not-available attribution instead of invented numbers.)

The point of the layer: you cannot accidentally run an ungoverned model, one missing a required capability, or one you have no credentials for. The router refuses rather than degrades.
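A fail-closed router along these lines might look like the following sketch. `ModelProfile`, `RoutingRefused`, `route`, and the model names are hypothetical; the two invariants (capability gating and credentials) are the ones described above.

```python
from dataclasses import dataclass
from typing import Optional


class RoutingRefused(Exception):
    """Raised instead of falling back to a model that does not qualify."""


@dataclass(frozen=True)
class ModelProfile:
    name: str
    capabilities: frozenset[str]
    api_key: Optional[str] = None


def route(
    registry: dict[str, ModelProfile],
    model_name: str,
    required: frozenset[str],
) -> ModelProfile:
    """Return a profile only if every invariant holds; otherwise refuse."""
    profile = registry.get(model_name)
    if profile is None:
        raise RoutingRefused(f"{model_name} is not in the registry")
    if not profile.api_key:
        raise RoutingRefused(f"{model_name} has no usable credentials")
    missing = required - profile.capabilities
    if missing:
        raise RoutingRefused(f"{model_name} lacks {sorted(missing)}")
    return profile


registry = {
    "planner": ModelProfile("planner", frozenset({"tool-calling"}), api_key="k1"),
    "drafter": ModelProfile("drafter", frozenset(), api_key="k2"),
    "keyless": ModelProfile("keyless", frozenset({"tool-calling"})),
}

chosen = route(registry, "planner", frozenset({"tool-calling"}))
```

Requesting `drafter` or `keyless` for a tool-calling turn raises `RoutingRefused` rather than trying the call and hoping. As noted above, this checks the declared capability only; it would not catch a provider that declares tool-calling and then returns empty tool calls.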

Correctness under concurrency

Persistence is where "works in the demo" quietly becomes "corrupts under load," so we built for concurrency up front rather than retrofitting it.

Tenant scoping is structural. Domain, app, and user are stamped onto every read and write — a query can't forget to scope itself, and a write that would cross a user boundary is rejected by an ownership check.
The conversation log is append-only with a monotonic sequence, protected by a unique index on (tenant, sequence): a colliding append fails atomically and the writer retries with the next number, so two concurrent appends can't silently clobber each other.

These are correct by design and covered by deterministic tests. The caveat we'll state plainly: those tests are single-threaded. We have a concurrency stress harness built but not yet run — so "safe under concurrent append" is a property we've reasoned about and indexed for, not one we've exercised under real contention. That's on the list below.
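The append-with-retry pattern can be sketched against SQLite, used here only as a convenient stand-in; the table layout and function names are assumptions, and the platform's actual store may differ. Like the tests described above, this sketch is single-threaded: it demonstrates that the unique index rejects a colliding sequence number, not behavior under real contention.

```python
import sqlite3


def open_log() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log (tenant TEXT NOT NULL, seq INTEGER NOT NULL, body TEXT NOT NULL)"
    )
    # The unique index turns a colliding append into an atomic failure.
    conn.execute("CREATE UNIQUE INDEX log_tenant_seq ON log (tenant, seq)")
    return conn


def append(conn: sqlite3.Connection, tenant: str, body: str, max_retries: int = 5) -> int:
    """Append with the next sequence number; on collision, re-read and retry."""
    for _ in range(max_retries):
        (last,) = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) FROM log WHERE tenant = ?", (tenant,)
        ).fetchone()
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO log (tenant, seq, body) VALUES (?, ?, ?)",
                    (tenant, last + 1, body),
                )
            return last + 1
        except sqlite3.IntegrityError:
            continue  # another writer took that number; try the next one
    raise RuntimeError("append failed after retries")


conn = open_log()
first = append(conn, "acme", "user: hello")
second = append(conn, "acme", "agent: hi")

# A stale writer that still believes the next number is 2 is rejected.
try:
    with conn:
        conn.execute("INSERT INTO log VALUES (?, ?, ?)", ("acme", 2, "stale write"))
    stale_rejected = False
except sqlite3.IntegrityError:
    stale_rejected = True

third = append(conn, "acme", "user: thanks")
```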

Parallel agents — what it is, and what it isn't

The headline people want from an agent platform is "run many agents at once." We built it, and we want to be exact about its shape.

The main agent can delegate to parallel subagents through a single tool call. They fan out, do read-only investigation concurrently, and report structured results back, which surface in the UI as nested cards under the delegating step. Each subagent's work is receipted against the parent run.

The useful lesson came from getting it wrong first. Our initial version handed every child a handle to the parent's single platform connection. Under real parallelism it deadlocked — the children just hung, no error. Sharing one stateful client across concurrent consumers is a classic way to deadlock; the fix was to give each child its own isolated connection. It's the kind of bug a single-threaded test never surfaces and concurrency reveals on the first try.

Two boundaries, stated plainly. First, this is fan-out read, not parallel write: we don't yet let multiple children mutate shared state at once, which needs isolation we haven't built. Second, it's verified only at small fan-out, in a live test rather than a load test. The architecture is built to widen (isolated connections, per-child timeouts, independent receipts), but "built to widen" is a design claim until the load tests say otherwise.
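The fix described above, one isolated connection per child, can be sketched with a thread pool. `Connection`, `run_child`, and `fan_out` are hypothetical names, and the connection here is a trivial stub rather than a real platform client.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import count
from typing import Callable

_connection_ids = count(1)


class Connection:
    """Hypothetical stateful client; one instance is never shared."""

    def __init__(self) -> None:
        self.id = next(_connection_ids)
        self.closed = False

    def query(self, question: str) -> str:
        return f"answer to {question!r}"

    def close(self) -> None:
        self.closed = True


def run_child(task: str, connect: Callable[[], Connection]) -> dict:
    """Each child opens, uses, and closes its own connection."""
    conn = connect()
    try:
        return {"task": task, "result": conn.query(task), "connection_id": conn.id}
    finally:
        conn.close()


def fan_out(
    tasks: list[str],
    connect: Callable[[], Connection] = Connection,
    timeout_s: float = 10.0,
) -> list[dict]:
    """Run read-only children concurrently; results keep the task order."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(run_child, task, connect) for task in tasks]
        # The timeout bounds how long the parent waits for each child;
        # it does not forcibly stop a child thread that is still running.
        return [future.result(timeout=timeout_s) for future in futures]


reports = fan_out(["list routes", "find orphans", "trace imports"])
```

Passing a connection factory instead of a connection is the whole design choice: the parent never holds a handle that two children could contend for.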

Skills: authority you can't grant by accident

Like the model router, skills are fail-closed: they subtract authority, never add it. A skill is a packaged behavior contract — a prompt fragment plus a tool policy (deny / allow / prefer) — and it reshapes a turn by filtering the already-authorized tool set before the tools are ever offered to the model. The base set is the security boundary; a skill can narrow and reorder within it, but it can't conjure a tool the agent wasn't already permitted. A "read-only" skill strips the write tools; it cannot grant one. That keeps the authorization story in one place and makes skills safe to compose.
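The subtract-only rule is easy to state in code. In this sketch, `ToolPolicy` and `apply_skill` are hypothetical names and the tool names are invented; the deny / allow / prefer policy shape is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolPolicy:
    deny: frozenset[str] = frozenset()
    allow: Optional[frozenset[str]] = None  # None means "no allow-list"
    prefer: tuple[str, ...] = ()


def apply_skill(authorized: list[str], policy: ToolPolicy) -> list[str]:
    """Filter and reorder the already-authorized tools; never add to them."""
    tools = [t for t in authorized if t not in policy.deny]
    if policy.allow is not None:
        tools = [t for t in tools if t in policy.allow]
    # Preferred tools move to the front, but only if they survived filtering.
    preferred = [t for t in policy.prefer if t in tools]
    rest = [t for t in tools if t not in preferred]
    return preferred + rest


authorized = ["read_file", "search", "write_file"]

read_only = ToolPolicy(
    deny=frozenset({"write_file"}),
    # "delete_app" was never authorized, so naming it here grants nothing.
    allow=frozenset({"read_file", "search", "delete_app"}),
    prefer=("search", "delete_app"),
)

offered = apply_skill(authorized, read_only)
```

Every step only removes or reorders items from `authorized`, so the result is always a subset of the base set, which is what makes composing several skills safe.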

Why this is positioned differently

Stack the layers up and the differentiation is structural, not a feature list:

Versus build-deploy-CDN app platforms: there's no build step and no deploy step. The artifact is the running code — edited live, addressable, queryable as data. The iteration loop an agent needs is the platform's native loop, not a mode it has to fight.
Versus agent frameworks: most hand you the reasoning loop and leave the rest — persistence, multi-tenancy, eventing, governance, a deployment target — for you to assemble. Here those are one substrate behind one typed interface.
Versus closed agent products: the surface the agent uses is open to any MCP client. It's infrastructure, not a sealed box.

The payoff is a single typed interface an agent uses to think, edit, and deploy, and because application artifacts resolve per request, the loop from "decide" to "live" is one write. The runtime, the build target, and the data are the same thing. That's the part we don't see a shortcut to imitating, because it isn't a feature; it's the substrate.

What's proven, and what's next

We'll close by separating what we've earned from what we've only engineered toward.

Proven, in production:

Live iteration on a running system — per-request artifact resolution genuinely delivers it; it was the daily reality of building everything above.
The wrap-and-tee harness running in production: a user stream kept intact while a full observation / persistence / eventing path runs beside it, behind kill-switches, with the old path one flag away.
Governed routing rerouting around an upstream that declared a capability and then broke it — declaratively, no code change.
Parallel read-only subagents completing correctly once given isolated connections.

Designed for, but not yet measured:

Heavy-load performance. No formal load tests, soak tests, or latency/throughput SLOs yet. The shape is right for scale — stateless server contract, append-only stores with conflict-retry, failure-isolated observation, per-child isolation for parallelism — but shape is a hypothesis until it's benchmarked. The next milestone is exactly that: concurrency load tests, p99 latency and error-rate SLOs under sustained parallel runs, and contention testing on the storage path.
Parallel writes across subagents, which need an isolation model we haven't built.
A pre-deploy correctness gate strong enough to match the speed of hot-reload — the flip side of having no build step, and active work.

Being specific about that second list is what earns the first. The architecture is real, it's running, and it's shaped for where we're taking it. We'll publish the load numbers when we have them.

— The OBTO engineering team