← All articles Engineering

LLM Data Cleaning: Patterns for Production ETL Pipelines

OBTO Team · Insights from the Glass Box

Every data team has a folder of horrors: a CSV with three different date formats, a vendor column with seventeen spellings of the same company, an address field that's sometimes structured and sometimes a paragraph. Traditional ETL is excellent at structure and helpless against ambiguity. LLMs are the first tool that's good at exactly the ambiguous part — but dropping a model into the middle of a pipeline naively gets you something slow, expensive, and non-deterministic. This guide covers the patterns that make LLM data cleaning work in production.

Where LLMs help — and where they don't

The first rule of LLM data cleaning is to use as little LLM as possible. If you can write the rule, write the rule. Trimming whitespace, casting types, parsing ISO dates, validating emails — deterministic code does all of this for free, instantly, and identically every time. A model adds latency, cost, and a nonzero error rate to problems that have exact solutions.

Where rules run out is where models earn their keep: recognizing that IBM, International Business Machines, and I.B.M. Corp. are one vendor; extracting structured fields from a free-text address; mapping a messy product description onto your taxonomy; inferring that "2.5k" in a salary column means 2,500 and not 2.5. These are judgment calls, and judgment is what you're paying the model for.

The hybrid pipeline pattern

The architecture that consistently works is rules first, model second, humans last:

for (const row of batch) {
  const cleaned = applyDeterministicRules(row);   // free, exact
  const result  = validate(cleaned, schema);      // JSON Schema / types

  if (result.ok) { emit(cleaned); continue; }     // most rows stop here

  const repaired = await llmRepair(cleaned, {     // model only on failures
    schema, errors: result.errors, temperature: 0
  });

  if (validate(repaired, schema).ok && repaired.confidence >= 0.85) {
    emit({ ...repaired, provenance: "llm" });
  } else {
    reviewQueue.push({ row, repaired });          // humans on the residue
  }
}

The key property: only rows that fail deterministic validation ever reach the model. In most real datasets that's a small fraction of rows — which changes the economics completely. You're not paying to run a model over a table; you're paying to adjudicate its exceptions.

Four cleaning tasks LLMs do well

1. Schema extraction from free text

Turning "Bldg 4, 220 5th Ave, NYC 10001, attn: J. Park" into street, city, postal code, and contact fields. Constrain the output with a JSON schema and reject anything that doesn't parse — never regex the model's prose. Structured-output modes exist on every major provider now; use them.

2. Entity resolution and deduplication

The mistake to avoid is asking the model to compare rows pairwise — that's O(n²) model calls and your budget is gone by lunch. Instead, generate candidate pairs cheaply (embeddings, trigram similarity, or blocking on normalized keys) and use the LLM only to adjudicate the candidates. The model sees hundreds of pairs, not millions.

3. Categorization against a taxonomy

Give the model the full category list and a handful of labeled examples, and ask for a category plus a confidence score. Below your threshold, route to review. Resist the urge to let it invent new categories — closed-world prompts are dramatically more consistent.

4. Format and unit normalization

Currencies, units, phone formats. The reliable division of labor: the model identifies what it's looking at ("this is GBP, in thousands"), deterministic code does the conversion. Don't let the model do arithmetic it doesn't need to do — that's a known failure mode with an easy workaround.

Making it trustworthy

A cleaning pipeline rewrites your data. That deserves the same discipline as a database migration.

Validate every model output against the target schema. Invalid output gets one retry with the validation errors included in the prompt, then goes to the review queue. Never emit unvalidated model output downstream.
Keep provenance. Store the original value, the cleaned value, what produced it (rule, model + version, or human), and the prompt version. When someone asks "why does this row say that?", you want an answer, not a shrug — the same glass-box principle we apply to agent observability.
Pin everything. Temperature 0, pinned model versions, versioned prompts. You won't get perfect determinism from an LLM, but you can get reproducibility of everything around it.
Audit the confident answers too. Sample a slice of high-confidence outputs for human review every run. Confidence scores drift; your sampling shouldn't.

Controlling cost

Cleaning calls are short, structured, and routine — exactly the workload that does not need a frontier model. Route repairs to a fast, cheap model and reserve the expensive one for the gnarliest residue; this is the same routing logic we covered in multi-model orchestration. Batch rows into single calls where the schema allows it, and cache the system prompt — taxonomy and schema definitions are identical across calls, which makes them ideal cache material.

The metric to watch is cost per cleaned row, measured against success rate — the same unit-economics discipline as agent cost tracking. A short schema-repair call on a small model costs a fraction of a cent at published per-token rates; the budget killer is sending entire tables through a frontier model because it was the path of least resistance.

Running it in production

Operationally, a cleaning pipeline is an agent workflow: scheduled runs, tool calls, per-run traces, a review queue. On OBTO you can wrap your cleaning steps as MCP tools, schedule the pipeline, and get a per-run ledger of every model call and token by default — so "what did last night's run change, and what did it cost?" is a query, not an investigation.

If you want to try the pattern on a real dataset, the getting-started guide takes about ten minutes, and the free Builder tier on our pricing page is enough to run a hybrid pipeline end to end. Start with the rules, add the model where the rules run out, and keep humans on the residue. That ordering is the whole trick.