LLM Output Is Not Data

Somewhere in your production system, there is probably a piece of code that does something like this: it calls an LLM, parses the response as JSON, and passes the result to a downstream function that expects a valid, well-typed object. Maybe there is a try/catch around the JSON parse. Maybe there is schema validation. More likely, there is not.

This pattern — treating LLM output as if it were structured data — is one of the most pervasive reliability mistakes in AI-integrated systems. The engineers building these pipelines are not careless. They understand that LLMs can produce unexpected output. They've just underestimated how deep the mismatch goes.

What LLM output actually is

When an LLM generates a response, it is sampling from a probability distribution over tokens. Given a prompt and a context window, the model computes a distribution over the next token, and the decoder picks from it: the single most likely token under greedy decoding, or a random draw from the distribution at nonzero temperature. The output is not retrieved from a store. It is not the return value of a function written to satisfy a specification. It is generated, one token at a time, by a process that, unless decoding is explicitly constrained to a grammar or schema, has no mechanism for guaranteeing structural correctness.

Structured data — a database record, a validated API response, a typed function argument — has a contract. It will be the type it claims to be. Absent a bug, a string field will be a string, a required field will be present, an enum value will be one of the defined options. These guarantees exist because a human or a type system enforced them at the point of production.

LLM output has no such contract. The model was trained to produce token sequences that look like valid JSON when asked for JSON. It succeeds at this the vast majority of the time. "The vast majority of the time" is not "always," and in production systems, the tail matters.

The failure modes are not rare edge cases

The common mental model for LLM output failures is: occasionally the model returns something garbled, the parser throws, you handle the exception, you retry. This is accurate but incomplete. The more dangerous failures are the ones that don't throw.

A model asked to return a JSON object with a severity field constrained to ["low", "medium", "high"] might return "moderate" instead of "medium". That is a semantically valid response from the model's perspective — "moderate" is in the neighborhood of "medium." It is an invalid value for the downstream system that was expecting an enum member. Depending on how the receiving code handles unexpected enum values, this either silently defaults to a wrong severity level or propagates an error several function calls later, far from the LLM call that caused it.
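One way to make the "moderate" case fail loudly at the boundary, rather than several calls later, is to convert the raw value into an enum as soon as it arrives. The following is a minimal sketch; the names Severity and parse_severity are illustrative and not taken from any particular codebase.

```python
from enum import Enum


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def parse_severity(raw: object) -> Severity:
    """Return a Severity member, or raise ValueError right at the LLM boundary."""
    if not isinstance(raw, str):
        raise ValueError(f"severity must be a string, got {type(raw).__name__}")
    try:
        # Tolerate case and whitespace differences, but nothing semantic.
        return Severity(raw.strip().lower())
    except ValueError:
        allowed = [s.value for s in Severity]
        raise ValueError(f"severity {raw!r} is not one of {allowed}") from None
```

Note that the sketch deliberately does not map "moderate" to "medium". Whether to accept near-synonyms is a product decision; if you do, an explicit alias table is easier to audit than fuzzy matching.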

A model asked to summarize a document might return a string that contains the phrase "Here is a JSON summary:" followed by the actual JSON. If your parsing code does JSON.parse(response) directly, it throws. If it strips leading text first, it might work. If there are two JSON blocks in the response — which can happen when the model is "thinking out loud" — you might parse the wrong one.
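A hedged sketch of one way to handle the wrapped-JSON case: scan the response for embedded JSON objects using the standard library's JSONDecoder.raw_decode, and refuse to guess when there is more than one. The function names are hypothetical, and the "exactly one object" policy is an assumption; your task may call for a different rule.

```python
import json


def extract_json_objects(text: str) -> list[dict]:
    """Return every JSON object found while scanning text left to right."""
    decoder = json.JSONDecoder()
    found = []
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return found
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON at this brace; keep scanning
            continue
        found.append(value)
        pos = end  # skip past the parsed object, including anything nested in it


def extract_single_object(text: str) -> dict:
    """Return the only JSON object in text, or raise if there are zero or several."""
    objects = extract_json_objects(text)
    if len(objects) != 1:
        raise ValueError(f"expected exactly one JSON object, found {len(objects)}")
    return objects[0]
```

One caveat: if an outer object is malformed but contains a well-formed nested object, the scan will find the nested one. The result still needs schema validation before anything downstream sees it.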

A model asked to extract a list of items might return an empty array when nothing matches, return a single item as a string instead of a single-element array, or return null. These are all semantically reasonable behaviors. They all break downstream code that assumes the field is always a non-null array.
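If you decide that these shapes should be tolerated rather than rejected, the tolerance belongs in one explicit place. Below is a minimal sketch of such a normalizer; normalize_items is a hypothetical helper, and the choice to coerce null and bare strings is an assumption you may not want to make for your own task.

```python
def normalize_items(raw: object) -> list[str]:
    """Coerce the shapes a model plausibly returns into a list of strings."""
    if raw is None:
        return []  # the model said "nothing matched" with null
    if isinstance(raw, str):
        return [raw]  # a single item returned bare instead of wrapped in a list
    if isinstance(raw, list) and all(isinstance(x, str) for x in raw):
        return list(raw)
    raise ValueError(f"items has unexpected shape: {raw!r}")
```

Anything outside the three anticipated shapes still raises, so a new failure mode shows up as an error rather than as a silently mangled record.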

The point is not that these are random unpredictable failures. They are predictable in a probabilistic sense — you can characterize the distribution of output shapes your model produces on a given task. But that distribution has tails, and at production volume, those tails show up.
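The arithmetic behind "the tails show up" is short enough to write down. This sketch assumes each call fails independently with the same probability, which is a simplification, and the 0.1% rate used in the note that follows is purely hypothetical.

```python
def expected_failures(failure_rate: float, calls: int) -> float:
    """Expected number of bad outputs over a given number of calls."""
    return failure_rate * calls


def prob_at_least_one(failure_rate: float, calls: int) -> float:
    """Probability of seeing one or more bad outputs, assuming independent calls."""
    return 1 - (1 - failure_rate) ** calls
```

Under those assumptions, a 0.1% failure rate gives a bit under a 10% chance of seeing any failure in 100 calls, and about 100 expected failures in 100,000 calls. The same rate that is invisible in a demo is a daily occurrence at volume.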

Why this matters more than engineers usually acknowledge

Software systems are built on a foundation of contractual assumptions about data. Function A passes a value to function B; function B assumes the value satisfies certain constraints. This is so deeply embedded in how we write code that we often don't notice we're doing it. Static types make some of these contracts explicit. Runtime validation frameworks make others explicit. The rest live in the programmer's mental model.

When you insert an LLM into a data pipeline, you are inserting a non-deterministic process into a system built on deterministic contracts. The LLM call is a seam between the probabilistic world and the contractual world. If you don't treat it as such — if you don't place explicit, enforced schema validation at that seam — you have created a reliability time bomb.

The bomb has a long fuse. At low traffic, the tail failures are rare enough that you might not see one for weeks. You run the system, things work, you gain confidence. Then traffic increases, or you change the prompt slightly, or the model gets updated, and the tail starts showing up in your error logs — or worse, in your data, where it silently corrupts records for days before someone notices.

The engineering response

The first principle is: treat every LLM call boundary as an untrusted external input, with the same discipline you'd apply to user-submitted form data or a third-party API response.

That means schema validation is mandatory, not optional. Not just "catch the JSON parse exception" but full structural validation: required fields present, fields have the expected types, enum values are members of the defined set, numeric values are in the expected range. The validation layer at the LLM boundary should be at least as strict as the validation layer at your API boundary.
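Here is a minimal sketch of what "full structural validation" can look like using only the standard library. The Finding type and its three fields are invented for illustration; in practice a validation library would likely do this with less code, but the checks are the same.

```python
import json
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"low", "medium", "high"}
REQUIRED_FIELDS = {"title", "severity", "score"}


@dataclass(frozen=True)
class Finding:
    title: str
    severity: str
    score: float


def parse_finding(raw_text: str) -> Finding:
    """Parse and validate model output; raise ValueError on any violation."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")

    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

    title, severity, score = data["title"], data["severity"], data["score"]
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity {severity!r} is not an allowed value")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return Finding(title=title, severity=severity, score=float(score))
```

Everything downstream takes a Finding, never a dict, so code past the boundary cannot accidentally receive unvalidated output.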

It means retry logic is necessary but not sufficient. When validation fails, you can retry the LLM call with a clarifying prompt, but you need a circuit breaker. Some prompts produce malformed output reliably under certain input conditions. Retrying indefinitely is not a fix; it's a latency amplifier.
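The sketch below combines a bounded retry with a very simple circuit breaker. It assumes you supply two callables, call_model (prompt in, raw text out) and validate (raw text in, parsed value out, raising ValueError on bad output); both are placeholders, as are the class and exception names. A production breaker would also reset after a cool-down period, which is omitted here for brevity.

```python
from typing import Any, Callable


class LLMOutputError(Exception):
    """No attempt produced output that passed validation."""


class CircuitOpenError(Exception):
    """Too many consecutive calls failed; refusing to call the model."""


class ValidatedLLMCaller:
    def __init__(self, call_model: Callable[[str], str],
                 validate: Callable[[str], Any],
                 max_attempts: int = 3, failure_threshold: int = 5) -> None:
        self.call_model = call_model
        self.validate = validate
        self.max_attempts = max_attempts
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def call(self, prompt: str) -> Any:
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("validation keeps failing; not calling the model")
        current_prompt = prompt
        errors = []
        for _ in range(self.max_attempts):
            raw = self.call_model(current_prompt)
            try:
                result = self.validate(raw)
            except ValueError as exc:
                errors.append(str(exc))
                # Retry with a clarifying prompt that includes the rejection reason.
                current_prompt = (f"{prompt}\n\nYour previous reply was rejected: "
                                  f"{exc}. Reply with valid output only.")
                continue
            self.consecutive_failures = 0
            return result
        self.consecutive_failures += 1
        raise LLMOutputError(f"gave up after {self.max_attempts} attempts: {errors}")
```

The two exception types are kept separate on purpose: LLMOutputError means "this input did not work," while CircuitOpenError means "stop sending traffic and page someone."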

It means your prompts and your schemas should be co-designed and version-controlled together. If the prompt changes, the expected output structure might change. If the schema changes, the prompt needs to reflect it. Treating these as separate concerns that happen to interact is how you get silent failures after a prompt update.
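One way to keep the two from drifting apart is to define them in a single object, so the prompt text and the validator read from the same list of allowed values. This is an illustrative sketch; ExtractionContract, its fields, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionContract:
    version: str
    instruction: str
    severity_values: tuple[str, ...]

    def render_prompt(self, document: str) -> str:
        """Build the prompt from the same values the validator enforces."""
        allowed = ", ".join(f'"{v}"' for v in self.severity_values)
        return (
            f"{self.instruction}\n"
            f'Return a JSON object with one field, "severity", '
            f"whose value is one of: {allowed}.\n\n"
            f"Document:\n{document}"
        )

    def validate(self, data: dict) -> dict:
        """Reject values the prompt did not offer; tag errors with the version."""
        value = data.get("severity")
        if value not in self.severity_values:
            raise ValueError(f"[contract {self.version}] invalid severity: {value!r}")
        return data
```

Because the contract lives in one file, a change to the allowed values updates the prompt and the validator in the same commit, and the version string in the error message tells you which contract produced a bad record.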

The deeper problem: confidence calibration

There is a subtler issue beyond structural validation. LLMs don't know what they don't know. When a model extracts a value from a document, it produces its best guess. When the document is ambiguous, the model still produces a confident-looking output. There is no "I'm not sure about this field" in standard JSON. The model either outputs a value or it doesn't, and the presence of a value communicates nothing about the model's actual confidence in it.

Downstream systems that consume LLM output typically have no visibility into this uncertainty. They receive a well-formed JSON object, it passes validation, and they proceed. Whether the model was effectively 60% sure of an extracted value or 95% sure is lost at the boundary.

For applications where precision matters — medical coding, legal contract extraction, financial data normalization — this is a serious problem. The engineering responses here are more expensive: requiring the model to output explicit confidence scores, running multiple samples and checking for agreement, routing low-confidence outputs to human review. None of this is standard practice in most LLM integrations.
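As a sketch of the "multiple samples" approach: call the model several times on the same input and accept the answer only if enough samples agree. Here sample is a placeholder for a zero-argument callable that returns one extracted value per model call, and the default sample count and threshold are arbitrary assumptions. Agreement between samples is a rough proxy for confidence, not a calibrated probability.

```python
from collections import Counter
from typing import Callable, Optional


def extract_with_agreement(
    sample: Callable[[], str],
    n: int = 5,
    min_agreement: float = 0.8,
) -> tuple[Optional[str], float]:
    """Return (value, agreement), with value None when agreement is too low."""
    if n < 1:
        raise ValueError("n must be at least 1")
    votes = Counter(sample() for _ in range(n))
    value, count = votes.most_common(1)[0]
    agreement = count / n
    if agreement < min_agreement:
        return None, agreement  # caller should route this to human review
    return value, agreement
```

The cost is explicit: n model calls per extraction instead of one. That is why this tends to be reserved for fields where a wrong value is expensive.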

The fundamental reframe is this: LLM output is the output of a statistical process with known uncertainty. Data is a record with contractual guarantees. The moment you start treating the former as the latter without an explicit translation layer, you have introduced a class of reliability failures into your system that conventional software engineering practices weren't designed to catch.

That translation layer — validation, confidence handling, graceful degradation — is not boilerplate. It is the core engineering work of building reliable AI-integrated systems.
