Testing AI Features: Why Unit Tests Lie and What to Do Instead

You ship an LLM-powered feature. All your unit tests pass. CI is green. You deploy on Friday afternoon. Saturday morning, support has 40 tickets — the AI is hallucinating customer names, writing emails with the wrong tone, recommending products that don't exist.

The unit tests didn't catch any of this. They were never going to.

Testing AI features requires a different framework than testing deterministic code. The unit-test mindset actively misleads you.

Why unit tests lie about LLM behavior

A traditional unit test:

def test_format_date():
    assert format_date("2026-04-30") == "April 30, 2026"

The function is deterministic. One input, one output. Test once, done.

An LLM-powered test:

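def test_summarize():
    # summarize() calls the LLM, so the output can differ on every run
    result = summarize("Long article text...")
    # Keyword checks only prove the words appear, not that the summary is right
    assert "Apple" in result
    assert "earnings" in result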

The LLM is probabilistic. Run it 100 times and you can get 100 different outputs. Your test sees one. It might pass on output #47 and fail on output #48. Setting temperature=0 reduces the variance, but a model update can still change behavior underneath you.

Worse: your test only catches the most superficial form of breakage ("the word 'Apple' appears"). It misses:

Wrong but plausible output (hallucinated facts)
Right output but wrong tone
Right output but missing important info
Refusals or hedging that wreck the UX

Unit tests provide false confidence. Green CI, broken behavior in production.
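To make the variance concrete, here is a minimal sketch that runs the same keyword check many times and reports a pass rate instead of a single pass/fail. `fake_summarize` is a hypothetical stand-in for the real LLM call (it just picks from canned outputs), so the numbers only illustrate the idea.

```python
import random


def fake_summarize(text: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: the output varies from run to run."""
    candidates = [
        "Apple reported strong quarterly earnings.",
        "Apple posted record earnings this quarter.",
        "The company beat revenue expectations.",  # plausible, but fails the keyword check
    ]
    return rng.choice(candidates)


def keyword_check(output: str) -> bool:
    """The same superficial assertion as the unit test above."""
    return "Apple" in output and "earnings" in output


def pass_rate(n_runs: int, seed: int = 0) -> float:
    """Run the check n_runs times and return the fraction of runs that pass."""
    rng = random.Random(seed)
    passes = sum(
        keyword_check(fake_summarize("Long article text...", rng))
        for _ in range(n_runs)
    )
    return passes / n_runs


if __name__ == "__main__":
    print(f"pass rate over 100 runs: {pass_rate(100):.0%}")
```

A single run of the unit test reports either green or red. The pass rate is the number that actually describes the feature.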

What evals are

An "eval" is structured measurement of LLM behavior on a representative dataset. The vocabulary differs from testing on purpose:

| Unit test                  | Eval                             |
|----------------------------|----------------------------------|
| Pass/fail                  | Score (0-100% or rubric)         |
| Single input               | Dataset of 50-1000 examples      |
| Tests one function         | Tests one capability             |
| Run on every commit        | Run on every model/prompt change |
| Maintained by code authors | Maintained by product + engineering |

Evals are a large part of how models like GPT-4 get shipped, and they are how Anthropic and OpenAI iterate. If you're building on LLMs, you need them too.

The minimum viable eval

You need three things:

1. A dataset. 50-200 representative examples. Real user queries, paired with what good output looks like.

{"input": "Cancel my subscription", "expected_intent": "cancellation", "expected_tone": "empathetic"}
{"input": "Why am I being charged twice?", "expected_intent": "billing_dispute", "expected_tone": "apologetic"}

2. A scorer. Code that judges outputs. Three styles:

Exact match / regex — for structured output (intent, JSON schemas, classifications)
LLM-as-judge — another LLM scores quality on a rubric. Cheap, scalable, surprisingly accurate
Human review — gold standard for subjective qualities. Don't skip entirely, just sample.

3. A runner. It loops through the dataset, calls your model, scores the results, and reports an aggregate. Tools: Promptfoo, OpenAI Evals, LangSmith, Braintrust, or 50 lines of Python.
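The "50 lines of Python" version of the three pieces above might look roughly like this. It is a sketch under assumptions: `toy_model` is a hypothetical keyword rule standing in for your real LLM call, and the model is assumed to return a dict with an `intent` key.

```python
import json
from typing import Callable

# Two rows in the same JSON Lines shape as the dataset example above.
DATASET_JSONL = """\
{"input": "Cancel my subscription", "expected_intent": "cancellation", "expected_tone": "empathetic"}
{"input": "Why am I being charged twice?", "expected_intent": "billing_dispute", "expected_tone": "apologetic"}
"""


def load_dataset(jsonl_text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def exact_match_scorer(example: dict, output: dict) -> float:
    """Score 1.0 when the predicted intent matches the expected one, else 0.0."""
    return 1.0 if output.get("intent") == example["expected_intent"] else 0.0


def run_eval(
    dataset: list[dict],
    model: Callable[[str], dict],
    scorer: Callable[[dict, dict], float],
) -> dict:
    """Call the model on every example, score it, and report the aggregate."""
    if not dataset:
        raise ValueError("dataset is empty")
    scores = [scorer(example, model(example["input"])) for example in dataset]
    return {"n": len(scores), "accuracy": sum(scores) / len(scores)}


def toy_model(user_input: str) -> dict:
    """Hypothetical stand-in for the real LLM call: a keyword rule, not a model."""
    intent = "cancellation" if "cancel" in user_input.lower() else "other"
    return {"intent": intent}


if __name__ == "__main__":
    print(run_eval(load_dataset(DATASET_JSONL), toy_model, exact_match_scorer))
```

Passing the model and scorer in as plain callables keeps the runner unchanged when you swap an exact-match scorer for an LLM judge, or one model for another.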

What to score

For a typical chat/agent feature, the categories I'd track:

Task completion — did it do what was asked? (LLM-as-judge or scripted)
Factuality — no hallucinated info (LLM-as-judge against retrieved context)
Tone / style — matches brand voice (LLM-as-judge with examples)
Refusal rate — when should it refuse? (Curated edge cases)
Latency — p50/p95 (just measure)
Cost — tokens per task (just measure)

You don't need all of these on day one. Start with task completion. Add categories as you find failure modes in production.
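The two "just measure" rows need very little code. Here is a small sketch using a nearest-rank percentile; the function names and the idea of recording a token count per task are illustrative assumptions, not any particular tool's API.

```python
import math
import time
from typing import Callable


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def timed_call(model: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Call the model once and return (output, latency in milliseconds)."""
    start = time.perf_counter()
    output = model(prompt)
    return output, (time.perf_counter() - start) * 1000


def summarize_runs(latencies_ms: list[float], tokens_per_task: list[int]) -> dict:
    """Aggregate latency and cost measurements collected during an eval run."""
    return {
        "latency_p50_ms": percentile(latencies_ms, 50),
        "latency_p95_ms": percentile(latencies_ms, 95),
        "mean_tokens_per_task": sum(tokens_per_task) / len(tokens_per_task),
    }
```

Collect one latency and one token count per example while the eval runs, then call `summarize_runs` once at the end.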

LLM-as-judge: the surprisingly good shortcut

A second LLM scoring outputs sounds dubious. In practice it's:

Cheaper than humans by 100x
Faster than humans by 1000x
Correlated with human judgment at ~0.7+ for most quality dimensions
Easy to scale across thousands of examples

The trick: give the judge a rubric, examples of good and bad, and ask for a score with reasoning.

You are scoring a customer support reply.

Rubric:
- 5: Perfect — accurate, empathetic, actionable
- 3: Acceptable — accurate but tone-off OR right tone but missing detail
- 1: Bad — inaccurate or actively harmful

Reply to score: <REPLY>

Output JSON: { "score": 1-5, "reasoning": "..." }

Use a strong model (Claude Opus, GPT-4) as the judge — judging is harder than generating in many cases.
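Wiring the rubric prompt into code could look something like the sketch below. The `call_judge` parameter is a hypothetical callable that sends a prompt to whichever judge model you use and returns its raw text; `fake_judge` only exists so the example runs without an API key. The prompt text is adapted from the rubric above.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are scoring a customer support reply.

Rubric:
- 5: Perfect (accurate, empathetic, actionable)
- 3: Acceptable (accurate but tone-off OR right tone but missing detail)
- 1: Bad (inaccurate or actively harmful)

Reply to score: <REPLY>

Output JSON: { "score": 1-5, "reasoning": "..." }
"""


def parse_judgment(raw: str) -> dict:
    """Parse the judge's JSON and reject scores outside the rubric's range."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}


def judge_reply(reply: str, call_judge: Callable[[str], str]) -> dict:
    """Fill the rubric prompt with the reply and return the parsed judgment."""
    prompt = JUDGE_PROMPT.replace("<REPLY>", reply)
    return parse_judgment(call_judge(prompt))


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for the judge model; always returns the same verdict."""
    return json.dumps({"score": 3, "reasoning": "Accurate but the tone is off."})


if __name__ == "__main__":
    print(judge_reply("Your refund is on its way.", fake_judge))
```

Validating the score range matters more than it looks: a judge that occasionally returns malformed or out-of-range output will otherwise skew your aggregate without any visible error.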

Where LLM-as-judge breaks

It's bad for:

Subtle subjective qualities (humor, voice, brand alignment) — calibrate with humans
Truly novel outputs the judge has no rubric for
Adversarial cases (the judge has the same biases as the generator)

For these, you need humans. Sample 50 outputs/week, have a domain expert score them, calibrate the LLM judge against the human scores.
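One way to do that calibration is to score the same sample with both the expert and the judge, then look at how well the two agree. The sketch below computes a Pearson correlation plus the share of items where the scores differ by at most one point; the one-point tolerance is an assumption you should tune.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def agreement_report(human: list[int], judge: list[int], tolerance: int = 1) -> dict:
    """Compare human and judge scores given to the same outputs."""
    within = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return {
        "pearson": pearson(human, judge),
        "within_tolerance": within / len(human),
    }


if __name__ == "__main__":
    human_scores = [5, 3, 1, 4]
    judge_scores = [4, 3, 3, 4]
    print(agreement_report(human_scores, judge_scores))
```

If agreement is low on a particular dimension, that is a signal to rewrite the rubric or keep that dimension on human review.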

CI integration

Evals don't run on every commit (too slow, too expensive). They run on:

Prompt changes
Model upgrades (Sonnet 4.6 → 4.7)
Major code changes to the AI pipeline
Pre-release before deploying

For each, report the new scores against the baseline. Post a PR comment if scores regress, and block the merge if a critical metric drops.

Evals: customer_support_v2

| Metric          | Baseline | This PR | Δ      |
|-----------------|----------|---------|--------|
| Task completion | 87%      | 89%     | +2%    |
| Factuality      | 94%      | 91%     | -3% ⚠  |
| Tone match      | 81%      | 80%     | -1%    |
| Latency p95     | 2100ms   | 2150ms  | +50ms  |
| Cost per task   | $0.012   | $0.011  | -$0.001|

The factuality regression blocks the merge. The engineer investigates, finds that the new prompt encourages over-confident statements, and fixes it.
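The merge gate itself can be a few lines. This sketch reuses the numbers from the table above; which metrics count as critical and how much of a drop is tolerated (`MAX_DROP`) are assumptions that your team would set.

```python
BASELINE = {"task_completion": 0.87, "factuality": 0.94, "tone_match": 0.81}
CANDIDATE = {"task_completion": 0.89, "factuality": 0.91, "tone_match": 0.80}

# Assumed policy: the largest drop allowed on each critical metric.
# Metrics not listed here (tone_match) are reported but never block.
MAX_DROP = {"task_completion": 0.02, "factuality": 0.02}


def find_regressions(baseline: dict, candidate: dict, max_drop: dict) -> list[str]:
    """Return the critical metrics whose score fell by more than the allowed drop."""
    failures = []
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed + 1e-9:  # small epsilon guards against float noise
            failures.append(metric)
    return failures


def gate_exit_code(failures: list[str]) -> int:
    """Non-zero exit code fails the CI job and blocks the merge."""
    return 1 if failures else 0


if __name__ == "__main__":
    failed = find_regressions(BASELINE, CANDIDATE, MAX_DROP)
    print("blocking regressions:", failed or "none")
```

In a CI script you would finish with `sys.exit(gate_exit_code(failed))` so the job status reflects the result.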

Production monitoring is the real eval

Evals catch known failure modes. Production catches the rest.

For LLM-powered features, monitor:

User downvotes / corrections / re-prompts
Conversation drop-off rates
Support ticket volume mentioning AI
Manual review of 100 random conversations/week

Surprising production failures become eval examples. The dataset grows.
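That feedback loop can be sketched in a few lines: draw a random sample for weekly review, then turn each confirmed failure into a dataset row. The conversation field name `user_message` is a hypothetical schema; adjust it to match your own logs.

```python
import json
import random
from typing import Optional


def sample_for_review(
    conversations: list[dict], k: int = 100, seed: Optional[int] = None
) -> list[dict]:
    """Pick up to k conversations uniformly at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(conversations, min(k, len(conversations)))


def to_eval_example(conversation: dict, expected_intent: str, expected_tone: str) -> str:
    """Turn a reviewed production failure into one JSON Lines dataset row."""
    return json.dumps(
        {
            "input": conversation["user_message"],  # hypothetical field name
            "expected_intent": expected_intent,
            "expected_tone": expected_tone,
        }
    )


if __name__ == "__main__":
    logs = [{"id": i, "user_message": f"message {i}"} for i in range(500)]
    batch = sample_for_review(logs, k=100, seed=1)
    print(len(batch), "conversations queued for review")
    print(to_eval_example(batch[0], "billing_dispute", "apologetic"))
```

The reviewer supplies the expected intent and tone, which keeps the labels human-verified even when the example came from production.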

The takeaway

Don't unit-test LLM features. Unit tests are the wrong tool. Build an eval suite (dataset + scorer + runner) and run it on every prompt or model change. Pair it with production monitoring to catch failures the eval doesn't predict. You'll ship AI features with a lot more confidence and a lot less weekend support volume.