The AI Measurement Trap: Why Your Best-Ever DORA Numbers Should Scare You

AI is making your DORA metrics look incredible while hiding real problems. Here is why every DORA number is now suspect — and what elite teams measure instead.

Your deployment frequency is up 41%. Lead time for changes is half what it was last year. Change failure rate is holding at an elite-tier 2.1%. Your DORA metrics have never looked better.

You should be worried.

AI is doing something subtle and dangerous to engineering teams right now: it's making all the wrong numbers go up. The metrics we built to measure healthy engineering — DORA, velocity, cycle time — were designed for a world where humans write code. That world is gone. And the measurement frameworks we haven't updated are now actively misleading leaders who depend on them.

This isn't a post about AI making your team worse. It usually doesn't. This is about something harder to fix: you can no longer tell the difference between a team that's genuinely improving and one that's accumulating invisible risk — because the numbers look identical.


What DORA Was Built For

DORA (DevOps Research and Assessment) began as an independent research program, acquired by Google in 2018, that spent years studying software delivery practices across thousands of teams. Its four metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service, usually tracked as mean time to restore) were designed to measure the health of a delivery process driven by human decisions and human output.

The model rested on a set of assumptions that were entirely reasonable in 2018:

  • More frequent deployments mean smaller batches, less risk per change, better engineering habits
  • Shorter lead time means less process friction and faster feedback loops
  • Lower change failure rate means quality practices are working
  • Fast restore time means good incident culture and operational maturity

Every one of those assumptions held. Then 75% of professional developers started relying on AI for at least half their work.


How AI Breaks Each DORA Metric

[Figure: DORA metrics inflated by AI-generated code]

Deployment Frequency: inflated by scaffolding

AI can generate a pull request in under a minute. Boilerplate, configuration, tests, documentation — code that used to take a senior engineer a day comes back in 20 minutes of iteration.

Result: deployment frequency goes up. But not because your engineering culture improved. Because AI is shipping more commits with less signal per commit. The metric no longer distinguishes between "we've improved our batching discipline" and "we're pushing AI output into production faster."

The downstream effect is worse: teams under velocity pressure review AI-generated PRs in less time. More commits hitting review means less attention per commit. You're measuring throughput while oversight quietly degrades.
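
One way to make that degradation visible is to track review attention next to deployment frequency. The sketch below is a minimal illustration, not a standard DORA calculation: `PullRequest`, `review_minutes`, and the sample numbers are all hypothetical, and it assumes your review tooling can report reviewer time per PR.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    lines_changed: int
    review_minutes: float  # assumed: total reviewer time logged on this PR


def review_minutes_per_100_lines(prs: list[PullRequest]) -> float:
    """Reviewer minutes spent per 100 changed lines across a set of PRs."""
    total_lines = sum(pr.lines_changed for pr in prs)
    if total_lines == 0:
        return 0.0
    total_minutes = sum(pr.review_minutes for pr in prs)
    return 100 * total_minutes / total_lines


# Illustrative data only: more PRs this quarter, less review time per line.
last_quarter = [PullRequest(200, 40), PullRequest(300, 60)]
this_quarter = [PullRequest(400, 30), PullRequest(600, 40), PullRequest(500, 35)]

print(review_minutes_per_100_lines(last_quarter))  # 20.0
print(review_minutes_per_100_lines(this_quarter))  # 7.0
```

If throughput rises while this number falls, the extra deployments are being bought with thinner review.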

Lead Time for Changes: shrunk by generation, hidden by review

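AI collapses the "time to write the code" part of your lead time to near zero. Strictly, DORA starts the clock at commit, but many teams start it at the first commit on a branch or when the ticket is picked up, so generation speed flows straight into the number. A feature that took three days to implement now takes three hours of generation and iteration with an agent. Your lead time metric drops dramatically, and it looks like your engineering process got more efficient.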

What it doesn't capture: AI-generated code needs more review time, not less. Reviewers are reading code they didn't write and don't always understand, with no feel for the author's habits to lean on. The shortcut of "I know what this function does because I know how the author thinks" disappears completely when an agent wrote it.

A recent analysis across AI-adopting teams found lead times dropped 35–50% while self-reported reviewer confidence dropped 22% over the same period. The number looks great. The comprehension doesn't.
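
Splitting lead time into stages shows where the hours actually go. This is a sketch under assumptions: the four timestamps and the `Change` type are hypothetical names, and your own pipeline may expose different events.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Change:
    first_commit: datetime
    pr_opened: datetime
    pr_approved: datetime
    deployed: datetime


def stage_hours(change: Change) -> dict[str, float]:
    """Break one change's lead time into coding, review, and deploy hours."""
    def hours(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 3600

    return {
        "coding": hours(change.first_commit, change.pr_opened),
        "review": hours(change.pr_opened, change.pr_approved),
        "deploy": hours(change.pr_approved, change.deployed),
    }


# Illustrative change: three hours of generation, a full day waiting on review.
change = Change(
    first_commit=datetime(2026, 1, 5, 9, 0),
    pr_opened=datetime(2026, 1, 5, 12, 0),
    pr_approved=datetime(2026, 1, 6, 12, 0),
    deployed=datetime(2026, 1, 6, 14, 0),
)
stages = stage_hours(change)
print(stages)  # {'coding': 3.0, 'review': 24.0, 'deploy': 2.0}
```

Tracked per stage, a shrinking "coding" number can no longer mask a flat or growing "review" number.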

Change Failure Rate: looks fine until it isn't

This is the most dangerous one.

AI-generated code passes CI. It passes lint. It usually passes code review. It fails in production in ways that are genuinely hard to predict — subtle race conditions, unexpected edge cases in business logic, integration behaviors that only surface under real load or specific user flows.

As most teams implement it, DORA's change failure rate asks: "did this deployment cause an incident in the 24–72 hours after deploy?" DORA itself defines no fixed window, but in practice attribution rarely reaches further back than that. AI-generated code is particularly prone to latent failures: bugs that sit dormant for weeks and surface only when the right edge case is hit.

The 2025 DORA Report found that teams with high AI adoption and no AI-specific quality gates saw a 7.2% decrease in deployment stability — while their standard change failure rate metric was at all-time lows. They thought they were elite. They were accumulating debt they couldn't see.

Mean Time to Restore: average looks fine, P0s are brutal

AI tools genuinely help here. They assist with root cause analysis, generate fix suggestions, draft runbooks. So MTTR often improves — and that's real. AI is a legitimate operational win.

The problem is that incidents caused by AI-generated code tend to be novel failures: patterns your on-call engineers haven't seen before, in code they didn't write and may not fully understand. Novel failures resolve more slowly, even with AI assistance. Your MTTR average can look healthy while your P0 incidents are taking twice as long because nobody on the pager actually knows the system that failed.

The average hides the catastrophic outliers.
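
A small worked example makes the arithmetic concrete. The numbers below are invented for illustration, and the severity labels are assumed to come from your incident tracker: the overall mean improves quarter over quarter while the P0 mean doubles.

```python
from collections import defaultdict

# Illustrative (severity, minutes_to_restore) pairs, not real data.
last_quarter = [("P2", 40)] * 4 + [("P0", 180)]
this_quarter = [("P2", 20)] * 9 + [("P0", 360)]


def overall_mean(incidents):
    """Mean minutes to restore across all incidents."""
    return sum(minutes for _, minutes in incidents) / len(incidents)


def mean_by_severity(incidents):
    """Mean minutes to restore, grouped by severity label."""
    grouped = defaultdict(list)
    for severity, minutes in incidents:
        grouped[severity].append(minutes)
    return {sev: sum(vals) / len(vals) for sev, vals in sorted(grouped.items())}


print(overall_mean(last_quarter), overall_mean(this_quarter))  # 68.0 54.0
print(mean_by_severity(last_quarter))  # {'P0': 180.0, 'P2': 40.0}
print(mean_by_severity(this_quarter))  # {'P0': 360.0, 'P2': 20.0}
```

Reporting restore time per severity, or alongside a worst-case figure, keeps the P0 trend from disappearing into the mean.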


The Latent Defect Problem

The deepest issue is one that DORA's architecture fundamentally cannot address.

[Figure: The latent defect window, what DORA misses]

DORA's change failure rate closes the book on a deployment within days of it going live. If nothing explodes in that window, the deployment is logged as a success. Your metric improves.

AI-generated code introduces a different failure pattern. The code works fine for weeks. It passes every automated check. It survives the first few thousand production requests. Then someone hits the edge case — a specific data format, a particular sequence of events, a load pattern the tests never simulated — and you have a P0 incident 37 days after that "successful" deploy.

DORA never saw it. Your change failure rate never saw it. The metric for that deploy says "elite tier."

I call this the latent defect window — the gap between when a bug is introduced and when it surfaces, which AI dramatically widens. Human engineers tend to introduce bugs they'd recognize if they read the code again. AI agents introduce bugs that are structurally correct but semantically wrong, and nobody on the team has the intuition to catch them in review.

The practical implication: your change failure rate is increasingly measuring whether your tests are comprehensive, not whether your code is correct.


What Elite Teams Measure Instead

The answer isn't to throw out DORA. It's to understand what DORA is now measuring — process throughput — and add the three things AI makes invisible.

[Figure: The augmented measurement stack]

Layer 1: AI Attribution

Before you interpret any delivery metric, you need to know: what percentage of that change was AI-generated?

This isn't about blame or policing AI usage. It's about context. A deployment that's 10% AI-assisted and one that's 90% AI-generated carry different risk profiles, different review requirements, and different failure modes. Treating them as equivalent is like treating a surgical checklist and a vibe as the same quality process.

If you're running an LLM proxy (you should be; it gives you cost visibility and rate limiting), you already have this data. Telemetry from AI coding tools like Cursor or GitHub Copilot can provide it too. Even a simple PR convention where authors note AI involvement gives you signal.

Practical rule: flag any PR with 70%+ AI-generated content for a dedicated second reviewer. Not as a punishment — as a quality gate calibrated to the risk profile.
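
As a rough sketch, the 70% rule could be expressed as a check like the one below. The `PullRequest` fields are hypothetical: it assumes you already have a per-PR count of AI-attributed lines from a proxy, tool telemetry, or author annotation, and how you wire the result into your review tooling is up to you.

```python
from dataclasses import dataclass

AI_SHARE_THRESHOLD_PCT = 70  # the 70% rule of thumb from the text


@dataclass
class PullRequest:
    number: int
    ai_lines: int      # assumed: lines attributed to AI by proxy or tool telemetry
    total_lines: int


def needs_second_reviewer(pr: PullRequest,
                          threshold_pct: int = AI_SHARE_THRESHOLD_PCT) -> bool:
    """True when the AI-generated share of the PR meets or exceeds the threshold."""
    if pr.total_lines <= 0:
        return False
    # Integer comparison avoids floating-point edge cases right at the threshold.
    return pr.ai_lines * 100 >= threshold_pct * pr.total_lines


# Illustrative PRs with 10%, 70%, and 90% AI-attributed lines.
prs = [
    PullRequest(101, 12, 120),
    PullRequest(102, 70, 100),
    PullRequest(103, 450, 500),
]
flagged = [pr.number for pr in prs if needs_second_reviewer(pr)]
print(flagged)  # [102, 103]
```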

Layer 2: DX Core 4

The DX Core 4 framework, developed by researchers at DX (the developer experience analytics platform), is the most credible DORA successor for AI-era teams. It measures four dimensions:

  • Speed — traditional delivery velocity, DORA-compatible
  • Effectiveness — are engineers achieving goals, or just shipping code?
  • Quality — defect rates, with AI-code-specific signals layered in
  • Impact — business outcomes tied to engineering output

The critical addition over DORA is that DX Core 4 takes developer experience seriously as a leading indicator, not an afterthought. An engineering team that's burning out under AI review pressure, losing comprehension of their own codebase, and shipping faster than they can understand — that degradation shows up in DX Core 4 before it shows up in incidents. In DORA, it never shows up at all.

Layer 3: Developer Experience Signals

The cheapest, most underused signal available to any engineering leader is this one question asked post-merge:

"How confident are you that this change behaves as intended in production?"

Survey the author. Survey at least one reviewer. Track trends over time.

This sounds trivially simple. Its value is not trivial. Falling confidence is a leading indicator: it tells you your team is losing comprehension of what they're shipping before the failures arrive. Rising incident rates are a lagging indicator: they tell you after the damage is done.
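
Tracking the trend can be as simple as comparing the most recent weeks with the weeks before them. This sketch assumes answers are collected on a 1 to 5 scale and already averaged per week; the scale, the four-week window, and the function name are illustrative choices, not part of any framework.

```python
def confidence_shift(weekly_avg, window=4):
    """Mean of the latest `window` weeks minus the mean of the `window` before.

    Returns None when there is not enough history to compare two windows.
    """
    if len(weekly_avg) < 2 * window:
        return None
    recent = weekly_avg[-window:]
    previous = weekly_avg[-2 * window:-window]
    return sum(recent) / window - sum(previous) / window


# Illustrative weekly averages of the post-merge confidence question.
weekly_avg = [4.0, 4.0, 4.0, 4.0, 3.5, 3.5, 3.0, 3.0]
print(confidence_shift(weekly_avg))  # -0.75
```

A sustained negative shift is the early warning; the exact size that should trigger a conversation is a judgment call for each team.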

Add a latent defect tracking layer alongside this: separate your "incidents caused by this deployment" (DORA's CFR) from "bugs discovered that were introduced 30+ days ago." Keep both numbers. Watch the second one closely. AI teams see the second number grow while the first stays flat.
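
Keeping the two numbers apart could look something like this. It is a sketch that assumes each postmortem records when the offending change was deployed; the `Incident` fields are hypothetical, and it only splits on the 30-day line rather than reproducing your exact CFR window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

LATENT_THRESHOLD = timedelta(days=30)  # "introduced 30+ days ago"


@dataclass
class Incident:
    detected_at: datetime
    introduced_at: datetime  # assumed: deploy time of the change blamed in the postmortem


def split_by_age(incidents, threshold=LATENT_THRESHOLD):
    """Separate incidents into (recent, latent) by the age of the causing deploy."""
    recent, latent = [], []
    for incident in incidents:
        age = incident.detected_at - incident.introduced_at
        (latent if age >= threshold else recent).append(incident)
    return recent, latent


# Illustrative incidents: one surfaced a day after deploy, one 37 days after.
incidents = [
    Incident(detected_at=datetime(2026, 3, 4), introduced_at=datetime(2026, 3, 3)),
    Incident(detected_at=datetime(2026, 2, 7), introduced_at=datetime(2026, 1, 1)),
]
recent, latent = split_by_age(incidents)
print(len(recent), len(latent))  # 1 1
```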


The Three Questions for Non-Technical Leaders

If you're a CPO, CEO, or VP of Product using DORA metrics to evaluate engineering health: the numbers your team shows you in 2026 are the most misleading they've ever been. Not because engineers are gaming them — because AI made the underlying assumptions obsolete without anyone changing the dashboard.

Before your next engineering review, ask:

1. How is our AI code share trending over time?
If they don't track it, you don't have a quality story — you have a throughput story.

2. How are we tracking review quality for AI-generated PRs?
"We review everything" is not an answer. Volume + velocity kills review quality. Ask what the gate is.

3. What percentage of recent production incidents involved code written more than two weeks before the incident?
This is the latent defect question. If they've never looked at it, they don't know their actual change failure rate.

If all three answers are "we don't track that," your DORA Elite ranking is a liability disguised as an achievement.
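
For the team on the receiving end of the first question, the trend does not need special tooling. Assuming per-PR counts of AI-attributed and total lines are available (the tuple layout and sample values here are hypothetical), a monthly roll-up is a few lines:

```python
from collections import defaultdict


def monthly_ai_share(prs):
    """Share of merged lines attributed to AI, per month.

    `prs` is an iterable of (month as 'YYYY-MM', ai_lines, total_lines).
    """
    totals = defaultdict(lambda: [0, 0])
    for month, ai_lines, total_lines in prs:
        totals[month][0] += ai_lines
        totals[month][1] += total_lines
    return {
        month: totals[month][0] / totals[month][1]
        for month in sorted(totals)
        if totals[month][1] > 0
    }


# Illustrative merged PRs, not real data.
prs = [
    ("2026-01", 100, 400),
    ("2026-01", 100, 400),
    ("2026-02", 300, 600),
    ("2026-02", 150, 300),
]
shares = monthly_ai_share(prs)
print(shares)  # {'2026-01': 0.25, '2026-02': 0.5}
```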


The Bottom Line

DORA metrics are not wrong. They're incomplete — and that incompleteness now has a directional bias. AI makes every DORA metric trend in the good direction while moving real risk into dimensions DORA doesn't see.

The teams getting this right aren't abandoning DORA. They're treating it as one layer of a larger stack: add AI attribution so your metrics have context, add DX Core 4 so you can measure effectiveness and not just throughput, add developer confidence signals as an early warning system, and track latent defects separately from immediate failures.

The teams getting it wrong are showing the board their best-ever numbers and calling it progress.

Record numbers and growing hidden risk can both be true at the same time. Right now, for a lot of teams, they are.


Framework references: DX Core 4 (getdx.com) · 2025 DORA Report (dora.dev) · Anthropic 2026 Agentic Coding Trends Report (anthropic.com)
