81% Is Marketing. AI Coding Benchmarks Are Contaminated — Here's the Real Score.

SWE-bench Verified is broken. OpenAI officially stopped using it. The same models scoring 80%+ on Verified score only 23% on the contamination-resistant version. Here's what happened, why it matters, and how to actually evaluate AI coding tools.

When someone tells you their AI coding tool scores 80% on SWE-bench, they're not lying. They're just quoting a number that OpenAI stopped using to evaluate its own models.

The number is real. The benchmark it measures is corrupted.

I spent the better part of last month trying to make an honest tool choice for our team. The more I dug, the more I realized that the benchmark underpinning most "Claude Code vs Copilot vs Cursor" comparisons — SWE-bench Verified — is so thoroughly contaminated that basing any purchasing decision on it is roughly equivalent to hiring someone based on an open-book exam where they wrote the textbook.

In April 2026, OpenAI quietly retired SWE-bench Verified as its primary coding eval. There was no big announcement. Most of the people debating these tools on Twitter still haven't noticed.

That's worth sitting with: the company that popularized benchmark-driven model comparisons officially stopped using the benchmark everyone cites.

What SWE-bench Was — and Why It Mattered

SWE-bench, introduced by Princeton researchers in late 2023, was a genuine attempt to measure something real: can an AI actually fix bugs from production-grade codebases? It pulled from 12 Python projects — Django, Flask, Matplotlib, Scikit-learn and others — selecting real GitHub issues where a verifiable patch existed.

The "Verified" subset (500 tasks, selected from the full set of 2,294) was supposed to be cleaner: human-curated, with confirmation that each patch genuinely resolves the issue. For roughly 18 months it was the most credible signal available for coding agent capability. Teams built tooling to track it, vendors published blog posts about it, and engineering managers referenced it in budget justifications.

The problem: those GitHub issues were public. The models were trained on the public internet. Do the math.

The Contamination Problem

Here is the mechanism, drawn out:

How SWE-bench Contamination Works

SWE-bench tasks were drawn from public GitHub repositories — the kind that get indexed, discussed on Stack Overflow, cited in papers, referenced in blog posts, and ultimately scraped into the massive training corpora used to pre-train frontier models. When a model trains on those corpora, it is not just learning to code in general. It is partially memorizing specific issue descriptions, discussion threads, and in many cases the exact patches.

At test time, the model doesn't need to reason through the problem. It needs to retrieve the answer it already saw. The benchmark, as applied to models trained on web-scale data, is measuring how well a model recalls material it has already seen, not the coding capability you actually care about.

The evidence isn't subtle. Researchers found that when they showed a current frontier model a short snippet from a SWE-bench task description, it could output the exact gold patch — correct class, correct method, the specific early-return condition — before doing any analysis. No chain of thought. No file exploration. Just retrieval dressed up as reasoning.
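A rough way to check for this yourself is to compare what a model produces from a truncated issue snippet against the known gold patch. The sketch below is a minimal illustration, not the researchers' method: the function names, the 0.9 threshold, and the sample patches are all assumptions made up for this example, and plain string similarity is only a crude stand-in for a real memorization test.

```python
from difflib import SequenceMatcher


def patch_similarity(model_output: str, gold_patch: str) -> float:
    """Return a 0..1 similarity ratio, ignoring blank lines and indentation."""
    def normalize(text: str) -> str:
        lines = (line.strip() for line in text.splitlines())
        return "\n".join(line for line in lines if line)

    return SequenceMatcher(None, normalize(model_output), normalize(gold_patch)).ratio()


def looks_memorized(model_output: str, gold_patch: str, threshold: float = 0.9) -> bool:
    """Flag an output that nearly reproduces the gold patch (threshold is an assumption)."""
    return patch_similarity(model_output, gold_patch) >= threshold


# Hypothetical data: a gold patch and two outputs generated from a partial prompt.
gold = """
if value is None:
    return default
return self.convert(value)
"""
verbatim = "if value is None:\n    return default\nreturn self.convert(value)\n"
unrelated = "raise NotImplementedError('needs investigation')\n"

print(looks_memorized(verbatim, gold))   # True
print(looks_memorized(unrelated, gold))  # False
```

A near-verbatim match from a partial prompt doesn't prove contamination on its own, but a high hit rate across many tasks is hard to explain any other way.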

The Second Problem: Scaffold Gaming

Even if contamination were zero, there is a second distortion that makes Verified scores unreliable as a comparison tool: agent scaffolding.

SWE-bench doesn't evaluate a raw model. It evaluates a model plus its agent wrapper — the scaffolding that controls how the model reads files, plans edits, runs tests, and iterates on failures. Vendors tune this scaffold. They have a strong incentive to tune it specifically for the benchmark task structure, which is predictable: read the issue, find the relevant file, make a targeted edit, run tests.

Build an agent scaffold that excels at exactly this loop — with the right file-search heuristics, the right iteration strategy for test-failure recovery — and your score goes up without the underlying model getting any smarter at writing code.

This is why "Claude Code: 80.8% on SWE-bench Verified" is a number you should distrust twice: once for contamination, once because you're measuring Anthropic's scaffold as much as you're measuring the model. You're not seeing what the model would do dropped into your codebase with your team's workflow and your task types.

The Real Numbers

Here is what happens when you run the same frontier models on SWE-bench Pro, a contamination-resistant variant built by Scale AI from private or otherwise legally restricted codebases that are unlikely to have appeared in any model's training data:

Verified vs. Pro: The 57-Point Gap

The best-performing models on SWE-bench Pro — GPT-5 and Claude Opus 4.1 — score 23.3% and 23.1% respectively. The same models score over 80% on Verified.

That is a 57-point gap.

Read that sentence again. The distance between "what vendors market" and "what the model does on code it has genuinely never seen" is 57 percentage points for the best models in the world. For other frontier models, the delta is estimated at 50 to 55 points. There is no model on the market that doesn't have a massive gap between its Verified and Pro numbers.
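The arithmetic is simple enough to spell out. This snippet uses only the figures quoted above; because the Verified scores are given as "over 80%", I plug in 80.0 as a lower bound, so the computed gaps are lower bounds too. The function name is my own.

```python
def gap_points(verified: float, pro: float) -> float:
    """Percentage-point gap between a Verified score and a Pro score."""
    return verified - pro


# Scores quoted in this post. 80.0 is an assumed lower bound for "over 80%".
scores = {
    "GPT-5": {"verified": 80.0, "pro": 23.3},
    "Claude Opus 4.1": {"verified": 80.0, "pro": 23.1},
}

for model, s in scores.items():
    print(f"{model}: at least {gap_points(s['verified'], s['pro']):.1f} points")
# GPT-5: at least 56.7 points
# Claude Opus 4.1: at least 56.9 points
```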

To be direct: these models are still impressive. Scoring 23% on a hard, contamination-resistant benchmark of real production bugs is genuinely difficult. The point isn't that the tools are bad. The point is that the number you've been using to compare them is inflated by roughly 55 points, which makes it useless as a comparison signal.

Why This Matters for Your Team's Decisions

If you're using SWE-bench Verified scores to:

  • Decide which AI coding tool to buy or recommend
  • Justify a tool subscription to your leadership
  • Compare one vendor's capability claims against another's
  • Brief a non-technical stakeholder on "which AI codes best"

...you are making decisions based on noise that correlates more with training data overlap and scaffold optimization than with how the tool will actually perform in your codebase.

The uncomfortable reality is that no one has a clean number right now. SWE-bench Pro is better, but it is still a proxy. LiveCodeBench (which samples from competitive programming problems with cutoff dates after model training) is better for measuring genuine novelty — but coding contest problems aren't production bugs either. Real production bugs involve unclear requirements, multiple interacting systems, historical context, and team conventions that no benchmark captures.

The tool that wins on benchmarks isn't always the tool that wins on your codebase.

A Framework That Actually Works

Here's the evaluation approach I've settled on, in three layers of increasing reliability:

A 3-Layer Evaluation Framework That Actually Works

Layer 1: Use SWE-bench Pro, not Verified — but treat it as a pre-filter only

If you're going to look at a public benchmark, use SWE-bench Pro (Scale AI's leaderboard). Yes, the scores look less impressive than the Verified numbers you're used to seeing. That's the point. Also worth tracking: LiveCodeBench, which structurally prevents memorization by using problems published after training cutoffs.

These numbers can tell you roughly whether a model is in the right tier. They can't tell you whether a specific tool is right for your team.
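The cutoff idea is easy to picture as a date filter. This is a sketch of the principle only, not LiveCodeBench's actual code or API: the `Problem` type, the function name, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    published: date


def unseen_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


# Hypothetical problems and cutoff, for illustration only.
problems = [
    Problem("p1", date(2024, 3, 1)),
    Problem("p2", date(2025, 1, 15)),
    Problem("p3", date(2025, 6, 30)),
]
cutoff = date(2024, 12, 31)

print([p.problem_id for p in unseen_problems(problems, cutoff)])  # ['p2', 'p3']
```

The same filter applies to your own evaluation set: anything public that predates the model's cutoff is a candidate for memorization.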

Layer 2: Build an internal benchmark from your actual backlog

This is the evaluation that actually informs the decision, and it takes one weekend.

Pull 20 real tasks from the last 60 days of your backlog: bugs, small features, refactors. Pick tasks with a clear definition of done that you can verify quickly. Run each tool you're considering on all 20 tasks. Measure:

  • Time from prompt to a PR you'd actually review — not "time to generated code," which is meaningless if the output requires hours of fixup
  • Iterations needed before the approach was right — how often did the first attempt understand the right file, the right abstraction, the right scope?
  • Failure modes — did it break something silently? Did it invent APIs that don't exist? Did it refactor something it wasn't asked to touch?

This test is grounded in your stack, your conventions, your task complexity distribution. No benchmark can replicate it.
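If you want something more structured than a spreadsheet, the three measurements fit in a small record per task run. The sketch below is one possible shape, assuming you log results by hand; the field names, tool names, and failure-mode labels are all placeholders I made up.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Optional


@dataclass
class TaskRun:
    tool: str
    task_id: str
    minutes_to_reviewable_pr: Optional[float]  # None if no reviewable PR came out
    iterations: int                            # attempts before the approach was right
    failure_modes: list[str] = field(default_factory=list)


def summarize(runs: list[TaskRun], tool: str) -> dict:
    """Aggregate one tool's runs into the three measurements described above."""
    mine = [r for r in runs if r.tool == tool]
    finished = [r for r in mine if r.minutes_to_reviewable_pr is not None]
    failures: dict[str, int] = {}
    for run in mine:
        for mode in run.failure_modes:
            failures[mode] = failures.get(mode, 0) + 1
    return {
        "tasks": len(mine),
        "completed": len(finished),
        "median_minutes": median([r.minutes_to_reviewable_pr for r in finished]) if finished else None,
        "first_try_rate": sum(r.iterations == 1 for r in finished) / len(mine) if mine else 0.0,
        "failure_modes": failures,
    }


# Hypothetical results for one tool on three backlog tasks.
runs = [
    TaskRun("tool_a", "BUG-1", 30.0, 1),
    TaskRun("tool_a", "BUG-2", 50.0, 3, ["invented_api"]),
    TaskRun("tool_a", "FEAT-1", None, 4, ["out_of_scope_refactor", "invented_api"]),
]
print(summarize(runs, "tool_a"))
```

Counting tasks that never reached a reviewable PR in the denominator matters: a tool that is fast on the tasks it finishes but abandons a third of them should not look good.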

Layer 3: Measure in production for 30 days

After you've picked a tool and shipped it to your team, look at three numbers:

Suggestion acceptance rate — track it weekly. This is your team's aggregate quality signal, quantified. If it's declining over the first month, the tool isn't fitting your workflow or codebase.

PR merge rate delta — compare AI-assisted PRs against your baseline for time-to-merge and number of review rounds. A tool that generates PRs that require three times the review cycles is a net negative regardless of how fast it wrote the code.

Post-merge bug rate — compare AI-assisted PRs against your 90-day baseline bug rate. This is the metric that engineering leadership and product management actually care about and the one that tells you whether the tool is making your software measurably better or just making it faster to write.
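Here is a minimal sketch of how those three numbers could be computed once you have the raw data. It assumes you can export PRs with a merge time, a review-round count, and a flag for whether a post-merge bug was traced back to them; the type, function names, and sample values are hypothetical, not tied to any particular tool's API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PullRequest:
    hours_to_merge: float
    review_rounds: int
    caused_bug: bool  # a post-merge bug was traced back to this PR


def weekly_acceptance(weeks: list[tuple[int, int]]) -> list[float]:
    """Each week is (suggestions accepted, suggestions shown)."""
    return [accepted / shown if shown else 0.0 for accepted, shown in weeks]


def is_declining(rates: list[float]) -> bool:
    """True if every week is lower than the one before it."""
    return len(rates) > 1 and all(b < a for a, b in zip(rates, rates[1:]))


def bug_rate(prs: list[PullRequest]) -> float:
    return sum(p.caused_bug for p in prs) / len(prs) if prs else 0.0


def deltas(ai: list[PullRequest], baseline: list[PullRequest]) -> dict[str, float]:
    """AI-assisted minus baseline; positive means the AI-assisted PRs were worse."""
    return {
        "hours_to_merge": mean([p.hours_to_merge for p in ai]) - mean([p.hours_to_merge for p in baseline]),
        "review_rounds": mean([p.review_rounds for p in ai]) - mean([p.review_rounds for p in baseline]),
        "bug_rate": bug_rate(ai) - bug_rate(baseline),
    }


# Hypothetical data: a 90-day baseline and the first AI-assisted PRs.
baseline = [
    PullRequest(10.0, 1, False),
    PullRequest(14.0, 2, False),
    PullRequest(12.0, 1, True),
    PullRequest(12.0, 2, False),
]
ai_assisted = [PullRequest(8.0, 2, False), PullRequest(10.0, 3, True)]

print(is_declining(weekly_acceptance([(120, 400), (110, 400), (90, 400)])))  # True
print(deltas(ai_assisted, baseline))
# {'hours_to_merge': -3.0, 'review_rounds': 1.0, 'bug_rate': 0.25}
```

In this made-up sample the AI-assisted PRs merge three hours faster but need an extra review round and carry a higher bug rate, which is exactly the trade-off a speed-only metric would hide. With samples this small the deltas are noise; treat them as a prompt for a closer look, not a verdict.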

Most teams skip Layer 3 entirely. It's the only feedback loop that closes.

A Note on the Tools Themselves

None of this means the tools are bad. I use Claude Code daily for large-context reasoning across unfamiliar codebases — it's genuinely excellent for that. Cursor is hard to beat for IDE-native flow and fast autocomplete. Copilot remains underrated for teams that don't want to change their editor and just need a solid, affordable assistant.

The 2026 survey data suggests experienced developers average 2.3 AI tools. The tools are not substitutes for one another. They have different strengths and different optimal task types. The team that uses Cursor for daily editing and Claude Code for complex multi-file refactors is not being inefficient; they've accurately matched tools to tasks.

The problem is when you pick a tool based on a benchmark that measures recall, and then wonder why your engineering velocity metrics don't match the marketing slide.

The Bottom Line

SWE-bench Verified is a contaminated test. The delta between its scores and the contamination-resistant alternative is roughly 50 to 57 points for every frontier model. OpenAI retired it. The numbers everyone is quoting in tool comparisons are measuring how well a model retrieves answers it already encoded during training, not how well it solves novel code problems.

Use SWE-bench Pro as a rough signal. Build a small internal eval from tasks you've actually worked on. Measure production outcomes after 30 days.

The best benchmark for your team is a task from your actual backlog. Run it. Time it. Judge it.

That's the whole framework.


Sources and further reading: Scale AI SWE-bench Pro Leaderboard · OpenAI on retiring SWE-bench Verified · SWE-bench saturation analysis — AgentMarketCap · Why most LLM benchmarks mislead — dasroot.net
