81% Is Marketing. AI Coding Benchmarks Are Contaminated — Here's the Real Score.
SWE-bench Verified is broken. OpenAI officially stopped using it. Models that score above 80% on Verified manage only 23% on the contamination-resistant version. Here's what happened, why it matters, and how to actually evaluate AI coding tools.