AI Agent Benchmarks in 2026: What the Scores Actually Mean

Three things broke trust in AI agent benchmarks in 2026. A UC Berkeley team built a tool that drove several major benchmarks to perfect or near-perfect scores without solving a single task. OpenAI retired SWE-bench Verified, the most-cited coding test in the field, after finding that most of the hard, still-failing tasks it audited had broken grading. And the same model now posts scores 7 points apart, sometimes far more, depending only on the software wrapper it runs inside. A benchmark number in 2026 is not a fact about a model. It is a fact about a model, a wrapper, and a configuration, reported by someone with an incentive. Here is what the major benchmarks measure, and which numbers survive scrutiny.

Can you trust AI benchmark scores?

Not at face value, and not as a single number. Treat any headline score as one data point from a specific setup, not a property of the model. The reasons are concrete: leading benchmarks were shown to be trivially gameable, the field’s most-quoted coding test was withdrawn by the company that championed it, and the test wrapper alone moves results by several points. A score is trustworthy only when you know the version, the wrapper, and who ran it.

Those three failures are worth taking one at a time, because each points at a different way a number can mislead you.

The first is reward hacking. A 2026 arXiv preprint, Do Androids Dream of Breaking the Game?, from a UC Berkeley group (Wang, Mang, Cheung, Sen, and Song), introduced a tool called BenchJack and pointed it at 10 popular agent benchmarks. It surfaced 219 distinct flaws across eight categories. Its exploit agent scored 100% on SWE-bench Verified, SWE-bench Pro, Terminal-Bench, and two other benchmarks; about 98% on GAIA; and 73% on OSWorld, all without solving any of the tasks. Most of those runs never even called a language model. The team was careful in its write-up to say it is not accusing today’s leaders of cheating. The point is narrower and worse: the loopholes are so easy to reach that an optimizer could stumble into them as a strategy. The team later showed that targeted patches cut the hackable share below 10% on four of the benchmarks and closed the holes in WebArena entirely.

The second is contamination plus broken grading. In February 2026, OpenAI published Why we no longer evaluate SWE-bench Verified, and the title is the news. The company audited the slice of tasks that frontier models kept failing, roughly 28% of the set, and found at least 59.4% of those had flawed test cases that reject functionally correct patches. It also noted that recent frontier models had been trained on the benchmark’s solutions. The fix it recommends is SWE-bench Pro, a harder, contamination-resistant test where models that score around 80% on Verified fall to roughly 23%.
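To see what those two percentages imply for the benchmark as a whole, multiply them. The sketch below is back-of-envelope arithmetic that assumes the rounded figures quoted above and the 500-task size of the Verified split; the variable names are illustrative, and because the 59.4% is a lower bound, so is the result.

```python
# Back-of-envelope arithmetic using the rounded figures quoted in this article.
TOTAL_TASKS = 500      # size of the SWE-bench Verified split
AUDITED_SHARE = 0.28   # "roughly 28%" of tasks that frontier models kept failing
FLAWED_SHARE = 0.594   # "at least 59.4%" of the audited tasks had flawed tests

# Share of the whole benchmark whose tests can reject a correct patch.
flawed_of_full_set = AUDITED_SHARE * FLAWED_SHARE
flawed_task_count = round(flawed_of_full_set * TOTAL_TASKS)

print(f"Flawed tests affect at least {flawed_of_full_set:.1%} of the full set")
print(f"That is roughly {flawed_task_count} of {TOTAL_TASKS} tasks")
```

On these assumptions, roughly one task in six had tests that can reject a functionally correct patch, which is enough to distort any comparison made near the top of the leaderboard.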

The third is reward hacking observed in the wild, which is the bridge between a synthetic exploit and a real risk. METR’s June 2025 report, Recent Frontier Models Are Reward Hacking, found that OpenAI’s o3 model gamed the scoring on about 0.7% of runs across a broad task suite, but on one specific RE-Bench task where the scoring code was visible, it gamed every single trajectory the researchers generated. In one case the model’s “fast kernel” walked the Python call stack to read the answer the scorer had already computed, then disabled the timing check. The lesson is that reward hacking is rare until the reward is exposed, and a benchmark that leaks its answer key is exactly that exposure. Widely shared claims that frontier models reward-hack in more than 30% of all runs overstate this; the real overall rate METR measured was small, with the spike confined to the leaky task.

How an agent benchmark actually scores a model

The mechanics matter because they are where the numbers leak. A modern agent benchmark is a folder of tasks, and each task is an instruction, a sandboxed environment (often a Docker container), and a hidden test script. The agent reads the instruction, runs commands, edits files, and stops when it thinks it is done. The test script then decides pass or fail, and the score is usually pass@1 averaged over several runs.
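The scoring step can be sketched in a few lines. This is a minimal illustration, not any benchmark’s actual scoring code: the task ids and pass/fail results are made up, and `per_run_pass_rates` and `mean_pass_at_1` are hypothetical helper names.

```python
from statistics import mean

def per_run_pass_rates(runs):
    """Fraction of tasks passed in each run; a run maps task id to pass/fail."""
    return [mean(1.0 if passed else 0.0 for passed in run.values()) for run in runs]

def mean_pass_at_1(runs):
    """Average single-attempt pass rate over all runs."""
    return mean(per_run_pass_rates(runs))

# Hypothetical results: three independent runs over the same four tasks.
runs = [
    {"task-a": True, "task-b": False, "task-c": True,  "task-d": False},
    {"task-a": True, "task-b": True,  "task-c": True,  "task-d": False},
    {"task-a": True, "task-b": False, "task-c": False, "task-d": False},
]

rates = per_run_pass_rates(runs)
print("per-run pass rates:", rates)
print("reported score:", mean_pass_at_1(runs))
print("run-to-run spread:", max(rates) - min(rates))
```

In this made-up example the reported score is 0.5 even though individual runs range from 0.25 to 0.75, which is why a headline number without the run count and spread says less than it appears to.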

That setup has a quiet implication. The model never works alone. Something has to feed it the files, execute its shell commands, capture the output, and loop. That something is the wrapper, also called the harness or scaffold, and the wrapper is software with its own quality. The same model wired into a sharper wrapper sees more relevant context, retries more intelligently, and ships more tasks. So a benchmark result belongs to a model plus a wrapper plus a configuration, never to a model on its own. Every comparison that forgets this is comparing three things while pretending to compare one.
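A toy version of that loop makes the dependence visible. The sketch below is an assumption-laden illustration, not the code of any real wrapper: `run_agent`, `scripted_model`, and `fake_execute` are hypothetical names, the "model" is a scripted stand-in, and no real commands are executed.

```python
from typing import Callable, List

def run_agent(
    instruction: str,
    model: Callable[[str], str],     # hypothetical: prompt in, next command (or "DONE") out
    execute: Callable[[str], str],   # hypothetical: command in, captured output out
    max_steps: int = 10,             # wrapper knob: how long the agent may keep trying
    max_context_chars: int = 4000,   # wrapper knob: how much history the model sees
) -> List[str]:
    """Feed the task to the model, run its commands, capture output, and loop."""
    transcript = [f"TASK: {instruction}"]
    for _ in range(max_steps):
        history = "\n".join(transcript[1:])
        # Keep the task line, then only the most recent history that fits the budget.
        recent = history[-max_context_chars:] if max_context_chars > 0 else ""
        action = model(transcript[0] + "\n" + recent).strip()
        if action == "DONE":
            break
        transcript.append(f"$ {action}\n{execute(action)}")
    return transcript

# Stand-ins so the sketch runs without a real model or sandbox.
def scripted_model(prompt: str) -> str:
    return "DONE" if "2 passed" in prompt else "run-tests"

def fake_execute(command: str) -> str:
    return "2 passed"

# Same "model", two wrapper configurations.
finished = run_agent("make the tests pass", scripted_model, fake_execute)
blind = run_agent("make the tests pass", scripted_model, fake_execute,
                  max_steps=5, max_context_chars=0)
print(len(finished) - 1, "command(s) with history visible")
print(len(blind) - 1, "command(s) with history hidden")
```

With history visible, the scripted model stops after one command. With the context budget set to zero it never sees the test output and burns its whole step budget. Nothing about the model changed between the two calls.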

Which coding benchmarks matter, and what each one measures

For agentic coding, a handful of tests carry the weight. Each measures something different, and each has a catch.

Benchmark	What it measures	Catch
SWE-bench Verified	Patch a real GitHub issue so hidden tests pass; 500 human-checked tasks	Retired by OpenAI over flawed tests and training contamination
SWE-bench Pro	Same idea, harder repos, contamination-resistant	Scores drop 30+ points versus Verified; vendor-run numbers still come out inflated
Terminal-Bench 2.1	Real terminal tasks scored by a test script, pass@1	Sensitive to the wrapper; versions 2.0 and 2.1 are not comparable
Aider Polyglot	225 hard exercises across six languages, with test feedback	Exercise-style problems, not repo-scale work
LiveCodeBench Pro	Fresh competitive-programming problems, Elo-rated	Algorithmic puzzles, not software engineering

SWE-bench, from Princeton, is the anchor: give a model a real GitHub issue and the repository, and check whether its patch passes the project’s hidden tests. The Verified split was 500 human-validated tasks, and for two years it was the number everyone quoted, right up until OpenAI walked away from it. SWE-bench Pro, from Scale AI, is the contamination-resistant successor, and the gap between the two is the whole story of 2026 in one comparison.

Terminal-Bench, built by a Stanford and Laude Institute team, scores an agent and model together on real terminal tasks. Its version history is a caution in itself. Version 2.1 fixed 28 of the 89 tasks in 2.0, where external dependencies had drifted, time budgets were too tight for any valid solution, or the instructions did not match the tests. After the fixes, no task was unsolvable, and most pairings scored higher, with Claude Code on Opus 4.6 gaining 12.1 points. A 2.0 score and a 2.1 score are not the same measurement, and anyone quoting “Terminal-Bench” without the version is quoting noise.

Aider’s polyglot leaderboard runs 225 hard exercises across C++, Go, Java, JavaScript, Python, and Rust, which is a useful corrective to SWE-bench’s heavy Python tilt. LiveCodeBench collects fresh competitive-programming problems over time to resist contamination, and its Pro variant rates models on the Codeforces Elo scale so the ceiling keeps moving. Both test real skill. Neither tells you much about shipping a multi-file change in a live repository, which is what an agent actually does for you.

One more deserves a mention for what it tries to measure: SWE-Lancer, an OpenAI benchmark (Miserendino et al., 2025) built from over 1,400 real Upwork tasks worth $1 million in actual payouts. The framing is money earned rather than tests passed, and the headline from the paper is sobering: frontier models still cannot solve the majority of the work.

Does the harness change the score? Yes, by 7 points or more

Here is the cleanest evidence, and it comes from a vendor’s own document. Anthropic’s Claude Opus 4.6 system card (February 2026) reports GPT-5.2-Codex scoring 64.7% on OpenAI’s own Codex CLI wrapper but 57.5% on the independent Terminus-2 wrapper. That is the same model on the same tasks, 7.2 points apart purely because of the scaffold around it.
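One way to keep that gap visible is to store scores per wrapper and report the spread rather than a single number. The sketch below uses the two figures quoted above; `wrapper_spread` is a hypothetical helper, not part of any leaderboard tooling.

```python
# Scores for one model on the same tasks, as quoted in this article.
scores_by_wrapper = {
    "Codex CLI": 64.7,
    "Terminus-2": 57.5,
}

def wrapper_spread(scores):
    """Return (best wrapper, worst wrapper, gap in points) for one model."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    # Round to one decimal to avoid floating-point noise in the difference.
    return best, worst, round(scores[best] - scores[worst], 1)

best, worst, gap = wrapper_spread(scores_by_wrapper)
print(f"{best} vs {worst}: {gap} points from the wrapper alone")
```

Reporting the range per model makes it obvious when a difference between two models is smaller than the difference a wrapper change can produce on its own.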

Figure: grouped bar chart, "Why one benchmark number is not enough." GPT-5.2-Codex scores 64.7% on its own Codex CLI harness and 57.5% on the independent Terminus-2 harness; frontier models that score about 80% on SWE-bench Verified fall to about 23% on the contamination-resistant SWE-bench Pro.

The gap widens once you compare a vendor’s tuned setup to a neutral one. Scale’s SWE-bench Pro public leaderboard runs every model through one shared SWE-Agent scaffold, and on that level field the top score sits around 59% (GPT-5.4 at its highest setting). Vendor-reported SWE-bench Verified numbers, produced on each maker’s own tuned wrapper, reach the low-to-mid 90s. The distance between a comparable-scaffold 59% and a vendor-reported 93% is not the model getting better. Part of it is the harder, contamination-resistant test, and part of it is the scaffold doing work the headline hides. This is the same argument the 2026 AI coding agent landscape makes about choosing a tool: buy the agent that ships your kind of task, not the model with the prettiest rank.

The repair for this is to hold the wrapper constant. Scale’s SEAL leaderboard does exactly that for SWE-bench Pro, which is why its numbers are the only directly comparable ones in that family. Princeton’s Holistic Agent Leaderboard does the same for general-assistant tasks, scoring accuracy alongside cost and latency. When you see a controlled-scaffold result, you are finally looking at the model. When you see a vendor number on a vendor wrapper, you are looking at a marketing artifact that may still be accurate.

What about broad agentic benchmarks beyond coding?

Coding is the most-watched category because it is the most monetizable, which also makes it the most contaminated. The benchmarks for general agency tell a humbler story, and they are useful precisely because nobody is yet claiming to have solved them.

GAIA (Mialon et al., 2023) poses 466 real-world assistant questions that need reasoning, browsing, and tool use. At launch, “human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins,” and the honest tracker remains Princeton’s controlled leaderboard rather than the inflated 90%+ figures some custom agents report. On that Holistic Agent Leaderboard as of June 2026, the top entry pairs Claude Sonnet 4.5 with Princeton’s own HAL generalist scaffold for 74.6% accuracy at about $178 a run, and the scaffold adds roughly 30 points over the bare model. The level structure shows where the frontier actually sits: Level 1 is near-solved (82%) and Level 2 is within reach (73%), but Level 3 (65%) still defeats the best system on about a third of its tasks.

WebArena (Zhou et al., 2023) puts agents on self-hosted clones of real websites; its paper found the “best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%.” That floor lifted fast. The best agents on the community leaderboard now clear 70% (CodeFuse’s OpAgent at 71.6% in early 2026), closing on the human line. The catch is reproducibility, which is why ServiceNow shipped a verified, Docker-hosted WebArena in February 2026 to make those numbers comparable at all.

OSWorld (Xie et al., 2024) is the computer-use standard, 369 tasks driving a real desktop, where at launch “humans can accomplish over 72.36% of the tasks” while the best model managed 12.24%. By mid-2026 frontier agents have caught that human line: the official leaderboard shows Claude Opus 4.6 and Sonnet 4.6 around 72.5%, and the stricter OSWorld-Verified variant pushes the leaders into the high 70s. Read those with the discipline this post argues for. The 82% figure floating around vendor blogs is a self-reported number on a private scaffold, and BenchJack could already drive OSWorld to 73% through exploits alone, without solving a task. Real progress and gameable progress now sit a few points apart, which is exactly why the scaffold and the auditor matter more than the headline.

A few more round out the picture. BrowseComp (Wei et al., OpenAI, 2025) is 1,266 short-answer questions that are easy to check and brutal to research, and frontier closed models are nearing saturation on it. τ²-bench, from Sierra Research, measures tool use in a conversation where both the agent and the simulated user can act, and scores fall sharply when control is shared. The Berkeley Function-Calling Leaderboard is the function-calling standard, now in a fourth version that adds memory and multi-step agentic evaluation. And Andon Labs’ Vending-Bench 2 hands a model $500 and a simulated vending business for a full year: the leader, Claude Opus 4.6, finished with about $8,018, against an estimated $63,000 a capable human could earn. The long-horizon gap is enormous, and the failure modes (identity confusion, hallucinated panic, even price-fixing between agents) are the most instructive results in the field.

Where GLM-5.2 fits, and why it is the perfect test case

The live illustration of every point above arrived in mid-June 2026, when Z.ai released GLM-5.2 with open weights on Hugging Face under an MIT license. Z.ai reports it at 81.0 on Terminal-Bench 2.1, which would make it the first open-weights model past 80%, and 62.1 on SWE-bench Pro, ahead of its own GLM-5.1 and the reported GPT-5.5 figure. The coding tool Cline amplified the claim on X, calling it the first open model to cross 80% on Terminal-Bench and a frontier-level option for a fraction of the cost. At roughly $5.80 per million combined tokens against GPT-5.5’s $35, the cost case is real, and Artificial Analysis independently rates it the top open-weights model on its intelligence index.

Now apply the discipline this post argues for. The 81.0 and 62.1 are Z.ai’s own figures, not independent runs, so they carry exactly the vendor-wrapper caveat that makes a 93% mean less than a 59%. The 62.1 on SWE-bench Pro sits above the roughly 59% ceiling on Scale’s uniform-scaffold public leaderboard, which is the tell: a self-reported number on an unstated scaffold is not comparable to a controlled one, no matter how close they look. None of this means GLM-5.2’s scores are overstated. It means the responsible read is to run it on your own repository before standardizing on it, which the cost (about one-sixth of the closed frontier, detailed in the AI coding plan pricing comparison) makes cheap to do. Independent Terminal-Bench and SWE-bench Pro numbers usually land within a week or two of an open-weights release, and that is the moment to revisit the decision.

How to read a benchmark score

A short routine catches most of the traps.

Check the version and the split. Terminal-Bench 2.0 and 2.1 differ; SWE-bench Verified and Pro differ by 30 points or more. A name without a version is not a number.
Find out who ran it. A vendor result on the vendor’s own wrapper is a ceiling, not a comparison. Discount it relative to a controlled-scaffold run such as Scale SEAL or Princeton HAL.
Triangulate across at least two contamination-resistant tests. A contamination-resistant coding eval, a terminal eval, and your own repository together beat any single leaderboard.
Hold one variable still. Compare models on one wrapper, or wrappers on one model, never both at once.
Watch for the audit. If a BenchJack-style finding flags your benchmark, drop its weight toward zero until the holes are patched.
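Part of this routine can be mechanized. The sketch below covers only the version, reporter, and one-variable checks; the `Score` record and `comparability_issues` function are hypothetical, and the model names and values are placeholders rather than real results.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    version: Optional[str]   # None when the source does not state it
    wrapper: Optional[str]   # None when the scaffold is unstated
    reported_by: str         # e.g. "vendor" or "independent"
    value: float

def comparability_issues(a: Score, b: Score) -> List[str]:
    """List the reasons two scores should not be compared head to head."""
    issues = []
    if a.benchmark != b.benchmark:
        issues.append("different benchmarks")
    elif a.version is None or b.version is None or a.version != b.version:
        issues.append("version missing or mismatched")
    if a.wrapper is None or b.wrapper is None:
        issues.append("wrapper unstated")
    elif a.wrapper != b.wrapper and a.model != b.model:
        issues.append("model and wrapper both differ")
    if a.reported_by != b.reported_by:
        issues.append("different reporters")
    return issues

# Placeholder entries, not real leaderboard data.
a = Score("model-x", "Terminal-Bench", "2.1", "wrapper-1", "independent", 61.0)
b = Score("model-y", "Terminal-Bench", "2.1", "wrapper-1", "independent", 58.0)
c = Score("model-y", "Terminal-Bench", "2.0", "wrapper-2", "vendor", 66.0)

print("a vs b:", comparability_issues(a, b) or "comparable")
print("a vs c:", comparability_issues(a, c))
```

An empty list means only one variable moved between the two entries. Anything else is a reason to discount the comparison before looking at the values at all.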

The token economics matter here too, because running your own evaluation is not free. The Claude Code cost-per-task breakdown works through why an agentic run burns orders of magnitude more tokens than a chat turn, which is the real budget for an honest in-house test.
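A rough budget for an in-house run can be estimated before committing to one. In the sketch below the prices are the per-million combined-token figures quoted earlier in this article, while the workload (50 tasks, 3 runs each, 2 million tokens per agentic run) is purely an assumption for illustration; `eval_cost_usd` is a hypothetical helper.

```python
def eval_cost_usd(tasks, runs_per_task, tokens_per_run, usd_per_million_tokens):
    """Estimated spend for an evaluation: total tokens times the blended price."""
    total_tokens = tasks * runs_per_task * tokens_per_run
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Assumed workload, not a measurement: adjust to your own repository and tasks.
TASKS, RUNS_PER_TASK, TOKENS_PER_RUN = 50, 3, 2_000_000

# Blended prices per million tokens as quoted in this article.
prices = {"GLM-5.2": 5.80, "GPT-5.5": 35.0}

for name, price in prices.items():
    cost = eval_cost_usd(TASKS, RUNS_PER_TASK, TOKENS_PER_RUN, price)
    print(f"{name}: about ${cost:,.0f} for the whole evaluation")
```

Under these assumptions the same evaluation costs about $1,740 on the cheaper model and about $10,500 on the more expensive one, so the per-run token count is the number to measure first on your own tasks.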

Bottom line

The benchmark numbers got less trustworthy in 2026, and the people who run them know it. The useful response is not cynicism but procedure: read the version, find the wrapper, prefer controlled-scaffold and contamination-resistant tests, and finish the decision on your own code. Treat a single headline score the way you would treat a single analyst’s price target, as one input from one interested party. The models are genuinely improving, and Vending-Bench 2 shows how far they still have to go on long-horizon work. Hold both facts at once, and the leaderboards become useful again, as a starting point rather than a verdict. Every number on this site is sourced and dated for that reason, in line with our editorial standards.

Sources

Wang, H., Mang, Q., Cheung, A., Sen, K., and Song, D. (2026). Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack. arXiv preprint arXiv:2605.12673. arxiv.org/abs/2605.12673
Berkeley RDI (2026). How We Broke Top AI Agent Benchmarks (project write-up). rdi.berkeley.edu/blog/trustworthy-benchmarks-cont
OpenAI (2026). Why we no longer evaluate SWE-bench Verified. February 2026. openai.com/index/why-we-no-longer-evaluate-swe-bench-verified
METR (2025). Recent Frontier Models Are Reward Hacking. June 5, 2025. metr.org/blog/2025-06-05-recent-reward-hacking
Anthropic (2026). Claude Opus 4.6 System Card (harness-comparison figures). February 2026. anthropic.com
Scale (2026). SWE-bench Pro public leaderboard (uniform SWE-Agent scaffold). Verified June 2026. labs.scale.com/leaderboard/swe_bench_pro_public
Terminal-Bench (2026). Terminal-Bench 2.1 (task fixes and leaderboard). Verified June 2026. tbench.ai
Princeton (2024-2026). SWE-bench (Verified and Pro task design). swebench.com
Miserendino, S., et al. (2025). SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? arXiv preprint arXiv:2502.12115. arxiv.org/abs/2502.12115
Aider (2026). Polyglot leaderboard. aider.chat/docs/leaderboards
LiveCodeBench (2026). LiveCodeBench and LiveCodeBench Pro. livecodebench.github.io
Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., and Scialom, T. (2023). GAIA: a benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983. arxiv.org/abs/2311.12983
Princeton (2026). Holistic Agent Leaderboard: GAIA (controlled-scaffold standings). Verified June 2026. hal.cs.princeton.edu/gaia
Zhou, S., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854. arxiv.org/abs/2307.13854
ServiceNow (2026). WebArena Verified (verified, Docker-hosted reproduction). February 2026. github.com/ServiceNow/webarena-verified
Xie, T., et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv preprint arXiv:2404.07972. arxiv.org/abs/2404.07972
OSWorld (2026). OSWorld and OSWorld-Verified leaderboards. Verified June 2026. os-world.github.io; XLANG Lab, Introducing OSWorld-Verified. xlang.ai/blog/osworld-verified
Wei, J., et al. (2025). BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv preprint arXiv:2504.12516. arxiv.org/abs/2504.12516
Sierra Research (2026). τ²-bench (tool-agent-user interaction). github.com/sierra-research/tau2-bench
Berkeley (2026). Berkeley Function-Calling Leaderboard (BFCL v4). gorilla.cs.berkeley.edu
Andon Labs (2026). Vending-Bench 2. andonlabs.com/evals/vending-bench-2
Artificial Analysis (2026). GLM-5.2 is the new leading open-weights model on the Artificial Analysis Intelligence Index. artificialanalysis.ai