AI models are outrunning the benchmarks built to measure them, and the gap is widening
Task-length doubling times under four months, benchmark scores jumping from 1% to 40% in a year, and evaluators openly admitting they cannot build tests long enough to challenge current models. The measurement infrastructure for AI is falling behind the thing it is supposed to measure.
The benchmark problem is no longer theoretical. Nathan Labenz, citing METR data, puts the doubling time for AI task length at under four months, which implies an 8 to 12x expansion over a single year. At that pace, the scaffolding researchers use to measure progress becomes obsolete before the next round of results is published. The people building the tests have begun to say so openly. As Labenz put it, the evaluators themselves are “really struggling to have tasks long enough to even be able to evaluate these things on.”
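As a rough check on that arithmetic, the sketch below converts a doubling time into a twelve-month growth factor. It assumes smooth exponential growth at a constant rate, which is a simplification made here for illustration rather than something the quoted METR figures guarantee. The function names are hypothetical.

```python
import math


def annual_growth_factor(doubling_months: float) -> float:
    """Growth multiple over 12 months, assuming steady exponential growth."""
    return 2 ** (12 / doubling_months)


def doubling_months_for(annual_factor: float) -> float:
    """Doubling time in months that would produce the given 12-month multiple."""
    return 12 / math.log2(annual_factor)


# Illustrative doubling times, chosen here for the example (not from the source).
for months in (4.0, 3.5, 3.0):
    factor = annual_growth_factor(months)
    print(f"doubling every {months} months -> {factor:.1f}x per year")

# What doubling time corresponds to the upper end of the quoted range?
print(f"12x per year needs a doubling time of {doubling_months_for(12):.2f} months")
```

Under this assumption, a doubling time of exactly four months gives 8x per year, and the 12x end of the range corresponds to a doubling time of roughly 3.35 months. A three-month doubling time would give 16x.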
Brendan Foody supplies a concrete number for how fast the ceiling has moved. Frontier model scores on the Apex benchmark went from 1% to 40% in twelve months. That kind of movement does not describe incremental progress inside a stable measurement system. It describes a system being outgrown. When benchmark constructors are explicitly racing to stay ahead of the models they are supposed to grade, the benchmark is no longer functioning as a fixed reference point.
The performance data underneath those benchmark scores is striking on its own terms. Carina Hong reports that Axiom Math scored 120 out of 120 on the 2025 Putnam exam, above both DeepSeek’s best large language model score of 103 and the best human score of 110. On a separate proof benchmark, Axiom’s system solved 187 of 189 problems, a 99% completion rate, while prior models ranged from roughly 3.6% for GPT to 22% for the iterated approach, with Copa and DeepSeek Prover each in the 11 to 12% range. These are not marginal improvements over prior results. They are the kind of jumps that make the prior generation of benchmarks look like the wrong instrument entirely.
“The METR people are like, ‘We are really struggling to have tasks long enough to even be able to evaluate these things on.’” Nathan Labenz
The security domain tells a similar story in a different register. Krishna Rao notes that Anthropic’s Mythos model found 250 security vulnerabilities in an open-source codebase where an earlier model had found 22. That is more than a tenfold increase in vulnerabilities detected. Rao also observes that, from what his team can see, frontier scaling laws are not slowing down, a direct challenge to the widely held assumption that capability gains were approaching some natural ceiling.
The Vending Bench data complicates the picture in a useful way. Axel Backlund notes that earlier models crashed out partway through Vending Bench’s simulated year, while current models now survive the full simulation. That is clear progress. But Sergiy Nesterenko adds that a strong human performer would still score roughly 10 times what current models achieve on that benchmark. So the same benchmark that shows models improving dramatically also shows that the instrument still has room to run: the gap between model and human-level performance on complex, extended tasks is not yet closed. Both things are true at once, and that tension is precisely what makes the benchmark problem hard to solve cleanly.
What makes the saturation dynamic difficult to manage is that it is not symmetric: tests take time to build, and the models they target keep moving while that work is under way. Building a benchmark that reliably measures frontier capability requires constructing tasks long enough, open-ended enough, and novel enough to challenge those models. Martin Casado observes that a given model remains relevant for only three to nine months before being superseded. The evaluators are not slow. The models are simply faster, and the window in which any fixed test set is genuinely diagnostic is narrowing.
The implications extend beyond academic measurement. If the tools for reliably upper-bounding model capability are degrading in real time, then the institutions and industries that depend on those tools to make deployment and safety decisions are working with increasingly incomplete information. A benchmark score that was meaningful six months ago may tell a buyer, a regulator, or a safety researcher very little about what the current model can actually do. That is not a problem the benchmark community can solve by working harder on the same approach. It may require a different theory of evaluation altogether, one built around dynamic, adversarially updated tasks rather than fixed test sets that models can outgrow before the results are even published.