The Benchmark Stopped Meaning What It Used To Mean

Capability is now a dial you turn with dollars, and OpenAI's own people are saying the evals and the safety frameworks were built for a different world.

By The Editor

For two years the AI conversation has run on a simple grammar: a new model arrives, it posts a benchmark score, the score is the capability. That grammar quietly broke this week, and the people saying so loudest were the ones who built the benchmarks.

Noam Brown said the quiet part directly: “the proper way to evaluate the models now is you either have some kind of budget for the benchmark whether it’s tokens or cost or time or whatever or you plot the performance as a function of the amount of test time compute.” A single score off a single run is no longer the unit. Capability is a curve against spend. Brown also noted you can beat prior benchmarks “by just for example scaffolding a bunch of models together,” which makes the leaderboard number a function of orchestration budget, not model quality.
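To make the shift concrete, here is a minimal sketch of what reporting a curve instead of a score could look like. It is an illustration under stated assumptions, not anyone's actual eval harness: the run records, the token budgets, and the helper names `performance_curve` and `score_at_budget` are all hypothetical, and the numbers are invented.

```python
from collections import defaultdict

# Hypothetical run records: (test-time token budget, whether the answer was correct).
# The numbers are made up for illustration, not measurements of any model.
RUNS = [
    (1_000, False), (1_000, True), (1_000, False), (1_000, False),
    (10_000, True), (10_000, True), (10_000, False), (10_000, False),
    (100_000, True), (100_000, True), (100_000, True), (100_000, False),
]


def performance_curve(runs):
    """Return sorted (budget, accuracy) points, one per distinct budget."""
    totals = defaultdict(lambda: [0, 0])  # budget -> [correct, attempts]
    for budget, correct in runs:
        totals[budget][0] += int(correct)
        totals[budget][1] += 1
    return [(b, c / n) for b, (c, n) in sorted(totals.items())]


def score_at_budget(curve, max_budget):
    """Best accuracy achievable without exceeding max_budget, or None."""
    eligible = [acc for budget, acc in curve if budget <= max_budget]
    return max(eligible) if eligible else None


curve = performance_curve(RUNS)
for budget, acc in curve:
    print(f"{budget:>7} tokens: {acc:.2f}")
```

The two helpers map onto the two options in Brown's quote: `performance_curve` is the plot of performance against test-time compute, and `score_at_budget` is the single number you can still report once a budget is fixed. The same model yields a different score at every budget, which is the point.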

Mark Chen called it from inside OpenAI: “we really are kind of in an evals crisis.” The SATs of AI, he said, are “fully saturated,” and “once an eval [is] out in the world, then it’s just already not a good” one. His structural fix, separating the teams that build evals from the teams that optimize models so you “don’t co-incentivize them,” is an admission that the current measurement apparatus is captured by the thing it measures.

The implication Brown drew is the one that should travel furthest. Speaking publicly, he said the “preparedness frameworks and responsible scaling policies, they don’t really account for the amount of test time compute. They just say, what’s the capability of the model?” If capability is a dial you turn with dollars at inference time, then a safety regime indexed to the model is indexed to the wrong variable. The governance scaffolding was built for a world where a model had a capability. It now has a capability curve.

Two other facts from the week reinforce the point. Chen described frontier work as “mostly orchestration focused” now, with “the model’s great enough to do the implementation execution by itself,” meaning the marginal capability gain is coming from how you wire models together, exactly the thing benchmarks fail to price. And a markets commentator relayed a Goldman Sachs warning that “consensus forecasts are underestimating the size of the AI build-out by as much as 50%,” which is what a world looks like when capability scales with spend rather than with parameter counts: the demand curve is for compute, not for cleverness.

The turn is this. The AI industry spent two years training the public, and itself, to read benchmark numbers as truth claims about models. The people running the labs have now said, on the record and in the same week, that the numbers don’t mean that anymore, the evals are saturated, and the safety frameworks were written for a regime that has already ended. Nobody has replaced any of it yet. That gap, between what we measure and what now determines capability, is the actual AI story of the week.

❦

The Editor, for the readers of Signal Headquarters
