Axiom Math's 99% benchmark result is a signal the AI math community should take seriously
Carina Hong says Axiom Math's PandM system, run without modification, solved 187 of 189 problems on the CodeMarina benchmark. Multiple public sources have since corroborated that figure. No prior model on the same benchmark comes close.
The number Carina Hong cited is specific enough to verify: 187 problems solved out of 189 on the CodeMarina benchmark, with no modification to Axiom Math’s PandM system. That is a success rate of roughly 99% (187/189 is about 98.9%). Multiple public sources, including reporting aggregated by BigGo Finance and coverage by Fortune, have since confirmed the same figure. The claim is not an anecdote. It has receipts.
What makes the result legible as a signal rather than a marketing stat is the field it is measured against. Hong described the prior performance landscape in concrete terms: GPT came in around 3.6% on CodeMarina. Models described as “iterated” approaches reached roughly 22%. Copa and DeepSeek Prover both landed in the 11 to 12% range. The gap between those figures and 99% is not a marginal improvement. It is a different order of result entirely. Even the best prior competitor managed less than a quarter of PandM’s rate.
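The arithmetic behind that comparison is easy to check. The snippet below is a minimal sketch that uses only the figures quoted above; the dictionary labels and function names are shorthand chosen for illustration, and the 11 to 12% range is represented by its upper end.

```python
# Figures as quoted in the article; the dictionary keys are shorthand labels,
# not official model names.
PANDM_SOLVED = 187
BENCHMARK_TOTAL = 189

# Prior results as fractions. The article gives "11 to 12%" for Copa and
# DeepSeek Prover; the upper end is used here as an assumption.
PRIOR_RATES = {
    "GPT": 0.036,
    "iterated approaches": 0.22,
    "Copa / DeepSeek Prover": 0.12,
}


def success_rate(solved: int, total: int) -> float:
    """Return the fraction of benchmark problems solved."""
    if total <= 0:
        raise ValueError("total must be positive")
    return solved / total


def best_prior(rates: dict[str, float]) -> tuple[str, float]:
    """Return the (label, rate) pair with the highest prior rate."""
    label = max(rates, key=rates.get)
    return label, rates[label]


pandm_rate = success_rate(PANDM_SOLVED, BENCHMARK_TOTAL)
best_label, best_rate = best_prior(PRIOR_RATES)

print(f"PandM: {pandm_rate:.1%}")
print(f"Best prior ({best_label}): {best_rate:.0%}")
print(f"Best prior as a share of PandM's rate: {best_rate / pandm_rate:.2f}")
print(f"PandM as a multiple of best prior: {pandm_rate / best_rate:.1f}x")
```

On these numbers PandM's rate works out to about 98.9%, and the best prior figure is about 0.22 of it, which is the "less than a quarter" in the paragraph above. Put the other way, PandM's rate is roughly 4.5 times the best prior result.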
That contrast matters for how the field reads benchmarks at all. CodeMarina is a benchmark that pairs code with proof, requiring not just a working solution but a formal verification of correctness. That constraint eliminates the shortcuts that inflate code-generation scores in purely functional tests. A model that solves 3 or 4 problems in a hundred on a proof-paired benchmark and one that solves 99 are not on the same curve. They are doing categorically different things.
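For readers who have not seen a proof-paired task, here is a toy illustration of the general idea. It is a sketch only: the article does not say which proof language or task format CodeMarina uses, so Lean 4 is an assumption, and the names `double` and `double_eq` are hypothetical. It also assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- The "code" half: an ordinary function definition.
def double (n : Nat) : Nat := n + n

-- The "proof" half: a machine-checked statement that the function
-- meets its specification for every input, not just for test cases.
theorem double_eq (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

A submission of this shape counts only if the proof checker accepts the theorem, so an implementation that merely passes a handful of tests would not score.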
“We actually recently, with no modification to the PandM system, saw a 99%. Out of the 189 problems, we solved 187. We missed only two. Code with proof.”
Carina Hong
The “no modification” qualifier in Hong’s description deserves attention too. It is the kind of phrase that usually signals a team testing generalization, running an existing system against a new benchmark without tuning to its specifics. If that description holds, the result is stronger than a benchmark-optimized score would be. A system that reaches 99% on a novel evaluation without any targeted adjustment is making a broader claim about underlying capability than one that was shaped to fit the test.
Axiom Math’s position in the formal math and AI-for-math space has drawn attention from investors, according to reporting tied to a Menlo Ventures board post and surrounding coverage. That funding context does not validate the benchmark result on its own, but it does suggest that sophisticated outside observers have looked at the system and decided the underlying thesis warrants a bet. Benchmarks can be gamed; capital allocation is a slower and less forgiving signal.
The honest caveat is that CodeMarina is one benchmark, and the AI math community has watched benchmark saturation happen quickly before. A 99% score at one point in time tells you where the ceiling was when PandM ran. It does not tell you how fast the field will close the gap, or whether CodeMarina will remain a meaningful discriminator once other teams specifically target it. Those are real questions, and Hong’s result does not answer them.
What the result does establish is a credible proof of capability at a level no prior system had demonstrated on this particular test. When the nearest competitor is at 22% and the rest of the field sits below that, a 99% score is not a refinement. It is a discontinuity. The math community should be asking what architectural or training choices produced it, and whether those choices transfer to problems harder than the benchmark’s current ceiling. The signal is real. The interpretation is still open.