Fine-tuned models are quietly undercutting frontier pricing by an order of magnitude

Operators and toolmakers are converging on a stark claim: purpose-built, fine-tuned models can deliver 10x or better cost reductions and roughly 70% lower latency compared to frontier models. The framing is early but the numbers are concrete enough to take seriously.

By The Editor

A claim is starting to circulate that deserves closer attention than it has received: that custom fine-tuned models can beat frontier pricing by an order of magnitude or more while running at a fraction of the latency. None of the three speakers cited here was making a theoretical argument. All were talking about production systems.

Kyle Corbitt, speaking on Cognitive Revolution, put the general case plainly. Fine-tuned models routinely deliver order-of-magnitude improvements in cost per token, and often better than that. On latency, Corbitt said custom models can get response times down to roughly 30% of what frontier models produce, with similar or higher output quality. That is not a marginal gain. If the numbers hold across use cases, it reframes the question of when a team should reach for a frontier API versus investing in its own stack.

Cliff Weitzman offered the most striking concrete figure. At Speechify, Weitzman told Harry Stebbings, the team brought inference cost down to single-digit dollars per million characters. Comparable tasks on other available models run anywhere from $30 to $100 per million characters. He described the gap as roughly two orders of magnitude. That is a figure that changes unit economics at scale, not just at the margin.
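
The size of that gap depends on where in each range the prices fall, and a quick calculation makes the spread visible. The sketch below is illustrative only: it reads "single-digit dollars" as $1 to $9 per million characters, which is an assumption, and the function name ratio_range is hypothetical.

```python
import math


def ratio_range(cheap_low, cheap_high, pricey_low, pricey_high):
    """Return the (smallest, largest) price ratio between two price ranges."""
    smallest = pricey_low / cheap_high   # cheapest alternative vs. priciest in-house
    largest = pricey_high / cheap_low    # priciest alternative vs. cheapest in-house
    return smallest, largest


# Dollars per million characters. The $30 to $100 range is from the article;
# the $1 to $9 reading of "single-digit dollars" is an assumption.
smallest, largest = ratio_range(1.0, 9.0, 30.0, 100.0)

print(f"gap: {smallest:.1f}x to {largest:.0f}x")
print(f"orders of magnitude: {math.log10(smallest):.2f} to {math.log10(largest):.2f}")
```

Under those assumptions the gap runs from about 3.3x to 100x, so the two-orders-of-magnitude description corresponds to the far end of the range: roughly $1 in-house against $100 elsewhere.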

“Next month it might be 60% because we've distilled a new open source model and fine-tuned it and gotten some feedback and, you know, we are convicted and we've just replaced a frontier model with this new in-house model.” Shiv Rao

Shiv Rao added an operational data point from Abridge, a medical AI company. Rao told Stebbings that next month as much as 60% of Abridge's outputs might come from an in-house model, built by distilling and fine-tuning an open-source model to replace a frontier model outright. The phrasing was telling: “we are convicted,” he said, describing a team that has tested the replacement, gathered feedback, and committed to the switch. That is a production decision, not a research posture.

The pattern across these three data points is consistent but not yet a settled case. Corbitt is speaking from the toolmaker’s perspective, where the incentive is to make fine-tuning sound tractable. Weitzman and Rao are operators who have built their own pipelines, which means their numbers reflect specific tasks, team capabilities, and investment in the process. The cost and latency figures they cite may not transfer easily to teams without comparable ML infrastructure or the volume to justify the upfront work.

Still, the direction of the signal matters. The usual argument for frontier models is that they remove the need to maintain a custom stack. The counterargument surfacing here is that at sufficient scale or quality sensitivity, the economics flip, and the cost of not fine-tuning starts to outweigh the cost of building. Weitzman’s two-order-of-magnitude gap is hard to argue past if the underlying quality holds.
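
The "economics flip" argument amounts to a break-even calculation: an upfront investment is recovered once per-unit savings accumulate past it. A minimal sketch follows, with every input hypothetical; none of the speakers disclosed their build costs, and the function name break_even_volume is invented for illustration.

```python
def break_even_volume(upfront_cost, frontier_price, custom_price):
    """Volume at which cumulative savings repay the upfront cost.

    Prices are per unit of volume (for example, per million characters).
    Returns None when the custom model is not cheaper per unit.
    """
    saving_per_unit = frontier_price - custom_price
    if saving_per_unit <= 0:
        return None
    return upfront_cost / saving_per_unit


# Hypothetical inputs, chosen only to show the shape of the trade-off:
# a $200,000 build, $30 per million characters on a frontier API,
# $5 per million characters in-house.
volume = break_even_volume(upfront_cost=200_000.0, frontier_price=30.0, custom_price=5.0)
print(volume)  # millions of characters needed to recoup the upfront spend
```

Below that volume the frontier API remains the cheaper option, which is why the claim is scoped to teams operating at scale. The sketch leaves out ongoing maintenance and retraining costs, which would push the break-even point higher.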

What is worth watching is where this framing travels next. All three speakers were practitioners describing live systems, not forecasters describing futures. If more operators surface similar figures in the coming months, the pattern sharpens into something closer to a working playbook. For now, the claim is that the frontier model is not the default rational choice at scale. That is a narrower claim than it might appear, but a more durable one.

❦

The Editor, for the readers of Signal Headquarters
