Custom fine-tuning has moved from optimization tactic to production default
The cost and latency gap between frontier models and custom fine-tuned alternatives has grown wide enough that serious practitioners are no longer treating fine-tuning as optional. The question is no longer whether to fine-tune, but how fast to move the frontier model out of the critical path entirely.
The efficiency case for custom fine-tuning used to be theoretical. It is now measurable, repeatable, and showing up in production numbers across companies at different stages and in different verticals.
Kyle Corbitt puts the cost improvement at “order of magnitude at least” on a per-token basis, and often more. On latency, he says custom fine-tuned models can reach roughly 30% of frontier model latency while delivering similar or higher quality. That pairing, dramatically lower cost and dramatically lower latency at no quality penalty, is not a marginal gain. It is a different operational regime.
Cliff Weitzman at Speechify has lived the cost side of that argument at scale. Speechify’s inference runs at single-digit dollars per million characters. Weitzman notes that comparable output from other models costs anywhere from $30 to $100 per million characters. Depending on where in those ranges the numbers fall, that is anywhere from a few times cheaper to as much as two orders of magnitude cheaper. For a product built on high-volume audio generation, the difference between those numbers is not a line item. It is a business model.
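The size of that gap depends on where in each range the real figures sit. The following is a minimal sketch of the arithmetic, not anything Speechify has published: it assumes "single-digit dollars" means anywhere from $1 to $9 per million characters, and the function name `cost_ratio_range` is made up for illustration.

```python
import math


def cost_ratio_range(own_low, own_high, other_low, other_high):
    """Return the (smallest, largest) ratio of other-model cost to own cost."""
    # Smallest gap: the cheapest competitor against the most expensive own cost.
    smallest = other_low / own_high
    # Largest gap: the priciest competitor against the cheapest own cost.
    largest = other_high / own_low
    return smallest, largest


# Assumption: "single-digit dollars" is read as $1 to $9 per million characters.
# The $30 to $100 range is the one quoted in the article.
low, high = cost_ratio_range(1.0, 9.0, 30.0, 100.0)

print(f"{low:.1f}x to {high:.0f}x cheaper")
print(f"{math.log10(low):.2f} to {math.log10(high):.2f} orders of magnitude")
```

Under those assumptions the advantage spans roughly 3x to 100x, which is why the exact position within "single-digit dollars" matters so much to the headline claim.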
Shiv Rao at Abridge offers a different kind of evidence: directional momentum inside a single company. Currently, 40% of Abridge’s model outputs come from in-house models. Rao expects that figure to reach 60% next month, driven by a distilled and fine-tuned open-source model that has already replaced a frontier model in production. The pace matters here as much as the percentage. A 20-point shift in a single month is not gradual adoption. It is a team that has found the method, confirmed the quality, and is now moving as fast as the pipeline allows.
“We’re talking order of magnitude at least improvement in cost per token, oftentimes more than that.” Kyle Corbitt
Tuhin Srivastava at Baseten frames the pattern from the infrastructure side. His observation is blunt: no Baseten customer runs vanilla open-source weights without modification. That is not a statement about a minority of sophisticated users pushing the boundary. It is a claim about the entire customer base. If it holds, custom fine-tuning is no longer an advanced technique. It is the baseline expectation for anyone deploying models seriously.
What ties these four accounts together is not that they agree on a number. They don’t. Corbitt and Weitzman are describing different kinds of cost gains, in different units, for different product categories. Rao is describing an internal transition, not a benchmark. Srivastava is describing a behavioral norm across a customer base. What they share is the direction: frontier models are being pushed out of the highest-volume, most latency-sensitive, most cost-sensitive parts of production stacks, and custom fine-tuned alternatives are taking their place.
The practical implication is worth stating plainly. Frontier model pricing is, in part, a tax on teams that have not yet done the fine-tuning work. For products with stable, well-defined tasks and high inference volume, continuing to route everything through a frontier model is increasingly a choice, not a constraint. The companies that have done the distillation work are not just saving money. They are running at a structurally different cost base than competitors who have not, and the gap compounds with scale.
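How that gap grows with volume can be sketched with a simple blended-cost calculation. Everything numeric here is hypothetical: the prices are placeholders chosen only so their 10x ratio echoes the "order of magnitude" figure, the 60% share is borrowed loosely from the Abridge example and treated as a share of token volume, and `monthly_cost` is an illustrative helper rather than any vendor's pricing formula.

```python
def monthly_cost(tokens_millions, frontier_price, tuned_price, tuned_share):
    """Blended monthly spend when `tuned_share` of traffic uses the fine-tuned model."""
    if not 0.0 <= tuned_share <= 1.0:
        raise ValueError("tuned_share must be between 0 and 1")
    frontier_tokens = tokens_millions * (1.0 - tuned_share)
    tuned_tokens = tokens_millions * tuned_share
    return frontier_tokens * frontier_price + tuned_tokens * tuned_price


# Hypothetical prices in dollars per million tokens; only the 10x ratio
# reflects the article, not the absolute values.
FRONTIER_PRICE = 10.0
TUNED_PRICE = 1.0

for volume in (100, 1_000, 10_000):  # millions of tokens per month
    all_frontier = monthly_cost(volume, FRONTIER_PRICE, TUNED_PRICE, 0.0)
    mostly_tuned = monthly_cost(volume, FRONTIER_PRICE, TUNED_PRICE, 0.6)
    gap = all_frontier - mostly_tuned
    print(f"{volume:>6}M tokens: ${all_frontier:,.0f} vs ${mostly_tuned:,.0f} (gap ${gap:,.0f})")
```

In this simplified model the percentage saving stays constant while the absolute dollar gap grows in proportion to volume, which is one way to read the claim that the advantage widens with scale. It ignores the fixed cost of building and maintaining the fine-tuned model, which a real comparison would need to include.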
What to watch is whether the distillation and fine-tuning workflow becomes as standardized as containerization eventually did. Right now, the teams doing this well are doing it through hard-won tooling and internal expertise. If the process commoditizes, the 10x cost advantage narrows as everyone captures it. Until then, the practitioners who have already made the move hold a quiet but durable edge.