Custom and fine-tuned models are already displacing frontier models where it counts most

The economics of inference are shifting beneath the frontier model providers. Fine-tuned, distilled, and open-source models are matching or beating frontier quality at a fraction of the cost and latency, and real deployments are already reflecting that math.

By The Editor

The cost and latency case against defaulting to frontier models is no longer theoretical. Kyle Corbitt puts the efficiency numbers plainly: fine-tuned models can bring latency down to roughly 30% of what frontier models deliver, with quality that is similar or, in most cases, higher. On cost per token, he describes the improvement as at least an order of magnitude, often more. Those figures alone would warrant attention. What makes them significant is that they are showing up in production systems, not just benchmarks.

Cliff Weitzman reports that Speechify’s inference cost sits at single-digit dollars per million characters. The comparable cost on other available models runs from $30 to $100 per million characters. That is roughly an order of magnitude cheaper, approaching two at the extremes of those ranges, and Speechify arrived there through its own model work rather than by waiting for frontier providers to lower prices. The gap is not an artifact of a niche use case. It reflects what happens when a team optimizes a model for a specific, high-volume task rather than paying for general-purpose capability it does not need.
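The size of that gap depends on where in each range the prices actually fall. A quick sketch of the arithmetic, assuming "single-digit dollars" means anywhere from $1 to $9 per million characters (the source gives no exact figure, so both bounds are illustrative):

```python
def cost_ratio_range(own_low, own_high, other_low, other_high):
    """Return the (smallest, largest) ratio of other-model cost to own-model cost."""
    smallest = other_low / own_high   # cheapest alternative vs. priciest own cost
    largest = other_high / own_low    # priciest alternative vs. cheapest own cost
    return smallest, largest

# Assumed bounds: $1-$9 for the in-house model, $30-$100 for alternatives (from the text).
low, high = cost_ratio_range(1.0, 9.0, 30.0, 100.0)
print(f"Alternatives cost between {low:.1f}x and {high:.0f}x as much")
```

Under those assumed bounds the multiple runs from a little over 3x to 100x, so the precise in-house figure matters a great deal to how the comparison reads.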

The cost of getting to a custom model is also lower than the headline numbers around frontier pre-training suggest. Marc Andreessen estimates that distilling a model costs approximately 2% of the original pre-training cost. Dylan Patel supplies a striking external data point: DeepSeek achieved GPT-4-level performance at 1/600th the cost of GPT-4. Together, those figures reframe the make-versus-rent calculation. For teams running stable, high-volume workloads, the upfront investment in distillation or fine-tuning can recover itself quickly against ongoing API spend.
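The make-versus-rent calculation can be sketched as a simple break-even estimate. Every number below is hypothetical and chosen only for illustration. None comes from the people quoted here, and the sketch ignores hosting overhead, engineering time, and retraining costs:

```python
def breakeven_months(upfront_cost, api_cost_per_m, custom_cost_per_m, monthly_volume_m):
    """Months until cumulative savings cover the upfront cost.

    Costs are dollars per million tokens; volume is millions of tokens per month.
    Returns None when the custom model is not cheaper per token.
    """
    monthly_savings = (api_cost_per_m - custom_cost_per_m) * monthly_volume_m
    if monthly_savings <= 0:
        return None
    return upfront_cost / monthly_savings

# Hypothetical workload: $50k to distill and fine-tune, a 10x cheaper cost per token,
# and 2 billion tokens per month of stable traffic.
months = breakeven_months(
    upfront_cost=50_000.0,
    api_cost_per_m=10.0,
    custom_cost_per_m=1.0,
    monthly_volume_m=2_000.0,
)
print(f"Break-even after about {months:.1f} months")
```

With these assumed inputs the investment pays back in under three months. At a tenth of the volume the same project would take more than two years, which is why the argument applies to stable, high-volume workloads.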

“We can typically get latency down to about 30% of what you get from using a frontier model, with, again, similar or usually higher quality than what you were getting from the frontier model.” Kyle Corbitt

Adoption is already moving. Shiv Rao reports that 40% of Abridge’s model outputs currently come from in-house models, with that share expected to reach 60% next month after the team distilled and fine-tuned an open-source replacement for a frontier model. Tuhin Srivastava observes that no Baseten customer is running unmodified open-source weights. The pattern those data points describe is not a future migration. It is an ongoing one, with meaningful production share already shifted away from frontier APIs.

Latency is not a secondary concern in this calculus. Tulsee Doshi describes what happens when a model’s quality improves but its latency worsens: in live experiments, the slower model loses. The reason is structural. Asking users to wait carries a cost that quality gains cannot always offset. That dynamic makes the latency advantage of custom models more than a cost story. It is a user experience story, and the two reinforce each other.

There are real friction points. Yasser Elsaid notes that switching costs between models are not zero. Tuning how a product should behave around a specific model can take three to four months. That investment is recoverable, but it represents a meaningful commitment, and it means that the choice of which model to build around has consequences that extend well past the initial deployment decision. The switching cost also implies that teams moving to in-house models are making a structural bet, not a reversible API swap.

Brendan Foody’s view of where this trajectory leads is direct: the majority of inference in five years will use open-source, custom fine-tuned, or distilled models rather than frontier ones. Public analysis from outlets covering AI infrastructure costs has reached similar conclusions, noting that fine-tuned models on stable, high-volume workloads can undercut frontier API spend by an order of magnitude while matching accuracy. The conditions that make frontier models the default choice (convenience, general capability, and the absence of a better-optimized alternative) weaken as teams accumulate domain-specific data and as the cost of acting on that data continues to fall. The economics are not pointing toward frontier consolidation. They are pointing the other way.

❦

The Editor, for the readers of Signal Headquarters
