Local AI models are already good enough for most use cases, and the gap with frontier is closing fast
A gaming GPU or a decent Mac now covers roughly 80% of what most people use cloud AI for, and the economics of inference are compounding in ways that make the remaining gap a question of months rather than years.
Greg Isenberg's threshold is simple and useful: a model running on a gaming GPU or a decent Mac is good enough for about 80% of what most people use cloud services like ChatGPT or Claude for. That is not an offhand impression. It is a working estimate from someone who runs a local agent every 20 minutes at zero marginal cost, because local inference carries no per-call charge. The remaining 20% is real, but it no longer defines the mainstream.
The economics behind this shift are compounding from several directions at once. Marc Andreessen puts the cost of distilling an existing model at roughly 2% of the original pre-training cost. Dylan Patel notes that DeepSeek achieved GPT-4-level performance at 1/600th the cost of GPT-4. Harry Stebbings calculates the combined effect of hardware and software improvements at roughly a 10x gain in tokens per unit of money every couple of years, driven by roughly 3x improvements in raw chip performance every 18 months plus another 3x from quantization and related optimizations. Nikesh Arora’s long-term view is that token pricing will fall to one-tenth of current levels. These are not the same claim, but they point in the same direction with consistent force.
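The compounding in the Stebbings figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: it takes the two 3x figures quoted above at face value, and it assumes (the text does not say so) that the software gain lands over the same 18-month period as the chip gain and that improvement compounds smoothly. The function names are made up for this example.

```python
import math

# Figures quoted in the text (attributed estimates, not measurements).
CHIP_GAIN = 3.0        # ~3x raw chip performance per period
SOFTWARE_GAIN = 3.0    # ~3x from quantization and related optimizations
PERIOD_MONTHS = 18.0   # period attached to the chip figure

# Assumption: both gains apply over the same 18-month period, so they multiply.
COMBINED_GAIN = CHIP_GAIN * SOFTWARE_GAIN  # 9x, i.e. "roughly 10x"


def gain_after(months: float) -> float:
    """Tokens-per-dollar multiplier after `months`, assuming smooth compounding."""
    return COMBINED_GAIN ** (months / PERIOD_MONTHS)


def months_until(target_gain: float) -> float:
    """Months needed for tokens-per-dollar to improve by `target_gain`."""
    return PERIOD_MONTHS * math.log(target_gain) / math.log(COMBINED_GAIN)


if __name__ == "__main__":
    print(f"combined gain per period: {COMBINED_GAIN:.0f}x")
    print(f"months until a 10x gain:  {months_until(10):.1f}")
```

Under these assumptions, a 10x gain in tokens per dollar, which is the same thing as token pricing falling to one-tenth, arrives in a little under 19 months. The result is sensitive to the period chosen, so it is best read as a consistency check on the quoted figures rather than a forecast.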
The 18-month horizon appears independently from two different vantage points. Jesse Genet predicts that local models will reach Claude Opus-level capability within that window. Joseph Nelson observes, from the deployment side, that there is roughly an 18-month lag between a capability appearing in a multimodal cloud model and being able to run that same capability on an edge device. The two framings describe opposite ends of the same pipeline: one is about when the capability arrives locally, the other is about how long it takes to get there after the frontier establishes it.
"I think that majority of inference in 5 years is going to be using an open-source or custom fine-tuned or distilled model, not using a frontier [model]." (Brendan Foody)
Practitioners are not waiting for parity to make routing decisions. Swyx describes a setup where a self-hosted 800-million-parameter model handles field-of-study classification, keeping that query away from a frontier API entirely, on the grounds that it does not need to be a one-second latency call to a large model. On small classification tasks, swyx adds, a fine-tuned 1-billion-parameter model can recover approximately 95% of a frontier model’s performance. The point is not that small models beat large ones. The point is that for a large share of production queries, they do not need to.
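The routing pattern described above can be sketched in a few lines. This is a minimal illustration, not swyx's actual setup: the task label, the `LOCAL_TASKS` set, and both handler functions are hypothetical placeholders standing in for a self-hosted small model and a frontier API client.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for real model calls. A real system would invoke a
# self-hosted small model and a hosted frontier API here.
def local_small_model(text: str) -> str:
    return f"local:{text}"


def frontier_api(text: str) -> str:
    return f"frontier:{text}"


# Assumption: tasks are labeled up front, and narrow classification tasks
# are the ones a small fine-tuned model handles well enough.
LOCAL_TASKS = {"classify_field_of_study"}

HANDLERS: Dict[str, Callable[[str], str]] = {
    "local": local_small_model,
    "frontier": frontier_api,
}


def pick_backend(task: str) -> str:
    """Return 'local' for tasks the small model covers, else 'frontier'."""
    return "local" if task in LOCAL_TASKS else "frontier"


def route(task: str, text: str) -> str:
    """Send the request to whichever backend the task label selects."""
    return HANDLERS[pick_backend(task)](text)


if __name__ == "__main__":
    print(route("classify_field_of_study", "A paper on protein folding"))
    print(route("draft_legal_memo", "Summarize the indemnity clause"))
```

The design choice is that the frontier model becomes the fallback rather than the default: anything not explicitly listed as safe for the small model still goes to the larger one.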
The cost argument will eventually reach consumers directly, not just developers. Genet frames this with a concrete projection: when AI costs for a family reach around $400 per month, there will be a meaningful uptick in adoption of local models driven by cost rather than privacy principles. That is a different kind of adoption pressure from the one usually discussed. It does not require ideological commitment to open source. It requires only a bill that gets large enough to notice.
Matei Zaharia adds a finding that complicates the usual frontier-versus-open framing. Self-training pipelines built entirely from open-source models, where the same model generates training environments and trains itself, can beat models like Opus and what Zaharia called “GPT 5.5” at specific tasks. The version nomenclature in that claim is worth flagging: Zaharia’s phrasing may reflect transcript noise, and the specific model names should be read as his characterization rather than verified product labels. The underlying point, that open-source self-improvement pipelines can reach or exceed frontier performance on targeted tasks, is the claim that matters.
Brendan Foody’s five-year view is the most sweeping: the majority of inference will run on open-source, custom fine-tuned, or distilled models rather than frontier ones. That claim is directional rather than precise, and it will take years to confirm. But the near-term data points are consistent with the direction. The gap between what a consumer device can run and what a data center can run is narrowing on a measurable schedule, the cost of closing that gap is falling faster than most institutional cost models assume, and the practitioners closest to the inference layer are already routing around frontier models wherever the task allows. The infrastructure built around the assumption that frontier access is the default is not wrong yet. It is simply describing a condition that is expiring.