The training signal itself, not just scale, determines what a model can become

Modern AI training has shifted from predicting the next token toward learning from right/wrong answer signals via reinforcement learning. That change is not cosmetic: it governs which capabilities emerge, how much training disrupts existing weights, and how compute budgets should be allocated.

By The Editor

Nathan Labenz puts the core change plainly: modern AI models are no longer trained on next-token prediction in the way they once were. The task the model now receives is not “here is text, predict what comes next” but rather “did you get the right answer.” That shift, from a dense prediction signal applied at every token to a sparse right/wrong signal applied at the end of a sequence, is the structural change that recent technical work keeps returning to.
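The contrast between the two signals can be made concrete with a toy sketch. It is an illustration under stated assumptions, not any lab's training code: the function names, the probabilities, and the exact-match answer check are all hypothetical.

```python
import math


def next_token_loss(token_probs):
    """Dense signal: one negative log-likelihood term per token.

    token_probs holds the probability the model assigned to each
    reference token (hypothetical values in this sketch).
    """
    per_token = [-math.log(p) for p in token_probs]
    return sum(per_token) / len(per_token), per_token


def outcome_reward(model_answer, reference_answer):
    """Sparse signal: a single scalar for the whole sequence.

    Exact string match is an assumption; real graders vary.
    """
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


# Four tokens yield four separate loss terms to learn from.
probs = [0.9, 0.5, 0.25, 0.8]
loss, per_token = next_token_loss(probs)

# The same four-token output yields exactly one number under outcome reward.
reward = outcome_reward(" 42 ", "42")
```

The point of the sketch is only the shape of the feedback: the first function returns as many terms as there are tokens, the second returns one value no matter how long the sequence is.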

The weight-change comparison Kyle Corbitt draws between supervised fine-tuning and reinforcement learning is where the practical stakes become concrete. SFT, even with very few examples and a very low learning rate, produces much larger average differences in model weights than RL does. The reason is structural: when fine-tuning a smaller model on outputs distilled from a larger one, especially when the two models had different pre-training distributions, backpropagation receives a signal that says every token in the sequence needs to change. Some of those tokens the smaller model would have gotten right on its own. RL avoids this by only applying pressure where the model is actually wrong, minimizing the number of log-probability adjustments needed to reach a correct answer and leaving the rest of the model’s learned inclinations intact.
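One way to measure an “average difference in model weights” is the mean absolute change per parameter between two checkpoints. The sketch below assumes that metric; the source does not say which one Corbitt used. The checkpoint dictionaries and their numbers are invented purely to exercise the function and say nothing about real models.

```python
def mean_abs_weight_change(before, after):
    """Mean absolute per-parameter difference between two checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats (a stand-in for real tensors in this sketch).
    """
    total = 0.0
    count = 0
    for name, old_values in before.items():
        new_values = after[name]
        for old, new in zip(old_values, new_values):
            total += abs(new - old)
            count += 1
    return total / count


# Toy checkpoints with made-up values, not measurements.
base = {"layer.w": [0.10, -0.20, 0.30], "layer.b": [0.00]}
sft_like = {"layer.w": [0.14, -0.26, 0.35], "layer.b": [0.03]}  # every value moved
rl_like = {"layer.w": [0.10, -0.21, 0.30], "layer.b": [0.00]}   # one value moved

sft_change = mean_abs_weight_change(base, sft_like)
rl_change = mean_abs_weight_change(base, rl_like)
```

In practice the same loop would run over a framework's parameter tensors, and a per-layer breakdown would likely be more informative than a single global average.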

This structural difference matters beyond weight stability. Cameron Berg describes research finding that introspective awareness capabilities in language models emerge during post-training, and specifically under RL-style algorithms, but not under supervised fine-tuning. The effect appears across different RL algorithms, and the split follows the training method: RL induces it, SFT does not. That is not a marginal performance gap. It suggests the training signal itself is shaping model internals in ways that the supervised approach does not appear to reach.

“If you're doing SFT, it's just like even with very few examples, and even if a very very low learning rate, like it's just like throwing the weights all to pieces, and like the average differences are so much larger than doing RL.” Kyle Corbitt

The token-level credit assignment question runs beneath all of this. Eric Jang notes that current LLM reinforcement learning treats the entire output sequence as a single action, assigning reward to the whole rather than to individual steps. This is a simplification relative to classical RL, which would ideally assign credit token by token across a multi-step process. Corbitt addresses why the field landed here anyway: GRPO, the algorithm that discards token-level value modeling entirely, feels theoretically wrong. The expectation was that careful per-token credit assignment, as in PPO, was essential. The roughly seven-year gap between PPO (2017) and GRPO (2024) reflects how long that assumption held. In practice, GRPO works. The field does not yet have a satisfying explanation for why.
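The sequence-as-one-action idea can be sketched in a few lines. What follows is a simplified illustration of group-relative advantages as GRPO is commonly described, not a full implementation: it omits the policy-gradient loss, clipping, and any KL term. The function names, the `eps` constant, and the use of population standard deviation are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled completion relative to its own group.

    rewards holds one scalar per completion sampled for the same prompt.
    No learned value model is involved: the group mean is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Give every token in a completion the same sequence-level advantage."""
    return [advantage] * num_tokens


# Four completions for one prompt, graded right (1.0) or wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A five-token completion: all five tokens share one credit value.
token_advantages = broadcast_to_tokens(advantages[0], 5)
```

Note what is missing: nothing here estimates which token made the answer right or wrong. Every token in a correct completion is pushed up equally and every token in an incorrect one is pushed down equally, which is the simplification the paragraph above describes.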

Reiner Pope adds a compute-allocation frame that puts RL in a different perspective. His argument is that optimal allocation results in roughly equal token counts across pre-training, RL training, and inference. If that holds at scale, RL is not a post-processing step applied to a model trained the “real” way. It is a training regime that belongs in the same budget conversation as pre-training itself, with comparable data volumes and comparable compute weight.

Matei Zaharia offers the most operationally striking data point. He describes pipelines built entirely on open-source models, where the same model generates its own training environments and trains itself, beating frontier models at specific tasks. That result is worth reading carefully: it is not a claim about general capability, but about what RL-based self-improvement can achieve at a targeted task when the training loop is closed. The implication is that the advantages of RL post-training are accessible without proprietary infrastructure, which changes the competitive calculus for those watching the gap between open and closed models narrow.

Taken together, the evidence points toward a training paradigm where the choice of signal type, not just scale or architecture, determines what a model can become. The distinction between RL and SFT is not merely a practitioner’s preference. It appears to govern which capabilities emerge, how much disruption the training causes to existing weights, and how efficiently compute gets used. Those are foundational questions, and the answers the field is arriving at were not obvious even a few years ago.

❦

The Editor, for the readers of Signal Headquarters
