RL for LLMs works without per-token credit assignment, and nobody fully understands why

Modern reinforcement learning for language models treats tens of thousands of tokens as a single action, discards fine-grained credit assignment entirely, and somehow still works. The gap between theory and practice here is wider than the field likes to admit.

By The Editor

This week, a pointed observation about reinforcement learning and language models surfaced independently in two places. Eric Jang, speaking with Dwarkesh Patel, and Kyle Corbitt, on the Cognitive Revolution, each arrived at the same structural point: the RL methods currently producing state-of-the-art reasoning models are, at their core, theoretically strange. The observation deserves more attention than it has received.

The specific strangeness is this. Classical reinforcement learning assigns credit at the level of individual decisions. An agent takes an action, receives a signal, and over many steps the training machinery learns which actions in which states actually mattered. In the LLM setting, a “decision” is a token, and a reasoning trace can span tens of thousands of them. Jang states the structure flatly: current LLM RL treats the entire output sequence as a single action, with a time horizon of one step. The whole generation, from first token to last, is treated as one indivisible choice.
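One way to write down the one-step framing is as a plain policy-gradient estimate over whole sequences. This is a sketch under assumed notation, not something either speaker wrote out: x is the prompt, y is the full generation of T tokens, R(x, y) is the single score it receives, and \pi_\theta is the model.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big],
\qquad
\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})
```

Because the sequence log-probability is a sum over tokens, every token's term is multiplied by the same scalar R(x, y). Nothing in the estimate distinguishes the token that mattered from the thousands that did not.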

That framing alone is striking. Corbitt extends it into the specific mechanics of GRPO, the group-relative policy optimization method now widely used in post-training. As he describes it, GRPO scores a full generation and then upvotes the tokens within it that were statistically rare. The intuition is that rare tokens represent meaningful decisions rather than rote predictions. Corbitt then poses the follow-on question directly: across tens of thousands of tokens in a reasoning trace, which rare token was actually important? GRPO does not try to determine this. When the overall score is high, all rare tokens get upvoted equally.
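The sequence-level scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular GRPO implementation: the function names, the use of a population standard deviation, the `eps` constant, and the example rewards and probabilities are all made up for the example. The second helper shows one possible reading of the rare-token intuition: for a softmax policy, the gradient of a sampled token's log-probability with respect to its own logit is 1 - p, so low-probability tokens receive a larger push from the same shared advantage.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per completion: (reward - group mean) / group std.

    Assumption: population standard deviation; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_push(advantage, token_probs):
    """Broadcast a single sequence-level advantage to every sampled token.

    For a softmax policy, d log p(token) / d logit(token) = 1 - p(token),
    so the push on a token's own logit scales with how unlikely it was.
    """
    return [advantage * (1.0 - p) for p in token_probs]


# Four sampled completions for one prompt, scored 1 (correct) or 0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical probabilities of three sampled tokens in the first completion:
# a near-certain token, a rare token, and a moderately likely one.
pushes = per_token_push(advantages[0], [0.99, 0.05, 0.60])
```

Every token in the first completion shares the same advantage. The rare token gets the largest push, but only because it was rare, not because anything identified it as the decision that earned the reward.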

“And I think that's one of the reasons why there was like an almost 10-year gap between PPO that had this value model that tried to, you know, determine on a token by token basis and like GRPO where it's like, hey, we're just going to throw that all away cuz it feels wrong.”
Kyle Corbitt

This is the part that sits uneasily. Careful per-token credit assignment, which motivated PPO’s value model and the significant engineering overhead that came with it, rests on the assumption that token-level signals matter: rewarding the right tokens and not the wrong ones is what makes training effective. GRPO discards that machinery entirely. Corbitt puts the distance between PPO and GRPO at almost ten years (PPO was published in 2017 and GRPO in 2024, so the gap is closer to seven) and attributes it partly to how wrong the simpler approach seemed on paper. Something that ignores credit assignment at the token level should, by the classical logic, produce inferior models. In practice it does not.
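For contrast, the per-token machinery that GRPO drops can be sketched as well. The example below uses generalized advantage estimation (GAE), an estimator commonly paired with PPO, over a four-token sequence. It is a sketch under assumptions: the value estimates are invented numbers standing in for a learned value model's output, and the reward arrives only after the last token.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one token sequence.

    rewards[t] is the reward received after token t (often zero until the end).
    values[t] is the value model's estimate just before token t is emitted.
    The value after the final token is taken to be 0 (the episode ends).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error at token t
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical four-token generation scored 1.0 at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
# Made-up value estimates: the second token lifts the estimate from 0.2 to 0.8.
values = [0.2, 0.2, 0.8, 0.9]

# lam=0 reduces to one-step TD errors, isolating each token's contribution.
td_credit = gae_advantages(rewards, values, gamma=1.0, lam=0.0)
# lam=1 reduces to (final return - value estimate) at each position.
mc_credit = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
```

With `lam=0.0`, most of the credit lands on the second token, the one after which the value estimate jumped. A sequence-level method would instead hand all four tokens the same number. The price of the finer signal is training and serving a second model to produce `values`, and the signal is only as good as that model's estimates.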

What the evidence does not settle is why. Neither speaker offers a clean theoretical account. The possibilities that present themselves are not mutually exclusive. One is that token-level credit assignment was always less important than assumed, and that the signal from full-sequence scoring is rich enough to do the necessary work. Another is that the scale of modern training data and parameter counts compensates for what the optimizer cannot explicitly assign. A third is that the current results, while good, are leaving real capability on the table, and that a theoretically coherent per-token RL method would outperform GRPO significantly if anyone built it correctly.

The gap Corbitt describes is worth sitting with. On his account, it was not only a matter of compute limits or data availability. It was, at least in part, a gap caused by the intuition that the simpler approach could not possibly work. The community held onto a more expensive and theoretically tidy method because the cheap alternative seemed wrong. Then someone tried the cheap alternative, and the results were competitive. That is the kind of moment that tends to be over-read as “the theory was always wrong” when the more accurate reading is “the theory was incomplete.”

Two practitioners surfacing the same structural critique in the same week is not a settled verdict. It is an early sign that people who train these models are beginning to name an awkwardness they have been working around. Whether theory needs to catch up, GRPO is a local optimum with a ceiling, or sequence-level rewards are genuinely sufficient remains an open question. The fact that it is open, and that it is starting to be asked in public, is the signal worth tracking.

❦

The Editor, for the readers of Signal Headquarters
