AI coding agents still require human review, and the evidence for skipping it is getting worse

The case for removing humans from the AI coding loop keeps running into the same wall: plausible-looking output and shippable output are not the same thing. Practitioners building with these tools every day are converging on the same structural conclusion.

By The Editor

Walden Yan puts the decay horizon on a clock: approximately two weeks. That is how long an unreviewed AI coding workflow can run before the codebase becomes unmanageable. By the end of those two weeks, he says, a task as simple as changing a button’s color breaks down because the button is implemented in ten different places. The granular example is worth sitting with. It is not a vague warning about technical debt. It is a specific failure mode that Yan reports encountering, and it has a mechanism: an AI coding agent, given no human check on its decisions, reaches for whatever patterns are already in the codebase and repeats them. Yan describes this as regression to the worst engineer, because the person who is most aggressive about using AI without auditing their output is the one whose patterns get cemented fastest and then amplified across every subsequent AI-generated change.

The review gap shows up in benchmark numbers as well. Swyx cites a METR blog post finding that roughly 50% of the code that passes SWE-bench's own tests is completely unmergeable into a real production codebase. Passing the test and being shippable are not the same thing, and the gap between them is exactly where human judgment lives. That gap does not announce itself in the output. AI-generated code can look structurally sound while concealing edge-case and security problems that only deliberate scrutiny will surface.

Security is where the stakes of skipping that judgment are most visible. Aaron Levie reports a case in which AI built 80 to 90% of a feature, and the release was held up not by development time but by the requirement for a full security review. The concern was accidental code injection. That is a familiar class of vulnerability, and it is the kind of thing that looks fine at a glance and surfaces only under deliberate scrutiny. The bottleneck shifted from writing to reviewing, and Levie’s account suggests that shift is not temporary: it is baked into the way AI-assisted development currently works.

"The idea of like you don't have to look at code I think is generally a bad idea."
Walden Yan

The same pressure appears in open-source maintenance. Jean-Baptiste Kempf describes the position of curl maintainer Daniel Stenberg, who actively opposes AI-generated submissions on the grounds that they flood maintainers with what Stenberg calls AI slop: fake or low-quality reports and patches that increase the burden on the people responsible for keeping the software functional. The problem is not that AI cannot produce useful patches. It is that the volume and surface plausibility of bad submissions have increased faster than any maintainer’s capacity to filter them.

Cat Wu adds a temporal dimension to the review question. Reliable multi-agent code review, in her account, only became feasible with what she describes as Opus 4.5, Opus 4.6, and Sonnet 4.6. Before those models, running multiple code review agents simultaneously across an entire codebase and synthesizing a coherent set of real issues was not something teams could depend on. That is a meaningful constraint on how far automated review can substitute for human judgment, and it suggests the tooling is still catching up to the workflow demands practitioners are placing on it. Walden Yan makes the parallel point from a different angle: no single frontier model can perform full end-to-end testing of arbitrary code changes. Orchestrating multiple models together is what it takes to approximate coverage, and even that does not eliminate the need for a human at the merge gate.

Tony Fadell offers a data point that is harder to dismiss than any benchmark. Claude’s own source code, written by Claude and later leaked, was examined by professional software architects and engineers, who described it as brittle. That is a judgment rendered by practitioners with no agenda beyond describing what they saw. Nathan Labenz supplies a complementary figure: practitioners reported a 2x productivity boost from AI, but said their productivity would drop to near zero if the human were removed from the loop entirely. The multiplier is real. So is its dependence on the person doing the reviewing.

The pattern these data points describe is not a temporary limitation waiting on a better model. It is a structural feature of what AI coding tools are and how they fail. They produce plausible-looking output at speed. Plausible and correct are different things, and the difference between them is not always visible to the tool that generated the code. That gap is what human review exists to close, and the evidence from practitioners building with these tools every day suggests that gap has not narrowed to the point where closing it can be optional.

❦

The Editor, for the readers of Signal Headquarters
