The harness matters more than the model

Practitioners building deployed AI agents are pointing at the same bottleneck: it is not the model that constrains performance, it is the orchestration layer wrapped around it. That framing deserves attention before it becomes conventional wisdom.

By The Editor

Yasser Elsaid put the figure plainly: 95% of the limitation in customer service AI agents comes from the harness, meaning the orchestration and integration layer, not the underlying model. Andrew Lee arrived at a compatible observation from a different angle, noting that the capability gap between the best available harness and a vanilla harness has narrowed over the past year. Neither speaker was debating foundational AI research. Both were talking about what constrains deployed systems in production today.

The word “harness” is doing a lot of work in both accounts, and it is worth pausing on what it means in practice. A harness is not the model itself. It is everything wrapped around the model: the prompt architecture, the context management, the tool integrations, the decision logic that routes between steps, the memory systems that carry state across turns, and the scaffolding that determines when to call external systems and what to do with the results. For a customer service agent, that list is long. Each element is a potential failure point that has nothing to do with whether the underlying language model can reason well.
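To make that list concrete, here is a minimal sketch of a harness, not any vendor's implementation. Every name in it (`stub_model`, `lookup_order`, `Harness`, the `CALL` and `TOOL_RESULT` conventions) is hypothetical and invented for illustration. The model is a hard-coded stand-in, so everything that does real work in the example is harness code: prompt assembly, memory, tool routing, and the step limit.

```python
from dataclasses import dataclass, field
from typing import Callable


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; a real harness would call an API here."""
    if "order status" in prompt.lower() and "TOOL_RESULT" not in prompt:
        return "CALL lookup_order 1234"
    return "Your order 1234 has shipped."


def lookup_order(order_id: str) -> str:
    """Hypothetical tool integration; a real one would query an order system."""
    return f"order {order_id}: shipped"


@dataclass
class Harness:
    model: Callable[[str], str]
    tools: dict[str, Callable[[str], str]]
    memory: list[str] = field(default_factory=list)  # state carried across turns
    max_steps: int = 3  # guard against endless tool loops

    def build_prompt(self, user_message: str, tool_results: list[str]) -> str:
        # Prompt architecture and context management: keep only recent history.
        history = "\n".join(self.memory[-4:])
        results = "\n".join(f"TOOL_RESULT {r}" for r in tool_results)
        return f"You are a support agent.\n{history}\nUser: {user_message}\n{results}"

    def run(self, user_message: str) -> str:
        tool_results: list[str] = []
        reply = "Sorry, I could not complete that request."
        for _ in range(self.max_steps):
            output = self.model(self.build_prompt(user_message, tool_results))
            if output.startswith("CALL "):
                # Decision logic: route the request to a tool, then ask the model again.
                parts = output.split(maxsplit=2)
                name = parts[1] if len(parts) > 1 else ""
                arg = parts[2] if len(parts) > 2 else ""
                tool = self.tools.get(name)
                tool_results.append(tool(arg) if tool else f"unknown tool {name}")
                continue
            reply = output
            break
        self.memory.append(f"User: {user_message}")
        self.memory.append(f"Agent: {reply}")
        return reply


harness = Harness(model=stub_model, tools={"lookup_order": lookup_order})
print(harness.run("What is my order status?"))
```

Swapping `stub_model` for a stronger model changes one line. A missing tool, a prompt that drops the tool result, or a step limit set too low would break the exchange no matter which model sits behind it.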

Elsaid’s 95% figure is a practitioner’s number, not a controlled study. It comes from direct experience building and operating deployed agents, and it should be read as such. What makes it signal-worthy is not its precision but its direction. Someone with direct operational exposure to where agents fail in production is pointing away from model capability and toward the infrastructure surrounding it. That directional claim is the thing worth tracking, not the specific percentage.

Lee’s contribution adds a temporal texture that Elsaid’s claim alone does not carry. The gap between a well-engineered harness and a vanilla harness has narrowed over the past year, Lee observes. This is not a statement that harness quality no longer matters. It is a statement that the field is maturing, that orchestration patterns are being codified, shared, and replicated faster than they were a year ago. The implication is that teams still running on minimal scaffolding are losing ground to an ecosystem that increasingly builds that scaffolding in by default.

Yasser Elsaid: “95% of the limitation is not from the model. It’s from the harness.”

Read together, the two claims sketch a picture of a capability distribution that looks different from the one most AI coverage implies. The dominant narrative treats model releases as the primary driver of what deployed AI systems can do. Benchmark scores, parameter counts, and reasoning evaluations get the attention. But if Elsaid is right about where the limiting factor actually sits, then a team shipping a well-engineered harness on a mid-tier model may consistently outperform a team running a frontier model on thin scaffolding. That is a competitive claim with real consequences for how engineering effort gets allocated.

It also reframes what “waiting for the next model” means as a strategy. Teams that treat model improvement as the primary lever for closing the gap between their current agent performance and their target are, if this framing holds, optimizing for the wrong variable. The next model release will shift some things. But if the harness accounts for the bulk of the constraint, then a model upgrade without a harness upgrade leaves most of the gap intact. The practical implication is that engineering investment in orchestration infrastructure is not a nice-to-have that follows model selection; it may be the higher-return activity.
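The arithmetic behind that claim can be sketched under a loud assumption: that the gap splits additively into a harness share and a model share, and that Elsaid’s 95% can be read as the harness share. Neither assumption comes from the source, and the function name and parameters below are hypothetical. This is a toy illustration, not a measurement.

```python
def remaining_gap(harness_share: float, model_fix: float, harness_fix: float) -> float:
    """Fraction of the original gap left after fixing part of each component.

    Assumes the gap is an additive split between harness and model,
    which is a simplification made only for illustration.
    """
    model_share = 1.0 - harness_share
    return harness_share * (1.0 - harness_fix) + model_share * (1.0 - model_fix)


# A model upgrade that removes the entire model-side constraint, harness untouched.
model_only = remaining_gap(0.95, model_fix=1.0, harness_fix=0.0)

# A harness upgrade that removes half the harness-side constraint, model untouched.
harness_half = remaining_gap(0.95, model_fix=0.0, harness_fix=0.5)

print(f"gap left after a perfect model upgrade: {model_only:.1%}")
print(f"gap left after a half-fixed harness:    {harness_half:.1%}")
```

On these assumed numbers, a perfect model upgrade leaves about 95% of the gap in place, while fixing half of the harness leaves about 52.5%. The split is the whole argument: if the true harness share is lower, the advantage shrinks accordingly.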

None of this is settled. Two practitioners making related observations is a signal, not a finding. The Elsaid figure is experience-based, not benchmarked, and the domain is customer service agents specifically. Whether the 95% claim generalizes to coding agents, research agents, or multi-step enterprise workflows is an open question. Lee’s observation about the narrowing gap is also directional rather than quantified. A reader who wants controlled evidence will not find it in either account, and that is worth saying plainly.

What is worth watching is whether this framing migrates from practitioners to builders and eventually to buyers. If orchestration quality becomes the primary axis on which enterprise AI deployments are evaluated, that changes the market structure considerably. It shifts attention from which model a vendor uses toward how well the vendor has built around it. Companies that have invested deeply in harness engineering accumulate a durable advantage that a competitor cannot close simply by upgrading its model subscription. The emergence of this framing in close succession from Elsaid and Lee suggests that at least some practitioners are already thinking this way.

❦

The Editor, for the readers of Signal Headquarters
