The Limits of Content-Based Prediction

The market thesis, tested somewhere markets cannot reach.

Axiome is built on a single claim: artificial intelligence cannot predict the output of a process that has no stable ground truth. Market prices are the canonical example. They are not measurements of reality — they are the aggregate of every participant's attempt to predict every other participant. A model trained on that data converges to a consensus that is, by definition, already priced in.

If that claim held only for markets, it would be an observation about finance. It is not. The same structure appears wherever an outcome is a collective judgment rather than a fact. The study summarized here demonstrates it — adversarially, and under pre-registration so the conclusion could not be reverse-engineered from the result — in a domain with no connection to markets at all: the review of software code.

Can you predict acceptance from content alone?

Automated coding agents differ in cost by more than an order of magnitude. The obvious way to save money is to route cheap agents to easy work and expensive ones to hard work — which requires a predictor: given the content of a proposed change, will a team accept it, or send it back for costly rework? Crucially, for that prediction to have any routing value, its power must come from the content. It cannot come from who authored the change, who reviews it, how large it is, or which part of the system it touches — a router writes a diff; it cannot change any of those things.

The study asked, directly and adversarially, whether that content signal exists — and fixed every decision threshold in advance, so that a negative result could not later be reinterpreted as a success.

Acceptance splits into two signals that behave oppositely.

A single, robust structure emerged across all five phases of the study. The decision to accept a change decomposes into two components that point in opposite directions.

Absolute

Will this specific change clear the bar?

Governed almost entirely by context a router cannot move — author, reviewer, size, subsystem. Beyond those confounders, the content of the code adds essentially nothing.

Comparative

Which of two solutions to one task is better?

Carries a real, if modest, signal — stronger in consistent teams. But it is largely shared across teams, and never converts into the absolute prediction routing requires.

A router lives entirely in the absolute world: it must decide whether one cheaper output is acceptable, not merely rank two outputs it has already paid to generate. And in the absolute world, the signal is not there. What is predictable — who, how big, which area — is exactly the part no model can act on. The predictable and the actionable are disjoint.

Why the result holds

Built to find the signal — and it still wasn't there.

The finding is structural, not a feature-engineering shortfall. Three properties rule out "better features would have found it": two independent statistical routes converge on the same null, the null reappears against a clean defect oracle and a strong quality judge, and it survives the two settings most favorable to a content signal — within-task comparison and per-team training.

The protocol

Pre-registered. Every threshold was fixed before analysis; none was relaxed; near-misses are reported as failures.
Confounder-residualized. Every content claim is measured as an increment beyond a block of factors the router cannot change — via two methods that must agree.
Two platforms. 17,090 merged changes across five large engineering projects, replicated on 6,042 pull requests from a second platform.
An objective oracle. Results re-tested against a defect target backed by reverts and fault-tracing — so the conclusion doesn't depend on a human label.
A blinded quality judge. A strong language model scored each change on quality dimensions, denied all outcome and identity information, with measured reliability.
Per-team training. Six consistent mid-sized teams, each modeled on its own history — the best case for an ownable, local signal.

A structural rhyme, not a coincidence.

Code review and markets share no mechanics, yet they share a structure. Both are consensus- and identity-driven processes whose output is a judgment produced by interacting participants. In both, the study's phrasing applies exactly:

"A process whose output is a judgment produced by interacting participants, in which the content of the judged object explains far less of the outcome than the identities and efforts of the judges."

That is the market thesis, restated in another field. Prices aggregate disagreement; acceptance aggregates reviewers. In each case a model trained on the historical record can reach real accuracy — but that accuracy is carried by structure it cannot act on, while the part it could act on carries no signal.

The constructive response is the one Axiome is built on: operate only where ground truth is automatic — questions with stable, verifiable answers — and never ask a model to infer a signal from a process that does not contain one. It is the same line that separates a profitable system from an expensive one, drawn here in a second domain.

The Limits of Content-Based Quality Signal