Axiome Research

The Limits of Content-Based Quality Signal

A pre-registered, confounder-aware study in software code review — and why the same structure governs financial markets.

Pre-registered · 14 experiments · 2 platforms · ~23,000 decisions · 6 teams · Oracle-backed defect target

Why this is here

The market thesis, tested somewhere markets cannot reach.

Axiome is built on a single claim: artificial intelligence cannot predict the output of a process that has no stable ground truth. Market prices are the canonical example. They are not measurements of reality — they are the aggregate of every participant's attempt to predict every other participant. A model trained on that data converges to a consensus that is, by definition, already priced in.

If that claim held only for markets, it would be an observation about finance. It does not hold only there. The same structure appears wherever an outcome is a collective judgment rather than a fact. The study summarized here demonstrates it (adversarially, and under pre-registration so the conclusion could not be reverse-engineered from the result) in a domain with no connection to markets at all: the review of software code.


The question

Can you predict acceptance from content alone?

Automated coding agents differ in cost by more than an order of magnitude. The obvious way to save money is to route cheap agents to easy work and expensive ones to hard work, which requires a predictor: given the content of a proposed change, will a team accept it, or send it back for costly rework? Crucially, for that prediction to have any routing value, its power must come from the content. It cannot come from who authored the change, who reviews it, how large it is, or which part of the system it touches. A router only chooses which agent writes the diff; it cannot change any of those things.

The study asked, directly and adversarially, whether that content signal exists — and fixed every decision threshold in advance, so that a negative result could not later be reinterpreted as a success.
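What "fixing a threshold in advance" can look like in code is sketched below. This is an illustration only: the bar value, the percentile-bootstrap interval, and the names `PREREGISTERED_BAR`, `bootstrap_interval`, and `verdict` are assumptions made for the example, not the study's actual rule. The idea is that the constant is frozen before any data is seen, and a result passes only if the whole interval clears it, so a near-miss is a failure.

```python
import random
from statistics import mean

# Hypothetical bar, frozen before any analysis is run (not the study's value).
PREREGISTERED_BAR = 0.02


def bootstrap_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def verdict(increments, bar=PREREGISTERED_BAR):
    """Pass only if the lower end of the interval exceeds the bar."""
    lower, _upper = bootstrap_interval(increments)
    return "pass" if lower > bar else "fail"


# Made-up per-fold increments, for illustration only.
near_zero = [0.004, -0.003, 0.006, 0.001, -0.002, 0.005, 0.000, 0.003]
near_miss = [0.018, 0.021, 0.019, 0.020, 0.017, 0.022, 0.019, 0.018]
clear_win = [0.050, 0.060, 0.055, 0.048, 0.052, 0.061]

for name, data in [("near_zero", near_zero), ("near_miss", near_miss), ("clear_win", clear_win)]:
    print(name, verdict(data))
```

In this sketch the near-miss series averages just under the bar and is reported as a failure, the same as the series centered on zero. The rule leaves no room to relabel it afterwards.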


The finding

Acceptance splits into two signals that behave oppositely.

A single, robust structure emerged across all five phases of the study. The decision to accept a change decomposes into two components that point in opposite directions.

Absolute

Will this specific change clear the bar?

Governed almost entirely by context a router cannot move — author, reviewer, size, subsystem. Beyond those confounders, the content of the code adds essentially nothing.

Comparative

Which of two solutions to one task is better?

Carries a real, if modest, signal — stronger in consistent teams. But it is largely shared across teams, and never converts into the absolute prediction routing requires.

A router lives entirely in the absolute world: it must decide whether one cheaper output is acceptable, not merely rank two outputs it has already paid to generate. And in the absolute world, the signal is not there. What is predictable — who, how big, which area — is exactly the part no model can act on. The predictable and the actionable are disjoint.

[Figure: Feature importance: confounders dominate, content features rank low]
Who, not what. Ranked feature importance for predicting rework. The factors that matter are reviewer identity, subsystem, reviewer strictness, and author history (blue), the context a router cannot change. Content features (red) cluster at the bottom; the strongest is commit-message length, a proxy for author diligence rather than code quality.

[Figure: Predictive power appears at the confounder step and barely rises with content or quality]
Where predictability comes from. Held-out accuracy as each block is added, for six teams. The jump happens when confounders are added (size → context). Adding the content of the code, and then a quality judgment of it, barely moves the line. Predictability is context-carried, not quality-driven.

[Figure: Per-team quality signal beyond confounders; no team clears the pre-registered bar]
The decisive null. The increment a blinded quality judge adds beyond context and content, trained on each team's own history, the setting most favorable to a content signal. Across all six teams, none clears the pre-registered bar (dashed line); most intervals include zero.

[Figure: Within-task comparative signal rises from chance to above the bar in consistent teams]
Where signal does live, and why it doesn't help. The one place a content signal appears: ranking two solutions to the same task. It rises from chance (surface features) to clearly positive in consistent teams. But this comparative signal is largely shared across teams and does not convert into the absolute judgment a router needs.
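
A within-task comparative signal of this kind is commonly scored as pairwise accuracy: the fraction of same-task pairs in which a scorer ranks the preferred solution higher, with 0.5 as chance. The sketch below assumes that metric; the summary does not state which metric the study used, and the `Pair` layout, the function name, and the example scores are hypothetical.

```python
from typing import Iterable, Tuple

# (score for solution A, score for solution B, True if solution A was preferred)
Pair = Tuple[float, float, bool]


def pairwise_accuracy(pairs: Iterable[Pair]) -> float:
    """Fraction of pairs ranked correctly; tied scores earn half credit."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    credit = 0.0
    for score_a, score_b, a_preferred in pairs:
        if score_a == score_b:
            credit += 0.5            # scorer expresses no preference
        elif (score_a > score_b) == a_preferred:
            credit += 1.0            # scorer agrees with the preference
    return credit / len(pairs)


# Made-up scores for four pairs of solutions, each pair answering one task.
example_pairs = [
    (0.9, 0.4, True),    # ranked correctly
    (0.2, 0.7, False),   # ranked correctly
    (0.6, 0.6, True),    # tie, half credit
    (0.8, 0.3, False),   # ranked incorrectly
]
print(pairwise_accuracy(example_pairs))  # 0.625
```

Note what the metric needs: both solutions must already exist before they can be compared. That is why a ranking signal, however real, does not help a router that has to decide on a single output.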

Why the result holds

Built to find the signal — and it still wasn't there.

The finding is structural, not a feature-engineering shortfall. Three properties rule out "better features would have found it": two independent statistical routes converge on the same null; the null reappears against a clean defect oracle and a strong quality judge; and it survives the two settings most favorable to a content signal, per-team training and within-task comparison, where the comparative signal that does appear never converts into absolute prediction.

The protocol

  • Pre-registered. Every threshold was fixed before analysis; none was relaxed; near-misses are reported as failures.
  • Confounder-residualized. Every content claim is measured as an increment beyond a block of factors the router cannot change — via two methods that must agree.
  • Two platforms. 17,090 merged changes across five large engineering projects, replicated on 6,042 pull requests from a second platform.
  • An objective oracle. Results re-tested against a defect target backed by reverts and fault-tracing — so the conclusion doesn't depend on a human label.
  • A blinded quality judge. A strong language model scored each change on quality dimensions, denied all outcome and identity information, with measured reliability.
  • Per-team training. Six consistent mid-sized teams, each modeled on its own history — the best case for an ownable, local signal.
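
The idea behind "confounder-residualized" can be illustrated on synthetic data. The sketch below is a minimal example under stated assumptions, not the study's method: it uses a single confounder (a hypothetical reviewer id), removes it by within-group demeaning, and measures a correlation rather than a held-out increment. All names and numbers are invented for the illustration. By construction, rework depends only on the reviewer, while the content feature merely tracks the reviewer.

```python
import random
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residualize(groups, values):
    """Subtract each group's mean, removing what the confounder explains."""
    totals, counts = defaultdict(float), defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]


def simulate(n=20000, seed=0):
    """Synthetic reviews: rework depends on the reviewer only, never on content."""
    rng = random.Random(seed)
    reviewers, content, rework = [], [], []
    for _ in range(n):
        reviewer = rng.randrange(6)               # hypothetical reviewer id, 0..5
        feature = reviewer + rng.gauss(0.0, 1.0)  # content feature that tracks the reviewer
        p_rework = 0.1 + 0.12 * reviewer          # stricter reviewers send more back
        reviewers.append(reviewer)
        content.append(feature)
        rework.append(1.0 if rng.random() < p_rework else 0.0)
    return reviewers, content, rework


reviewers, content, rework = simulate()
raw = pearson(content, rework)
partial = pearson(residualize(reviewers, content), residualize(reviewers, rework))
print(f"raw correlation:          {raw:.3f}")
print(f"residualized correlation: {partial:.3f}")
```

In this toy setup the raw correlation is clearly positive, so the content feature looks predictive, yet the residualized correlation sits near zero. The apparent signal was carried entirely by the reviewer, which is the pattern the protocol is designed to expose.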

The bridge to markets

A structural rhyme, not a coincidence.

Code review and markets share no mechanics, yet they share a structure. Both are consensus- and identity-driven processes whose output is a judgment produced by interacting participants. In both, the study's phrasing applies exactly:

"A process whose output is a judgment produced by interacting participants, in which the content of the judged object explains far less of the outcome than the identities and efforts of the judges."

That is the market thesis, restated in another field. Prices aggregate disagreement; acceptance aggregates reviewers. In each case a model trained on the historical record can reach real accuracy — but that accuracy is carried by structure it cannot act on, while the part it could act on carries no signal.

The constructive response is the one Axiome is built on: operate only where ground truth is automatic — questions with stable, verifiable answers — and never ask a model to infer a signal from a process that does not contain one. It is the same line that separates a profitable system from an expensive one, drawn here in a second domain.

