Comparative Advantage: Calibrated Pairwise Win Rates from LLM Judge Scores

Causal Judge Evaluation (CJE; Landesberg, 2025) formalizes a workflow that is already widespread in practice: score every response with a cheap LLM judge, label a small slice with a trusted oracle, learn the judge-to-oracle mapping, and evaluate at scale with calibrated uncertainty. However, calibration is only as informative as the judge's ability to discriminate: the calibration map $m(s) = E[Q \mid S = s]$ adds value only if $E[Q \mid S]$ actually varies with $s$. The public CJE Arena sample puts pressure on that assumption. It contains 480 oracle-labeled responses scored by a GPT-4.1-nano judge against a GPT-5 oracle. The judge fails as a ruler: Pearson $r$ is 0.39, $R^2$ is negative, and 61% of scores are exactly 0.85. When that many scores are identical, $E[Q \mid S = 0.85]$ is barely more than the unconditional mean and the calibration map collapses toward a constant.
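To see the collapse mechanically, here is a minimal synthetic sketch (not the Arena data; the 61% mass point and noise scale are assumed for illustration). When most of the score distribution sits at a single value, the conditional mean at that value is pinned near the unconditional mean, so calibration has almost nothing to redistribute:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 480
# ~61% of judge scores pinned at 0.85 (assumed synthetic shape).
pinned = rng.random(n) < 0.61
s = np.where(pinned, 0.85, rng.uniform(0.5, 1.0, n))
# Oracle quality weakly tied to the score, plus noise (assumed).
q = 0.5 * s + rng.normal(0, 0.05, n)

print(f"E[Q]            = {q.mean():.3f}")
print(f"E[Q | S = 0.85] = {q[s == 0.85].mean():.3f}")  # nearly identical
```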

The natural response is to change the target. Rather than asking the judge to place responses on a shared cardinal scale, ask it to act as a referee: which of these two responses is better? Induced pairwise margins do this without any additional judge queries. For each pair of responses $i, j$, define $D_{ij} = S_i - S_j$. Even though the individual scores are compressed, their differences retain ordinal information. Across the 111,400 oracle-comparable unordered pairs formed from the 480 labeled responses (replication code), the sign of the judge margin agrees with the oracle 73.6% of the time when non-zero; the tie-aware concordance index is 0.641, rising to 0.710 when the oracle gap exceeds 0.3. On same-prompt probe comparisons, accuracy on non-tied pairs is 84.6% (95% Wilson interval [0.665, 0.938], n = 26). The high tie rate (46.9%) is informative rather than a failure: the judge abstains when policies are close and discriminates when they differ. The pattern of comparative signal outperforming cardinal scores is consistent with findings from NLG evaluation more broadly (Liusie et al., 2024; Liu et al., 2024).
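The induced-margin diagnostics can be sketched as follows. The data here are synthetic (the reported 73.6% and 0.641 come from the Arena replication code), and the tie-aware concordance index is assumed to give half credit when the judge abstains:

```python
import numpy as np
from itertools import combinations

def margin_diagnostics(judge, oracle):
    """Sign agreement on judge-decisive pairs, plus tie-aware concordance."""
    agree = decisive = conc = total = 0
    for i, j in combinations(range(len(judge)), 2):
        dj, do = judge[i] - judge[j], oracle[i] - oracle[j]
        if do == 0:
            continue                    # oracle-tied pairs are not comparable
        total += 1
        if dj == 0:
            conc += 0.5                 # judge abstains: half credit
        else:
            decisive += 1
            hit = (dj > 0) == (do > 0)
            agree += hit
            conc += hit
    return agree / decisive, conc / total

rng = np.random.default_rng(1)
oracle = rng.random(200)
judge = np.round(oracle + rng.normal(0, 0.2, 200), 1)  # coarse, noisy judge
acc, c_index = margin_diagnostics(judge, oracle)
```

Because the judge scores are rounded to a coarse grid, many margins are exactly zero, which is how abstention enters the concordance index.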

| | Pointwise | Pairwise (induced margins) |
|---|---|---|
| Oracle alignment | Pearson $r$ = 0.39, $R^2$ = −0.026 | Accuracy = 73.6%; AUC = 0.707 |
| When judge is decisive | 61% of scores exactly 0.85 | 84.6% same-prompt accuracy |

The natural estimand is therefore the oracle win rate $\theta_{A,B} = \mathbb{E}[\mathbf{1}\{Q(X, R_A) > Q(X, R_B)\}]$: the probability policy A beats policy B on a random prompt. Gao et al. (2024) target the same estimand using Bayesian inference over a human-labeled subset; the approach here instead fits a calibration map $m(z) = \mathbb{E}[Y_{A,B} \mid Z_{A,B} = z]$ on the oracle slice and, by the law of iterated expectations, recovers $\theta_{A,B} = \mathbb{E}[m(Z_{A,B})]$ as a consistent plug-in estimator. A symmetric logistic fit ($\hat{\beta} = 4.26$) maps judge margins to calibrated win probabilities and ensures $\theta_{A,B} + \theta_{B,A} = 1$.
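A minimal sketch of the symmetric logistic calibration, assuming the map takes the one-parameter form $m(z) = \sigma(\beta z)$ with no intercept (which is what forces $m(z) + m(-z) = 1$). The data are synthetic; $\hat{\beta} = 4.26$ in the text comes from the real oracle slice:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_beta(z, y, grid=np.linspace(0.1, 20, 400)):
    """Grid-search MLE for the one-parameter map m(z) = sigmoid(beta * z)."""
    p = expit(np.outer(grid, z))
    nll = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean(axis=1)
    return grid[nll.argmin()]

rng = np.random.default_rng(2)
z_lab = rng.normal(0.0, 0.3, 480)                # judge margins, oracle slice
y_lab = (rng.random(480) < expit(4.0 * z_lab)).astype(float)  # oracle outcomes
beta_hat = fit_beta(z_lab, y_lab)

z_all = rng.normal(0.05, 0.3, 10_000)            # margins on unlabeled pairs
theta_hat = expit(beta_hat * z_all).mean()       # plug-in oracle win rate
```

Averaging $m(\hat{z})$ over the unlabeled margins is exactly the law-of-iterated-expectations step: each comparison contributes its calibrated win probability rather than a hard vote.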

The plug-in estimator $\hat{\theta}_{A,B}$ has two variance components: sampling over the unlabeled comparisons, and estimation error in $\hat{m}$ from the finite oracle slice. Standard confidence intervals account only for the first; propagating both requires jackknife or nested bootstrap over the full oracle-then-calibrate pipeline. (Naive intervals that treat judge scores as ground truth are systematically overconfident, as CJE documents in detail.)
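A hedged sketch of the propagation step: a delete-one jackknife over the oracle slice, refitting the symmetric logistic map each time. This captures only the second variance component (estimation error in $\hat{m}$); the sampling variance over the unlabeled comparisons still needs its own term. Data and functional form are synthetic assumptions as before:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_beta(z, y, grid=np.linspace(0.1, 20, 400)):
    p = expit(np.outer(grid, z))
    nll = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean(axis=1)
    return grid[nll.argmin()]

def jackknife_se(z_lab, y_lab, z_all):
    """SE of the plug-in estimate from refitting without each oracle pair."""
    n = len(z_lab)
    thetas = np.empty(n)
    for k in range(n):                   # refit calibration without pair k
        keep = np.arange(n) != k
        b = fit_beta(z_lab[keep], y_lab[keep])
        thetas[k] = expit(b * z_all).mean()
    return np.sqrt((n - 1) / n * ((thetas - thetas.mean()) ** 2).sum())

rng = np.random.default_rng(3)
z_lab = rng.normal(0.0, 0.3, 120)
y_lab = (rng.random(120) < expit(4.0 * z_lab)).astype(float)
z_all = rng.normal(0.05, 0.3, 5_000)
se_cal = jackknife_se(z_lab, y_lab, z_all)
```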

One practical issue arises for $K \geq 3$ policies. Pairwise win rates and mean oracle scores are guaranteed to agree on a ranking when one policy is consistently better than another across prompt types. The same-prompt comparisons show that this condition can fail: the parallel-universe-prompt policy wins 60% of head-to-heads against base despite having a lower mean oracle score (diff = −0.087). The only way this can happen is if it beats base on some prompt types and loses on others, winning enough of the former to lead in head-to-heads while trailing on average. That heterogeneity is precisely what generates Condorcet cycles with a third policy: if some policy C were better than parallel-universe on the prompts where parallel-universe beats base, while losing to base overall, we would have a cycle. We cannot observe this directly here (we only have base-vs-target comparisons, not target-vs-target), but the conditions that produce cycles are demonstrably present.

For $K \geq 3$ policies we therefore recommend estimating the full win-rate matrix and running a non-transitivity test before summarizing into a Bradley-Terry ranking. If the test does not reject, standard BT applies. If it does, inference offers two structured responses rather than a fallback to the raw matrix. When the non-transitivity is driven by prompt heterogeneity, with different models excelling on different task types, fit Bradley-Terry per prompt category and compose win rates for any target distribution as a weighted sum across categories. When cycles persist even after conditioning on category, use a Hodge decomposition to split the win-rate matrix into a transitive (gradient) component that can be calibrated to a scalar ranking and a cyclic (curl) component that cannot, and report the curl fraction as the share of preference variance the ranking ignores.
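As a sketch of the Hodge step: on a complete comparison graph, the log-odds flow $Y_{ij} = \log(P_{ij}/P_{ji})$ splits into a gradient part built from least-squares scores plus a cyclic residual. The 3-policy matrix below is a synthetic rock-paper-scissors example, not estimated from the Arena data:

```python
import numpy as np

def hodge_split(P):
    """Split the log-odds flow into transitive (gradient) + cyclic parts."""
    Y = np.log(P / P.T)                  # antisymmetric flow: Y[i,j] = -Y[j,i]
    np.fill_diagonal(Y, 0.0)
    s = Y.mean(axis=1)                   # least-squares scores (complete graph)
    grad = s[:, None] - s[None, :]       # transitive component
    curl = Y - grad                      # cyclic residual
    curl_frac = (curl ** 2).sum() / (Y ** 2).sum()
    return s, curl_frac

# Pure cycle: A beats B, B beats C, C beats A, each at 0.7.
P = np.array([[0.5, 0.7, 0.3],
              [0.3, 0.5, 0.7],
              [0.7, 0.3, 0.5]])
scores, curl_frac = hodge_split(P)       # scores ~ 0, curl fraction ~ 1
```

In this extreme case the gradient component vanishes and the curl fraction is 1: the ranking would ignore essentially all of the preference variance, which is exactly the failure mode the report-the-curl-fraction recommendation is meant to surface.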
