Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport

Published 9 May 2026 in cs.CL | (2605.09227v1)

Abstract: [Abridged] Using an LLM as an automatic rater (LLM-as-a-judge) is cheap but potentially biased: some judges run lenient, others strict, the middle of the scale gets compressed, and verbose answers may be over-rewarded. A common remedy is post-hoc calibration: leave the cheap judge in place and, on a modest set of paired anchors, fit a transformation from raw judge scores to an estimate of the human rating. We compare two correctors that take opposing views on how this mapping should be modeled: a parametric, small-anchor hierarchical Bayesian linear correction with per-score uncertainty, and a non-parametric Neural-ODE (FFJORD) score-transport flow. Both are run head-to-head on UltraFeedback fine-grained_score (1700 paired examples, 200 held out), with calibration split into three operational sub-questions: population-mean recovery, per-item accuracy, and distributional-shape match. The headline result is that the choice between methods is primarily a data-budget question. Both correctors close the raw $+0.71$-point mean offset to within $\pm 0.08$ of the GPT-4 reference, at 100 and at 1500 anchors. Past that, the methods swap roles. With 100 anchors, the linear corrector reconstructs the human-score distribution roughly twice as well by KL divergence (0.031 vs. 0.058) and ties the flow on MAE. With 1500 anchors the flow wins on every metric (MAE 0.320 vs. 0.359, Pearson 0.922 vs. 0.896, KL 0.026 vs. 0.037). The Bayesian linear corrector saturates well below 1500 anchors: residual $\tanh$-shaped non-linearity is, by construction, structure a linear correction cannot fit. The flow keeps improving as labels grow. We translate these findings into an explicit decision rule for production deployments.

Authors (1)

Andrea Morandi

Summary

The paper presents a direct comparison of hierarchical Bayesian linear calibration and Neural-ODE score transport for debiasing LLM judge scores.
It demonstrates that Bayesian correction excels in low-anchor regimes with affine adjustments and uncertainty estimates, while Neural-ODE outperforms in high-anchor settings by modeling non-linear effects.
Evaluation across population-mean, per-item accuracy, and distributional shape metrics informs actionable deployment guidelines for robust, unbiased LLM evaluation.

Comparative Quantitative Evaluation of Post-hoc Debiasing Methods for LLM-as-a-Judge

Introduction

Using LLMs as automated evaluators ("LLM-as-a-judge") for open-ended tasks introduces systematic biases in the generated scores. Common artifacts include judge leniency or strictness, mid-scale compression, and distortions that depend on content style. Calibrating these automated scores so that they track human ratings is therefore necessary in production QA and benchmarking workflows. The predominant approach is post-hoc bias correction: given a modest set of anchor pairs (items for which both an expensive human score and a cheap LLM-judge score are available), the goal is to learn an accurate mapping from judge output to calibrated scores.

This work analyzes two fundamentally different calibration methods: (1) hierarchical Bayesian linear correction, and (2) a Neural-ODE (FFJORD) continuous-time normalizing flow. The two are compared empirically under realistic anchor budgets on the UltraFeedback fine-grained scoring benchmark, using a suite of operationally relevant evaluation metrics. The paper derives an explicit decision rule for choosing between the methods in deployment.

Problem Formulation and Operational Metrics

Calibration Objective

Let $y \in [1,5]$ denote the high-fidelity human (or GPT-4) reference score, and $j \in [1,5]$ the LLM-judge score. The dataset consists of $n$ anchor pairs $(j_i, y_i)$ and a held-out test set for assessment. The design goal is to learn a corrector $\hat{y}(j)$ that closely recovers $y$ under realistic operational constraints.

Threefold Evaluation

The paper argues that calibration quality is inherently multi-dimensional, and evaluates correctors along three complementary axes:

Population-mean recovery: Absolute mean error between corrected and reference means (global calibration).
Per-item accuracy: Mean absolute error (MAE) and Pearson correlation over individual item corrections (local fidelity).
Distributional shape match: Symmetrized KL divergence between corrected and reference distributions (global distributional fidelity).

A key insight is that no single axis is sufficient to characterize a corrector: the paper gives counterexamples in which naive or marginal-only correctors score well on one axis while failing on another.
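The three axes above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's evaluation code: the function names, the 20-bin histogram estimate of the densities, the smoothing constant `eps`, and the choice of the sum (rather than the average) of the two KL directions are all assumptions. The usage lines build one marginal-only "corrector", a random shuffle of the reference scores, to show how a method can look perfect on the mean and shape axes while failing on per-item accuracy.

```python
import numpy as np

def mean_gap(y_hat, y):
    """Population-mean recovery: |mean(corrected) - mean(reference)|."""
    return abs(float(np.mean(y_hat)) - float(np.mean(y)))

def mae(y_hat, y):
    """Per-item accuracy: mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_hat) - np.asarray(y))))

def pearson(y_hat, y):
    """Per-item accuracy: Pearson correlation."""
    return float(np.corrcoef(y_hat, y)[0, 1])

def symmetrized_kl(y_hat, y, bins=20, lo=1.0, hi=5.0, eps=1e-9):
    """Shape match: KL(p||q) + KL(q||p) on smoothed histograms over [lo, hi].
    Binning and smoothing are illustrative choices, not taken from the paper."""
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(np.clip(y_hat, lo, hi), bins=edges)
    q, _ = np.histogram(np.clip(y, lo, hi), bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Marginal-only counterexample: shuffling the reference scores keeps the
# mean and the histogram intact but destroys the item-level pairing.
rng = np.random.default_rng(0)
y = rng.uniform(1.0, 5.0, size=2000)   # hypothetical reference scores
y_shuffled = rng.permutation(y)
report = {
    "mean_gap": mean_gap(y_shuffled, y),
    "kl": symmetrized_kl(y_shuffled, y),
    "mae": mae(y_shuffled, y),
    "pearson": pearson(y_shuffled, y),
}
```

In `report`, the mean gap and KL are essentially zero while MAE is large and the correlation is near zero, which is why the axes have to be read together.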

Hierarchical Bayesian Linear Calibration

Model Structure and Inference

Each (judge, rubric) cell fits a 3-parameter linear Gaussian model, $y_i \sim \mathcal{N}(a + \beta j_i, \sigma^2)$, with intercept $a$, slope $\beta$, and noise scale $\sigma$, under weakly informative priors. Hierarchy across rubrics (when available) is enforced via population priors and partial pooling, which improves statistical efficiency in low-anchor regimes. Posterior inference uses NUTS sampling, with convergence diagnostics per [8,9].
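To make the linear Gaussian model concrete, here is a deliberately simplified single-cell sketch. It is not the paper's method: instead of NUTS and a hierarchical prior, it uses the closed-form conjugate posterior for $(a, \beta)$ with a plug-in noise variance estimated from least-squares residuals. The prior (centered on the identity map, standard deviation 2) and all function names are assumptions chosen for illustration.

```python
import numpy as np

def fit_bayes_linear(j, y, prior_mean=(0.0, 1.0), prior_sd=2.0):
    """Conjugate Gaussian posterior over (a, beta) for y ~ N(a + beta*j, sigma^2).
    sigma^2 is a plug-in estimate from OLS residuals (a simplification)."""
    j = np.asarray(j, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(j), j])
    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = max(len(y) - 2, 1)
    sigma2 = max(float(np.sum((y - X @ w_ols) ** 2) / dof), 1e-12)
    prior_prec = np.eye(2) / prior_sd**2
    post_cov = np.linalg.inv(prior_prec + X.T @ X / sigma2)
    post_mean = post_cov @ (prior_prec @ np.asarray(prior_mean) + X.T @ y / sigma2)
    return post_mean, post_cov, sigma2

def predict(j_new, post_mean, post_cov, sigma2):
    """Corrected score and predictive standard deviation for each judge score."""
    j_new = np.atleast_1d(np.asarray(j_new, dtype=float))
    X = np.column_stack([np.ones_like(j_new), j_new])
    mean = X @ post_mean
    # Parameter uncertainty (x^T S x) plus observation noise.
    sd = np.sqrt(np.einsum("ij,jk,ik->i", X, post_cov, X) + sigma2)
    return mean, sd

# Hypothetical anchors: a judge related to the reference by an affine map.
rng = np.random.default_rng(1)
j_anchor = rng.uniform(1.0, 5.0, size=100)
y_anchor = 0.5 + 0.8 * j_anchor + rng.normal(scale=0.3, size=100)
post_mean, post_cov, sigma2 = fit_bayes_linear(j_anchor, y_anchor)
y_hat, y_sd = predict([2.0, 3.0, 4.0], post_mean, post_cov, sigma2)
```

The predictive standard deviation `y_sd` plays the role of the per-score uncertainty described above; a full hierarchical fit would additionally share information across rubrics through the population prior.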

Saturation and Expressiveness

Bayesian linear correction is most data-efficient in the small-anchor regime ($n \sim 50$–$100$) but is structurally limited to affine corrections. As $n \to \infty$, the model saturates: no amount of additional data can remove the residual error caused by non-linear structure in the judge bias. The method provides posterior distributions over model parameters, which yield actionable signals (e.g., a low slope posterior as an indicator of prompt drift).
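The saturation claim can be made explicit with a short derivation. The notation here is introduced for illustration and is not the paper's: write $m(j) = \mathbb{E}[y \mid j]$ for the true conditional mean and assume a constant conditional variance $\sigma^2$. With a fixed prior and growing $n$, the linear fit approaches the least-squares projection of $y$ onto affine functions of $j$:

$$
(a^\star, \beta^\star) = \arg\min_{a,\beta}\; \mathbb{E}\big[(y - a - \beta j)^2\big],
$$

and its mean squared error decomposes as

$$
\mathbb{E}\big[(y - a^\star - \beta^\star j)^2\big] \;=\; \sigma^2 \;+\; \mathbb{E}\big[(m(j) - a^\star - \beta^\star j)^2\big].
$$

The second term does not depend on $n$ and vanishes only if $m$ is affine. If the judge bias contains a $\tanh$-shaped component, that term is a fixed error floor for any linear corrector, whereas a flexible model can in principle drive it toward zero as anchors accumulate.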

Neural-ODE / FFJORD Score Transport

Model Architecture

A continuous-time normalizing flow transforms input judge scores into calibrated estimates via the solution to an ODE parameterized by a multi-layer perceptron. This model is fundamentally non-parametric and capable of fitting arbitrary smooth conditional means $\mathbb{E}[y \mid j]$, subsuming any complex or multi-modal relationships that defeat linear approaches. MC-dropout approximates uncertainty, and the architecture supports conditional input heads for multi-rubric deployment.
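The mechanics of the transport map can be illustrated without a deep-learning framework. The sketch below only shows the forward pass of a one-dimensional continuous-time flow: a small MLP vector field with random, untrained weights, integrated with a fixed-step RK4 solver. It is not FFJORD itself (there is no training, no log-density tracking, no MC-dropout, and no conditioning), and the layer sizes, weight scales, and function names are assumptions.

```python
import numpy as np

def make_field(hidden=16, seed=0):
    """Tiny MLP vector field f(z, t); weights are random, not trained."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(2, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))

    def f(z, t):
        inp = np.stack([z, np.full_like(z, t)], axis=1)  # shape (n, 2)
        return (np.tanh(inp @ W1 + b1) @ W2)[:, 0]

    return f

def integrate(f, z0, t0=0.0, t1=1.0, steps=200):
    """Fixed-step RK4 solve of dz/dt = f(z, t) from t0 to t1."""
    z = np.asarray(z0, dtype=float).copy()
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z

f = make_field()
j = np.linspace(1.0, 5.0, 50)                 # raw judge scores
z1 = integrate(f, j)                          # transported scores at t = 1
j_back = integrate(f, z1, t0=1.0, t1=0.0)     # run the same ODE backwards
```

Two properties carry over to the trained model: the map is invertible (integrating backwards recovers `j` up to solver error), and in one dimension it preserves the ordering of scores, because trajectories of a well-behaved ODE cannot cross. A trained flow would additionally fit the field's weights to the anchors and, in the normalizing-flow setting, track the change in log-density through the divergence of the vector field.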

Data Regime and Expressiveness

While Neural-ODE provides superior representational power, it needs substantially more anchors per rubric to exploit that capacity without overfitting (the high-budget condition in the experiments uses 1500 anchors). Unlike its linear counterpart, the flow continues improving as the anchor count increases, capturing residual non-linearity that a linear correction leaves uncorrected.

Experimental Design and Results

Benchmark and Synthetic Bias

Experiments use 1700 UltraFeedback anchor pairs, with the reference scores as calibration targets. The synthetic judge injects a realistic bias: a $+0.71$-point mean offset, mid-scale compression, a $\tanh$ non-linear term that linear methods cannot remove, and content-style variability.
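A synthetic judge of this general kind can be sketched as follows. Only the $+0.71$ nominal offset comes from the source; the functional form, the slope, the $\tanh$ amplitude and sharpness, the noise level, and the omission of any content-style term are hypothetical choices made for illustration, and the paper's actual mechanism may differ.

```python
import numpy as np

def synthetic_judge(y, offset=0.71, slope=0.8, tanh_amp=0.3,
                    tanh_sharpness=2.0, noise_sd=0.3, seed=0):
    """Map reference scores y in [1, 5] to biased judge scores in [1, 5].
    All parameters except `offset` are hypothetical."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    centered = y - 3.0                           # distance from scale midpoint
    j = (3.0 + offset                            # lenient shift
         + slope * centered                      # slope < 1 shrinks the spread
         + tanh_amp * np.tanh(tanh_sharpness * centered))  # non-linear term
    j = j + rng.normal(scale=noise_sd, size=y.shape)       # rating noise
    return np.clip(j, 1.0, 5.0)                  # keep scores on the scale

rng = np.random.default_rng(42)
y_ref = rng.uniform(1.0, 5.0, size=1700)         # stand-in reference scores
j_raw = synthetic_judge(y_ref)
observed_offset = float(np.mean(j_raw) - np.mean(y_ref))
```

Because the scores are clipped to the rating scale, `observed_offset` comes out somewhat below the nominal `offset`; a corrector only ever sees the clipped scores.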

Core Results

Mean Recovery: Both methods close the population mean gap to within $\pm 0.08$ of the reference (Q1), even with as few as 100 anchors.

Per-item Accuracy (MAE):

With $n = 100$ anchors: Bayesian linear and Neural-ODE are statistically tied on MAE, but Neural-ODE achieves a higher Pearson correlation.
With $n = 1500$: Neural-ODE decisively outperforms ($0.320$ vs. $0.359$ MAE; $0.922$ vs. $0.896$ Pearson), capturing the synthetic non-linearity.
The Bayesian linear corrector saturates well below 1500 anchors; beyond that point, its MAE and Pearson correlation remain essentially flat.

Distributional Shape (KL):

At $n = 100$: Bayesian is superior ($0.031$ vs. $0.058$).
At $n = 1500$: Neural-ODE wins ($0.026$ vs. $0.037$), showing that non-linear transport helps shape matching once enough data is available.
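As a quick check on the magnitudes, the ratios implied by the figures quoted in the abstract work out as follows. The arithmetic is derived here from those figures and is not reported separately in the paper:

$$
\left.\frac{\mathrm{KL}_{\text{flow}}}{\mathrm{KL}_{\text{linear}}}\right|_{n=100} = \frac{0.058}{0.031} \approx 1.87,
\qquad
\left.\frac{\mathrm{KL}_{\text{linear}}}{\mathrm{KL}_{\text{flow}}}\right|_{n=1500} = \frac{0.037}{0.026} \approx 1.42,
\qquad
\left.\frac{\mathrm{MAE}_{\text{linear}} - \mathrm{MAE}_{\text{flow}}}{\mathrm{MAE}_{\text{linear}}}\right|_{n=1500} = \frac{0.359 - 0.320}{0.359} \approx 10.9\%.
$$

The first ratio is the "roughly twice as well" advantage of the linear corrector at 100 anchors; the other two quantify how the flow's lead at 1500 anchors is real but more moderate.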

Multi-seed Confirmation

Fifty random seeds confirm the robustness of the saturation finding. Bayesian linear's MAE and KL flatten at an anchor count well below 1500, while Neural-ODE improves reliably as anchors are added.

Practical and Theoretical Implications

Decision Rule for Deployment

Low anchor budget (e.g., the 100-anchor condition): Hierarchical Bayesian linear is the better choice. It is computationally efficient, robust, and delivers both point estimates and credible intervals.
High anchor budget (e.g., the 1500-anchor condition): Neural-ODE is superior on every metric (better per-item accuracy, higher correlation, and closer distributional shape), which can justify the larger labeling investment.
Unknown residuals: Begin with Bayesian linear; escalate to Neural-ODE only if residual analysis uncovers non-linear structure, and only together with expanded anchor collection.
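The rule above can be sketched as a small routine. Everything here is illustrative: the residual diagnostic (how much a cubic fit reduces the residual sum of squares of a straight-line fit), the `gain_threshold` default, the returned labels, and the function names are assumptions, and the anchor-count cutoff is left as a required argument because it has to be chosen per deployment.

```python
import numpy as np

def nonlinearity_gain(j, y, degree=3):
    """Fraction of the linear fit's residual sum of squares removed by a
    degree-`degree` polynomial fit. An in-sample heuristic, not a formal test."""
    j = np.asarray(j, dtype=float)
    y = np.asarray(y, dtype=float)
    rss_lin = np.sum((y - np.polyval(np.polyfit(j, y, 1), j)) ** 2)
    rss_poly = np.sum((y - np.polyval(np.polyfit(j, y, degree), j)) ** 2)
    return float(1.0 - rss_poly / rss_lin)

def choose_corrector(n_anchors, high_budget_min, gain=None, gain_threshold=0.05):
    """Return a label for the recommended next step.
    `high_budget_min` and `gain_threshold` are deployment-specific choices."""
    if n_anchors >= high_budget_min:
        return "neural_ode"
    if gain is not None and gain > gain_threshold:
        # Non-linear structure detected but too few anchors for a flow:
        # keep the linear corrector for now and label more anchors.
        return "bayesian_linear_and_collect_more_anchors"
    return "bayesian_linear"
```

For example, `choose_corrector(100, high_budget_min=1500, gain=nonlinearity_gain(j, y))` keeps the linear corrector when the diagnostic is small and flags the rubric for further anchor collection when it is large. Because the cubic fit always matches or beats the line in-sample, a held-out or F-test version of the diagnostic would be a more careful choice in practice.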

System Integration

Production systems often operate across heterogeneous rubrics; a practical portfolio implementation combines both methods, assigning linear correction to low-data rubrics and flows to anchor-rich rubrics. Diagnostic signals from both models (for example, the slope posterior of the linear corrector) assist in pipeline monitoring and drift detection.

Limitations and Future Work

While the synthetic bias regime is constructed to reflect real LLM issues, whether the findings generalize to more complex, production-scale failure modes (e.g., position biases, content-dependent multimodality) is an open question. The experiments focus on a single rubric; the architectures extend to large multi-rubric settings, but that extension still needs validation. The paper also discusses more principled distributional-fidelity metrics, such as Wasserstein distances.
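For reference, a one-dimensional Wasserstein-1 distance is simple to compute from samples. The sketch below uses the quantile-function form $W_1 = \int_0^1 |F^{-1}(q) - G^{-1}(q)|\,dq$, approximated on a uniform grid of quantile levels; the function name and grid size are illustrative choices, and this is not code from the paper.

```python
import numpy as np

def wasserstein_1d(a, b, grid=1000):
    """Approximate W1 between two 1-D samples via their empirical quantiles."""
    q = (np.arange(grid) + 0.5) / grid           # midpoints of quantile bins
    return float(np.mean(np.abs(np.quantile(a, q) - np.quantile(b, q))))

rng = np.random.default_rng(0)
scores = rng.uniform(1.0, 4.0, size=500)         # hypothetical score sample
w_same = wasserstein_1d(scores, scores)
w_shift = wasserstein_1d(scores, scores + 0.5)   # pure half-point shift
```

Unlike a histogram-based KL divergence, this quantity needs no binning and is expressed in score units: a pure shift of half a point yields a distance of about 0.5.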

Conclusion

This paper presents an operationally grounded comparison of hierarchical Bayesian calibration and Neural-ODE normalizing-flow methods for debiasing LLM judges (2605.09227). The findings yield a simple, actionable selection rule: use Bayesian linear correction when anchors are limited, and Neural-ODE/FFJORD when the anchor budget is large. Both methods remove the population-mean bias to within tight tolerances, but only Neural-ODE scales to capture distributional and per-item non-linearities. These results directly inform toolchain design for scalable, credible automated evaluation workflows in LLM research and deployment.