LLM-as-Judge Evaluators
- The paper demonstrates that Balanced Accuracy and Youden's J uniquely capture prevalence-gap preservation, a property critical for reliable LLM judge selection.
- LLM-as-Judge evaluators use large language models to automatically assess model behaviors; choosing them well, especially under class imbalance, is essential for robust model-to-model comparisons.
- Empirical case studies and simulations show that BA/J outperform traditional metrics such as Accuracy and F1, yielding actionable protocols for real-world deployments.
LLMs are now widely employed as automated classifiers of model behaviors—ranging from safety violations to task performance rates—through paradigms collectively termed “LLM-as-Judge” (LaJ) evaluators. In prevalence estimation tasks that underpin core NLP benchmarks, deployability analyses, and policy setting, the statistical metric used to select and validate LLM-based judges is fundamental for credible model-to-model comparisons. The manuscript “Balanced Accuracy: The Right Metric for Evaluating LLM Judges—Explained through Youden’s J statistic” rigorously analyzes the landscape of prevalence metrics and demonstrates, both theoretically and empirically, that Balanced Accuracy (BA) and Youden’s J statistic (J) are uniquely appropriate for LaJ settings, especially under strong class imbalance. The following sections give a structured account of definitions, theoretical arguments, empirical evidence, practical protocols, and applied recommendations for the use of LaJ evaluators anchored in balanced accuracy (Collot et al., 8 Dec 2025).
1. Formal Metrics for Judge Evaluation
Let $N$ denote the number of labeled instances in a golden validation set, with $TP$, $FP$, $TN$, and $FN$ denoting counts of true positives, false positives, true negatives, and false negatives for a given candidate judge model. The principal classification metrics utilized in the literature are:
| Metric | Formula |
|---|---|
| Accuracy | $\frac{TP + TN}{TP + FP + TN + FN}$ |
| Precision | $\frac{TP}{TP + FP}$ |
| Recall (TPR) | $\frac{TP}{TP + FN}$ |
| Specificity (TNR) | $\frac{TN}{TN + FP}$ |
| F1 Score | $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ |
| Youden's $J$ | $\mathrm{TPR} + \mathrm{TNR} - 1$ |
| Balanced Accuracy | $\frac{\mathrm{TPR} + \mathrm{TNR}}{2}$ |
Balanced Accuracy and Youden's $J$ are linearly related via $\mathrm{BA} = \frac{J + 1}{2}$.
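These quantities follow directly from the confusion-matrix counts. A minimal sketch in Python, with illustrative counts rather than values from the paper:

```python
# Compute the judge-selection metrics above from raw confusion-matrix counts.
# The counts passed at the bottom are illustrative placeholders.

def judge_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics for one candidate judge on a golden validation set."""
    tpr = tp / (tp + fn)                      # recall / sensitivity
    tnr = tn / (tn + fp)                      # specificity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * tpr / (precision + tpr)
    j = tpr + tnr - 1                         # Youden's J
    ba = (tpr + tnr) / 2                      # Balanced Accuracy
    assert abs(ba - (j + 1) / 2) < 1e-12      # BA = (J + 1) / 2
    return {"TPR": tpr, "TNR": tnr, "Precision": precision,
            "Accuracy": accuracy, "F1": f1, "J": j, "BA": ba}

print(judge_metrics(tp=63, fp=140, tn=860, fn=20))
```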
2. Theoretical Justification: Why Balanced Accuracy and Youden's J
The core conceptual argument is that the goal of a judge in LaJ pipelines is to measure the true prevalence gap between models as faithfully as possible, independent of class balance. Let two models differ in true prevalence by $\Delta p$. A candidate judge with sensitivity (TPR) $s$ and false-positive rate $f = 1 - \mathrm{TNR}$ reports an apparent prevalence of $\hat{p} = s\,p + f\,(1 - p)$ for a model with true prevalence $p$, and hence an apparent prevalence difference of $\Delta\hat{p} = (s - f)\,\Delta p = J\,\Delta p$. Therefore $J$ (and, by linearity, Balanced Accuracy) exactly captures the "scaling slope" by which the judge propagates true prevalence gaps.
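A short numerical sketch of this scaling-slope relationship, using for concreteness the sensitivity and specificity of Judge A from Case Study 1 below (TPR=0.76, TNR=0.85, so FPR=0.15); the two model prevalences are illustrative assumptions:

```python
# The judge's reported prevalence is p_hat = TPR * p + FPR * (1 - p), so the
# reported gap between two models is the true gap scaled by J = TPR - FPR.
tpr, fpr = 0.76, 0.15                          # judge operating point (FPR = 1 - TNR)
j = tpr - fpr                                  # Youden's J = 0.61

def apparent_prevalence(p: float) -> float:
    return tpr * p + fpr * (1 - p)

p_model_a, p_model_b = 0.083, 0.203            # illustrative true prevalences
true_gap = p_model_b - p_model_a               # 0.12
reported_gap = apparent_prevalence(p_model_b) - apparent_prevalence(p_model_a)
print(reported_gap, j * true_gap)              # both 0.0732 = J * true_gap
```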
Critical properties:
- Prevalence independence: TPR and TNR are conditional on the true class and invariant to the marginal class distribution.
- Class symmetry: BA/J weight both classes equally, penalizing errors on the minority and majority classes identically.
In contrast,
- Precision degrades catastrophically when positives are rare.
- Accuracy can be trivially maximized by always predicting the majority class.
- F1 and its macro variant ignore true negatives and can be volatile under class imbalance.
No other common metric cleanly encodes the judge's ability to preserve true between-model prevalence gaps under imbalance.
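A hypothetical example makes the contrast concrete: at 1% prevalence, a degenerate judge that always predicts the majority (negative) class scores 0.99 Accuracy while being useless for prevalence estimation, which Balanced Accuracy and Youden's J expose immediately.

```python
# Hypothetical majority-class judge under strong class imbalance.
n, prevalence = 10_000, 0.01
positives = int(n * prevalence)      # 100 true positives
negatives = n - positives            # 9,900 true negatives

tp, fn = 0, positives                # the judge flags nothing as positive
tn, fp = negatives, 0

accuracy = (tp + tn) / n             # 0.99
tpr = tp / (tp + fn)                 # 0.0
tnr = tn / (tn + fp)                 # 1.0
ba = (tpr + tnr) / 2                 # 0.5 -> chance level
j = tpr + tnr - 1                    # 0.0 -> no prevalence signal propagated
print(accuracy, ba, j)
```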
3. Empirical Evidence: Case Studies and Simulation
Empirical studies, including two real-world violation detection tasks and a Monte Carlo simulation, show that BA/J reliably select the judge that best preserves prevalence gaps.
Case Study 1: Policy Violation Detection (8.3% prevalence)
- Judge A: Precision=0.32, Recall=0.76, Specificity=0.85, J=0.61, F1=0.45, Macro-F1=0.68, Accuracy=0.85, BA=0.81
- Judge B: Precision=0.41, Recall=0.57, Specificity=0.92, J=0.49, F1=0.47, Macro-F1=0.71, Accuracy=0.90, BA=0.75
Standard metrics (Accuracy, F1, Macro-F1) incorrectly select Judge B. Only BA/J rank Judge A—truth-faithful on the rare class—higher.
Case Study 2: 20% prevalence
Analogous pattern; only BA/J select the correct judge.
Simulation: 100,000 scenarios, 3 judges × 5 models
- Balanced Accuracy: success rate=0.752, mean rank-gap=0.033 (lowest)
- Macro-F1: success rate=0.707, mean rank-gap=0.049
- Accuracy: success rate=0.675, mean rank-gap=0.067
- F1: success rate=0.617, mean rank-gap=0.094
Selecting by Balanced Accuracy yields the highest probability of identifying the rank-faithful judge with minimal deviation when errant selections occur.
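The paper's exact simulation protocol is not reproduced here, but a simplified sketch in the same spirit—random judges and models, finite labeling samples, and hypothetical parameter ranges—illustrates why selecting by Balanced Accuracy tends to recover the rank-faithful judge more often than Accuracy or F1 when the golden set is imbalanced:

```python
import numpy as np

rng = np.random.default_rng(0)
N_SCENARIOS, N_JUDGES, N_MODELS = 2_000, 3, 5   # scaled down from the paper's 100,000
N_ITEMS = 500                                   # items each judge labels per model
GOLD_PREV = 0.10                                # prevalence of the (imbalanced) golden set

def pair_order_agreement(truth, est):
    """Fraction of model pairs ordered the same way by true and estimated prevalence."""
    hits = total = 0
    for a in range(len(truth)):
        for b in range(a + 1, len(truth)):
            total += 1
            hits += (truth[a] - truth[b]) * (est[a] - est[b]) > 0
    return hits / total

def golden_set_scores(tpr, tnr, prev):
    """Expected Accuracy, F1, and BA a judge attains on a golden set with prevalence `prev`."""
    tp, fn = prev * tpr, prev * (1 - tpr)
    tn, fp = (1 - prev) * tnr, (1 - prev) * (1 - tnr)
    return {"Accuracy": tp + tn,
            "F1": 2 * tp / (2 * tp + fp + fn),
            "BA": (tpr + tnr) / 2}

wins = {"Accuracy": 0, "F1": 0, "BA": 0}
for _ in range(N_SCENARIOS):
    prevs = rng.uniform(0.01, 0.30, N_MODELS)          # true model prevalences
    tprs = rng.uniform(0.5, 1.0, N_JUDGES)             # judges' sensitivities
    tnrs = rng.uniform(0.5, 1.0, N_JUDGES)             # judges' specificities
    faithfulness = []                                  # how well each judge ranks the models
    for j in range(N_JUDGES):
        est = []
        for p in prevs:
            pos = rng.binomial(N_ITEMS, p)             # true positives in the labeled sample
            flagged = (rng.binomial(pos, tprs[j])
                       + rng.binomial(N_ITEMS - pos, 1 - tnrs[j]))
            est.append(flagged / N_ITEMS)              # judge's estimated prevalence
        faithfulness.append(pair_order_agreement(prevs, est))
    rank_faithful_judge = int(np.argmax(faithfulness))
    for metric in wins:                                # which judge would each metric pick?
        scores = [golden_set_scores(tprs[j], tnrs[j], GOLD_PREV)[metric]
                  for j in range(N_JUDGES)]
        wins[metric] += int(np.argmax(scores)) == rank_faithful_judge

print({m: round(w / N_SCENARIOS, 3) for m, w in wins.items()})
```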
4. Practical Protocol for LaJ Evaluator Selection Using Balanced Accuracy
a. Golden-set construction: Collect a representative validation set (1,000–2,000 items) with expert/consensus labels. Class balance is ideal but not required.
b. Confusion matrix computation: For every candidate judge, compute TP, FP, TN, FN.
c. Balanced Accuracy calculation: For each candidate, compute $\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$.
d. Threshold selection (if outputs are continuous): Maximize Youden's $J$ on a hold-out or tuning set for optimal discrimination.
e. Selection: Rank judges by Balanced Accuracy. If the task is extremely recall-sensitive, consider inspecting TPR/FPR directly.
f. Multi-class tasks: Compute macro-averaged Balanced Accuracy, $\mathrm{BA}_{\text{macro}} = \frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TPR}_k + \mathrm{TNR}_k}{2}$, where $K$ is the number of classes and each class is scored one-vs-rest.
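A minimal sketch of steps b–f using scikit-learn; the judge names, synthetic golden set, and score distributions below are hypothetical placeholders for whatever an evaluation harness actually provides:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_curve

rng = np.random.default_rng(0)
n = 1_500                                              # golden-set size (step a)
y_true = (rng.random(n) < 0.10).astype(int)            # ~10% positive prevalence
judge_scores = {                                       # hypothetical continuous judge scores
    "judge_a": np.clip(0.35 * y_true + rng.normal(0.30, 0.15, n), 0, 1),
    "judge_b": np.clip(0.20 * y_true + rng.normal(0.30, 0.15, n), 0, 1),
}

results = {}
for name, scores in judge_scores.items():
    # Step d: choose the operating threshold that maximizes Youden's J = TPR - FPR
    # (ideally on a separate tuning split; the same set is reused here for brevity).
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    threshold = thresholds[np.argmax(tpr - fpr)]
    y_pred = (scores >= threshold).astype(int)
    # Steps b-c: for binary labels, balanced_accuracy_score equals (TPR + TNR) / 2.
    results[name] = balanced_accuracy_score(y_true, y_pred)

best_judge = max(results, key=results.get)             # step e: rank by Balanced Accuracy
print(results, best_judge)

# Step f: for multi-class tasks one can average per-class one-vs-rest balanced
# accuracies; scikit-learn's multi-class balanced_accuracy_score instead computes
# macro-averaged recall, a closely related quantity.
```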
5. Implications for Prevalence Estimation and Model Comparison
- Balanced Accuracy (or Youden's $J$) is the only scalar metric that guarantees prevalence-gap preservation, a required property for downstream model comparison or evaluation release gating.
- Reporting per-class TPR/TNR or a confusion matrix with BA aids transparency and trust.
- Relying exclusively on metrics such as Accuracy, F1, Precision, or Macro-F1 under class imbalance can cause systematic judge selection errors, leading to over- or under-estimation of actual prevalence and, correspondingly, flawed model rankings.
6. Recommendations for LLM-as-Judge (LaJ) Practice
- Adopt Balanced Accuracy (or Youden's $J$) as the primary metric for any LaJ system used for prevalence estimation, particularly where class distributions are skewed or shifting.
- Always publish the full confusion matrix or per-class recall rates for any reported judge’s evaluation.
- For continuous-scoring judges, tune operating thresholds to maximize $J$ on held-out validation sets (i.e., ROC curve optimization).
- For safety-critical or imbalanced settings, avoid defaulting to legacy metrics that may mask bias or dampen prevalence resolution.
- In large-scale deployments, ensure that the golden set is sufficiently large (on the order of 1,000–2,000 labels, as in the protocol above); marginal returns diminish beyond this point.
By grounding model selection in Balanced Accuracy, researchers and practitioners align judge choice with theoretically justified, empirically robust measures that guarantee faithful, prevalence-independent comparisons in LLM benchmarking and risk assessment pipelines (Collot et al., 8 Dec 2025).