
LLM-as-a-Judge Criteria Review

Updated 22 December 2025
  • LLM-as-a-Judge criteria are clearly defined guidelines covering API cost, reliability, flexibility, bias, and consistency for automated model evaluation.
  • The literature formalizes metrics and methodologies, including program-based judging and domain-specific adaptations, to ensure cost efficiency and robust performance.
  • It integrates comprehensive taxonomies and quantitative measures for benchmarking LLM outputs, informing best practices for reliable and unbiased automated judgment.

LLMs are increasingly deployed as automatic judges of model generations and natural language outputs. The design, benchmarking, and verification of “LLM-as-a-Judge” (LLMaaJ) systems have given rise to a technical literature that formalizes key criteria guiding their construction, evaluation, and deployment. The core evaluation axes are cost, reliability, flexibility, bias, and consistency of judgment. The following is a comprehensive review of the established LLM-as-a-Judge criteria, their mathematical formalizations, evaluation methodologies, and associated best practices.

1. Fundamental Criteria of LLM-as-a-Judge Systems

The canonical axes for analyzing and benchmarking LLM-as-a-Judge systems, as codified in “Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation,” are fivefold: API cost, reliability, pipeline flexibility, bias, and consistency (Huang et al., 12 Jun 2025).

| Criterion | Definition | Core Formula(s)/Metric |
| --- | --- | --- |
| API Cost | Aggregate monetary cost of LLM inference over $N$ pairwise judgments | $\text{Cost}_{\mathrm{LLM}} = N \times C_{\mathrm{LLM}}$ |
| Reliability | Adherence to the evaluation rubric; modeled via labeler ensemble weights | Reliability weights $\theta_i$ via weak supervision |
| Flexibility | Ease of rubric/code/metric modification without full LLM re-evaluation | Qualitative: dimensionally adaptive |
| Bias | Systematic skew due to position, gender, formatting, or reference cues | Consistency, biased-response win rate |
| Consistency | Stability of judgments under minimal perturbations (e.g., order swap) | $\mathrm{Consistency} = \frac{T-F}{T} \times 100\%$ |

API Cost: For N pairs and per-query price C₍LLM₎,

$$\text{Cost}_{\mathrm{LLM}} = N \times C_{\mathrm{LLM}}$$

In the program-based judge setting with $M$ programs (each at cost $C_{\mathrm{prog}}$), the cost falls to $M \times C_{\mathrm{prog}} \ll N \times C_{\mathrm{LLM}}$.

Reliability: Each code-based judge (program) acts as a noisy labeler $\lambda_i$, with preference $\bar\lambda_i \in \{+1, -1\}$. A weak supervision model learns reliability weights $\theta_i$ via the marginal likelihood over the ground-truth preference $Y$:

$$\Pr(\bar\lambda_1, \ldots, \bar\lambda_m \mid Y) \propto \exp\left(-\sum_{i=1}^{m} \theta_i \, \bar\lambda_i \, Y\right)$$
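
A minimal sketch of how this label model can be applied at inference time, assuming the reliability weights θᵢ have already been learned and that program votes are encoded as ±1 (the data layout and function name are illustrative, not taken from the cited work):

```python
import numpy as np

def preference_posterior(votes, theta):
    """Posterior P(Y = +1 | program votes) under the exponential label model above,
    Pr(lambda | Y) proportional to exp(-sum_i theta_i * lambda_i * Y), with a
    uniform prior over Y in {+1, -1}.

    votes : per-program pairwise preferences, each +1 or -1
    theta : learned reliability weights, one per program judge
    """
    votes = np.asarray(votes, dtype=float)
    theta = np.asarray(theta, dtype=float)
    unnorm_pos = np.exp(-np.sum(theta * votes * (+1.0)))  # likelihood under Y = +1
    unnorm_neg = np.exp(-np.sum(theta * votes * (-1.0)))  # likelihood under Y = -1
    return unnorm_pos / (unnorm_pos + unnorm_neg)
```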

Flexibility: Pipeline flexibility is operationalized through three mechanisms: criterion swap (editing code blocks for “Structure,” “Relevance,” etc.), threshold tuning (local numeric threshold/weight revision), and new metrics addition (e.g., code calls for embeddings).

Bias Metrics: Evaluated along sub-dimensions:

  • Consistency = fraction of judgments unchanged after targeted perturbation
  • Biased-response win rate = fraction of comparisons won by the artificially favored answer
  • Bias reduction: $$\mathrm{Bias\ Reduction} = \frac{\mathrm{WinRate}_{\mathrm{LLM}} - \mathrm{WinRate}_{\mathrm{PAJAMA}}}{\mathrm{WinRate}_{\mathrm{LLM}}} \times 100\%$$
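
The two win-rate quantities above reduce to simple counting. A brief sketch, assuming judgments are recorded as booleans indicating whether the artificially favored answer won (the data layout is an assumption for illustration):

```python
def biased_win_rate(judgments):
    """Fraction of pairwise comparisons won by the artificially favored answer.

    judgments: booleans, True when the judge picked the answer that carried the
    deliberate biasing cue (position, formatting, reference, etc.).
    """
    return sum(judgments) / len(judgments)


def bias_reduction(win_rate_llm, win_rate_program):
    """Relative reduction of the biased-response win rate, in percent, of a
    program-based judge (e.g., PAJAMA) versus the LLM judge."""
    return (win_rate_llm - win_rate_program) / win_rate_llm * 100.0
```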

Consistency:

$$\mathrm{Consistency} = \frac{T - F}{T} \times 100\%$$

where $T$ is the total number of judged cases and $F$ the number whose verdict changes under the perturbation (e.g., swapping candidate order).

$$\Delta_{\mathrm{Consistency}} = \frac{\mathrm{Consistency}_{\mathrm{PAJAMA}} - \mathrm{Consistency}_{\mathrm{LLM}}}{\mathrm{Consistency}_{\mathrm{LLM}}} \times 100\%$$

(Example: 48.36% (LLM baseline) to 64.19% (PAJAMA), a +15.83 pp gain.)
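
A minimal sketch of the swap-order consistency check and the relative gain, assuming a generic `judge(prompt, a, b)` callable that returns "A" or "B" (the interface is hypothetical):

```python
def consistency(judge, pairs):
    """Consistency = (T - F) / T * 100, where T is the number of judged pairs and
    F the number whose verdict flips when the two candidates are swapped.

    judge(prompt, a, b) is assumed to return "A" (prefers the first candidate)
    or "B" (prefers the second).
    """
    T, F = len(pairs), 0
    for prompt, a, b in pairs:
        original = judge(prompt, a, b)
        swapped = judge(prompt, b, a)
        # A stable judge prefers the same underlying answer in both presentations.
        if (original == "A") != (swapped == "B"):
            F += 1
    return (T - F) / T * 100.0


def delta_consistency(consistency_program, consistency_llm):
    """Relative consistency gain of the program judge over the LLM judge, in percent."""
    return (consistency_program - consistency_llm) / consistency_llm * 100.0
```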

2. Taxonomy of Judgment Attributes, Rubrics, and Metrics

The "From Generation to Judgment" taxonomy (Li et al., 25 Nov 2024) synthesizes six core judgment axes:

| Attribute | Definition | Example Metrics |
| --- | --- | --- |
| Helpfulness | Utility/informativeness | 1–5 or 0–1 scale, human accuracy, ρ |
| Safety | Avoidance of harmful/toxic content | Refusal rate, policy compliance, F1 |
| Reliability | Factual faithfulness, calibrated outputs | Factuality precision/recall, MSE |
| Relevance | Alignment to prompt/query | Top-k accuracy, pairwise win rate |
| Logic | Internal consistency, reasoning | Step-verification pass rate |
| Overall Quality | Holistic aggregation | Mean/median aspect scores, Cohen's κ |

Rubrics may be operationalized as:

  • Score-based: Assign Sᵢ ∈ ℝ (or Likert, binary)
  • Ranking-based: Linear/weak ordering on candidate outputs
  • Pairwise selection: Discriminative (win/tie/loss)

Composite scoring functions often take the form

$$\text{Score}(x, y) = \frac{1}{K} \sum_{i=1}^{K} R_i(x, y)$$

where $R_i(x, y)$ is the score along axis $i$.
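
A short sketch of this composite score, assuming one scoring callable per rubric axis on a shared scale (names and signatures are illustrative):

```python
from typing import Callable, Sequence


def composite_score(x: str, y: str,
                    axis_scorers: Sequence[Callable[[str, str], float]]) -> float:
    """Score(x, y) = (1/K) * sum_i R_i(x, y): average of K rubric-axis scores.

    x: the prompt/input; y: the candidate output.
    axis_scorers: one scorer per rubric axis (helpfulness, safety, ...), each
    returning a score on a shared scale (e.g., 0-1 or a 1-5 Likert value).
    """
    scores = [score_axis(x, y) for score_axis in axis_scorers]
    return sum(scores) / len(scores)
```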

3. Formalization of Bias, Consistency, and Robustness

Quantitative bias studies (CALM framework) (Ye et al., 3 Oct 2024) formalize 12 biases; primary metrics include:

  • Robustness Rate (RR):

$$\mathrm{RR} = \frac{1}{|D|} \sum_{i=1}^{|D|} \mathbb{I}(y_i = \hat{y}_i)$$

where $\hat{y}_i$ is the post-perturbation judgment.

  • Consistency Rate (CR):

$$\mathrm{CR} = \frac{1}{|D|} \sum_{i=1}^{|D|} \mathbb{I}(y_i = y_i^{\mathrm{rand}})$$

  • Self-Enhancement Error Rate:

$$\mathrm{ErrorRate}_{\mathrm{SE}} = \left| 1 - \frac{y_{\mathrm{self}}}{y_{\mathrm{other}}} \right|$$
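
These three CALM-style metrics are straightforward to compute once the original, perturbed, and randomized judgments have been collected. A brief sketch with assumed list inputs (the exact perturbation and randomization protocols follow the definitions above):

```python
def robustness_rate(original, perturbed):
    """RR: fraction of judgments unchanged after the targeted bias perturbation."""
    return sum(o == p for o, p in zip(original, perturbed)) / len(original)


def consistency_rate(original, randomized):
    """CR: fraction of judgments that agree with the corresponding y_i^rand
    obtained from the randomized re-evaluation."""
    return sum(o == r for o, r in zip(original, randomized)) / len(original)


def self_enhancement_error(score_self, score_other):
    """ErrorRate_SE = |1 - y_self / y_other|: relative deviation between the score
    involving the judge's own output (y_self) and the reference score (y_other),
    as defined above."""
    return abs(1.0 - score_self / score_other)
```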

Positional bias metrics (Shi et al., 12 Jun 2024):

  • Repetitional Consistency (RC):

$$\mathrm{RC} = \frac{1}{n} \sum_{j=1}^{n} \frac{\max\left(|c_1^j|, |c_2^j|\right)}{t_j}$$

  • Positional Consistency (PC):

$$\mathrm{PC} = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\left((c_{JO}^j, c_{JP}^j) \in V\right)$$

  • Positional Fairness (PF): Normalized recency/primacy bias

These metrics calibrate judge selection and prompt engineering (e.g., swap order, randomize presentation).
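
A compact sketch of RC and PC under these definitions, assuming pairwise choices are recorded as "A"/"B" strings and reading the valid set V as the order-swap-consistent choice pairs (both encodings are assumptions for illustration):

```python
from collections import Counter


def repetitional_consistency(runs_per_question):
    """RC: average over questions of max(|c1|, |c2|) / t_j, where c1 and c2 count how
    often each candidate was chosen across t_j repeated runs of the same prompt.

    runs_per_question: list of lists, each inner list holding the judge's choices
    ("A" or "B") across repeated, unchanged runs of one question.
    """
    total = 0.0
    for runs in runs_per_question:
        counts = Counter(runs)
        total += max(counts.get("A", 0), counts.get("B", 0)) / len(runs)
    return total / len(runs_per_question)


def positional_consistency(original_choices, swapped_choices):
    """PC: fraction of questions whose (original-order, swapped-order) choice pair is
    valid, taken here to mean the judge prefers the same underlying answer under both
    presentation orders."""
    valid = sum((o == "A") == (s == "B")
                for o, s in zip(original_choices, swapped_choices))
    return valid / len(original_choices)
```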

4. Complex Rubric Construction and Domain Adaptation

Tasks with unique industry constraints (e.g., legal or mathematical reasoning) entail nuanced rubric design.

Legal LLM-as-a-Judge (LeMAJ) (Enguehard et al., 8 Oct 2025):

  • Atomic unit: Legal Data Point (LDP) = (assertion_text, citation_reference)
  • Tagging: <Correct>, <Incorrect>, <Irrelevant>, <Missing>
  • Metrics:

$$\begin{aligned} \mathrm{Correctness} &= \frac{N_C}{N_C + N_I} \\ \mathrm{Precision} &= \frac{N_C}{N_C + N_R} \\ \mathrm{Recall} &= \frac{N_C}{N_C + N_M} \\ F_1 &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \end{aligned}$$

LeMAJ’s procedure reflects a decompositional approach, segmenting legal responses, tagging LDPs, and aggregating scores to closely match human expert review.
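
A small sketch of the LeMAJ-style aggregation, assuming N_C, N_I, N_R, and N_M correspond to counts of <Correct>, <Incorrect>, <Irrelevant>, and <Missing> LDP tags (this mapping and the dict-based interface are assumptions):

```python
from collections import Counter


def lemaj_scores(ldp_tags):
    """Aggregate per-LDP tags into the correctness/precision/recall/F1 scores above.

    ldp_tags: iterable of tags, each one of "Correct", "Incorrect", "Irrelevant",
    or "Missing". N_C, N_I, N_R, N_M are taken to be the corresponding counts
    (an assumed mapping; notation as in the formulas above).
    """
    counts = Counter(ldp_tags)
    n_c, n_i = counts["Correct"], counts["Incorrect"]
    n_r, n_m = counts["Irrelevant"], counts["Missing"]
    correctness = n_c / (n_c + n_i) if (n_c + n_i) else 0.0
    precision = n_c / (n_c + n_r) if (n_c + n_r) else 0.0
    recall = n_c / (n_c + n_m) if (n_c + n_m) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"correctness": correctness, "precision": precision,
            "recall": recall, "f1": f1}
```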

Mathematical Reasoning (EFG, Epistemic Ensemble) (Zhang et al., 12 Jun 2025):

  • Four core scores: Logical Preservation (LP), Mathematical Consistency (MC), Formal Validity (FV), Formal Quality (FQ).
  • OAP (Operable Atomic Properties): Each dimension is an average over true/false binary signals for subproperties, e.g., for LP:

$$S_{\mathrm{LP}}(s, \phi) = \frac{1}{|\mathcal{J}_{\mathrm{LP}}|} \sum_{j \in \mathcal{J}_{\mathrm{LP}}} \mathrm{OAP}_j(s, \phi)$$

  • Final aggregation: Weighted linear combination of dimension scores, weights fit via constrained regression to human labels.
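
A minimal sketch of this two-stage aggregation: averaging binary OAP checks within a dimension, then linearly combining the dimension scores (the weights shown are placeholders, not the fitted values):

```python
import numpy as np


def dimension_score(oap_results):
    """Average of the binary OAP checks for one dimension (e.g., Logical Preservation)."""
    return float(np.mean(oap_results))


def efg_aggregate(dimension_scores, weights):
    """Weighted linear combination of the four dimension scores (LP, MC, FV, FQ).

    In the cited approach the weights are fit to human labels via constrained
    regression; the values passed in here are placeholders for illustration.
    """
    return float(np.dot(np.asarray(weights, dtype=float),
                        np.asarray(dimension_scores, dtype=float)))


# Hypothetical example: LP from four OAP checks, plus assumed MC, FV, FQ scores.
s_lp = dimension_score([True, True, False, True])              # 0.75
final = efg_aggregate([s_lp, 0.9, 1.0, 0.6], [0.4, 0.3, 0.2, 0.1])
```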

5. Judgment Reliability: Consistency, Local/Global Coherence

The Sage evaluation suite (Feng et al., 17 Dec 2025) establishes two axiom-driven, annotation-free robustness metrics:

$$\mathrm{IPI}(Q) = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \mathbb{I}(y_{ij} \neq -y_{ji})$$

$$\mathrm{TOV}(Q) = \min_{O \in \mathcal{O}_n} \sum_{i \neq j} \mathbb{I}(y_{ij} \neq p_{ij})$$

IPI and TOV, motivated by rational-choice axioms (completeness, asymmetry, transitivity), provide rigorous consistency diagnostics.
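
A brute-force sketch of both diagnostics, assuming directed judgments are stored in an n × n matrix of ±1 and reading p_ij as the pairwise preference implied by the candidate total order O (both encodings are assumptions; the permutation search is only practical for small n):

```python
from itertools import combinations, permutations


def ipi(y):
    """IPI: fraction of unordered pairs (i, j) whose two directed judgments are not
    antisymmetric, i.e., y[i][j] != -y[j][i] (see the formula above).

    y: n x n matrix with entries in {+1, -1}; y[i][j] encodes the judge's preference
    of item i over item j (diagonal entries are ignored).
    """
    n = len(y)
    violations = sum(y[i][j] != -y[j][i] for i, j in combinations(range(n), 2))
    return violations / (n * (n - 1) / 2)


def tov(y):
    """TOV: minimum, over all total orders of the n items, of the number of directed
    judgments y[i][j] that contradict the preference p_ij implied by that order.
    Brute force over permutations; only practical for small n."""
    n = len(y)
    best = None
    for order in permutations(range(n)):
        rank = {item: pos for pos, item in enumerate(order)}
        violations = sum(y[i][j] != (+1 if rank[i] < rank[j] else -1)
                         for i in range(n) for j in range(n) if i != j)
        best = violations if best is None else min(best, violations)
    return best
```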

6. Program-Based Judging and Pipeline Adaptation

PAJAMA (Huang et al., 12 Jun 2025) pioneers a program-synthesis approach: LLMs synthesize scoring programs implementing user rubrics, which are then locally executed.

  • Cost Reduction: $0.053 (PAJAMA) vs. $133–$184 (LLM-as-judge) per 60K examples
  • Consistency Gains: Mean from 48.36% (LLM) to 64.19% (+15.83 pp)
  • Bias Reduction: Up to −23.7% on key axes
  • Pipeline Flexibility: Users alter code/rubric and re-run without costly LLM inference

Practical advice: for N > 10K evaluations or >2 rubric iterations, migrate to program-based judging; monitor low-weighted judges; encode explicit anti-bias logic in code; aim for aggregate consistency >60%.
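
For concreteness, a minimal sketch of what a locally executed, PAJAMA-style scoring program might look like; the criteria, thresholds, and heuristics below are illustrative placeholders rather than PAJAMA's actual synthesized rubric, but they show why criterion swaps and threshold tuning require no further LLM calls:

```python
# The rubric lives in plain code, so criterion swaps and threshold tuning are
# simple edits followed by a free local re-run (no LLM inference).

RUBRIC_WEIGHTS = {"structure": 0.3, "relevance": 0.5, "length": 0.2}
MIN_WORDS = 30  # threshold tuning: edit this constant and re-run locally


def score_structure(answer: str) -> float:
    # Reward simple structural cues such as paragraph breaks or bullet points.
    return 1.0 if ("\n\n" in answer or "- " in answer) else 0.5


def score_relevance(question: str, answer: str) -> float:
    # Crude lexical-overlap proxy; a real program might call an embedding model here.
    q_terms = set(question.lower().split())
    a_terms = set(answer.lower().split())
    return len(q_terms & a_terms) / max(len(q_terms), 1)


def score_length(answer: str) -> float:
    return 1.0 if len(answer.split()) >= MIN_WORDS else 0.0


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise verdict from the weighted rubric above: 'A', 'B', or 'tie'."""
    def total(ans):
        return (RUBRIC_WEIGHTS["structure"] * score_structure(ans)
                + RUBRIC_WEIGHTS["relevance"] * score_relevance(question, ans)
                + RUBRIC_WEIGHTS["length"] * score_length(ans))
    a, b = total(answer_a), total(answer_b)
    return "A" if a > b else "B" if b > a else "tie"
```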

7. Best Practices and Deployment Recommendations

Across studies, convergent practice guidelines are established:

  • Use explicit, task-specific rubrics that, at minimum, describe the extreme grades
  • Employ non-deterministic (stochastic) inference and aggregate by mean
  • Deploy swap/randomize order in pairwise prompts to detect and reduce position bias
  • Evaluate with both human-alignment (e.g., Pearson/Spearman ρ) and stability (e.g., Krippendorff’s α, IPI/TOV)
  • For high-stakes evaluation (e.g., legal, regulatory), require human-in-the-loop and restrict LLM judges to low-risk subtasks unless formal audit and calibration protocols are in place (Karp et al., 6 Nov 2025).
  • Panel-based (multi-agent) meta-judges and deep reasoning chains further improve reliability and consistency (Li et al., 23 Apr 2025).
  • For ambiguous or domain-ambivalent scenarios, use multi-label or distributional aggregation to avoid ground-truth collapse (Guerdan et al., 7 Mar 2025).


These criteria and methodological principles form the state-of-the-art foundation for deploying, calibrating, and benchmarking LLM-as-a-Judge systems, ensuring cost efficiency, reliability, and alignment with desired evaluation standards across domains.
