LLM-Judge First Pass Evaluation
- LLM-Judge First Pass is an automated evaluation paradigm that uses large language models guided by explicit, human-refined rubrics to assess candidate outputs.
- It employs a two-stage workflow—draft rubric generation followed by artifact evaluation—to enhance alignment with expert judgments and boost reliability metrics.
- Empirical benchmarks show high recall, competitive precision, and significant cost savings, while also highlighting areas for improvement in rubric design and bias mitigation.
LLMs are increasingly utilized as automated evaluators—“LLM-as-a-Judge”—to scale the assessment of generated artifacts in domains such as software engineering, information retrieval, and compiler testing. The “LLM-Judge First Pass” paradigm designates the initial automated judgment step in which an LLM system, guided by explicit criteria or rubrics, renders a validity label or scalar score for a candidate output. This approach seeks to combine interpretability, agreement with expert consensus, cost-effectiveness, and rapid turnaround compared to purely manual annotation. Central to its deployment are human-in-the-loop rubric creation and prompt engineering strategies, which bolster reliability and mitigate common sources of error or bias. Empirical studies across multiple tasks demonstrate quantifiable gains in alignment with human raters and highlight actionable limitations and future research directions.
1. Foundational Framework and Workflow
The canonical LLM-Judge First Pass system comprises a two-stage workflow: rubric generation and patch (or artifact) evaluation. For automated program repair (APR), as addressed in "Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge" (Shi et al., 14 Nov 2025), an LLM first creates a draft rubric per bug, following a structured template covering root cause, requirements, and examples. Two human experts then refine this draft: they correct diagnoses, generalize requirements, ensure completeness, and scope the rubric, following the paper's editing protocol.
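The rubric template described above can be sketched as a small data structure. This is a minimal illustration; the field names and rendering format are invented for this sketch and are not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """Structured rubric following the template shape in Shi et al.:
    root cause, requirements, and examples. Names are illustrative."""
    bug_id: str
    root_cause: str                                          # expert-corrected diagnosis
    requirements: list[str] = field(default_factory=list)    # generalized "must hold" criteria
    positive_examples: list[str] = field(default_factory=list)
    negative_examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the rubric into the prompt section shown to the judge."""
        lines = [f"Root cause: {self.root_cause}", "Requirements:"]
        lines += [f"- {r}" for r in self.requirements]
        if self.positive_examples:
            lines.append("Valid-patch examples:")
            lines += [f"- {e}" for e in self.positive_examples]
        if self.negative_examples:
            lines.append("Invalid-patch examples:")
            lines += [f"- {e}" for e in self.negative_examples]
        return "\n".join(lines)
```

Keeping the rubric as structured fields rather than free text makes the human-refinement step auditable: each expert edit touches a named field, which supports the per-edit justification protocol.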
Once a golden rubric is finalized, each candidate patch is presented to the LLM judge with full bug context and unified diff. The LLM judge then outputs a binary label (VALID/INVALID), a chain-of-thought (CoT) reasoning trace, and a concise justification. The process is summarized by:
- Input: bug description $b$, candidate patch $p$, golden rubric $r$
- Prompt: reviewer instructions, chain-of-thought scaffold, binary labeling decision, justification
- Output: label $\ell \in \{\text{VALID}, \text{INVALID}\}$, CoT trace, justification
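A minimal sketch of the judging step, assuming a hypothetical `JUDGE_PROMPT` template and a plain-text `LABEL:` response convention (both invented here; the paper's exact prompt and output format differ):

```python
import re

# Hypothetical prompt template: bug context, golden rubric, and unified diff,
# with an instruction to reason step by step before committing to a label.
JUDGE_PROMPT = """You are a code reviewer. Using the rubric below, decide whether
the patch is a valid fix. Think step by step, then end with a line
"LABEL: VALID" or "LABEL: INVALID" followed by a one-sentence justification.

Bug description:
{bug}

Golden rubric:
{rubric}

Candidate patch (unified diff):
{patch}
"""

def parse_judgment(response: str) -> tuple[str, str]:
    """Extract (binary label, justification) from the judge's free-text reply."""
    m = re.search(r"LABEL:\s*(VALID|INVALID)\s*(.*)", response, re.DOTALL)
    if not m:
        raise ValueError("judge response missing LABEL line")
    return m.group(1), m.group(2).strip()
```

The chain-of-thought trace precedes the `LABEL:` line, so the parser anchors on the label rather than on the reasoning, which varies across runs.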
Analogous approaches extend to other domains, e.g., IR relevance judgments ("LLMJudge: LLMs for Relevance Judgments" (Rahmani et al., 9 Aug 2024)), in which LLMs score document-query pairs on a four-point relevance scale using zero-shot or few-shot chain-of-thought prompting and compare system rankings against human gold standards.
2. Rubric Design, Human Refinement, and Prompt Engineering
High reliability depends critically on rubric quality and prompt design. Empirical ablations in (Shi et al., 14 Nov 2025) show that using draft rubrics with no human edits drops Cohen's kappa from $0.57$ to $0.38$; abandoning the template structure drops it further to $0.29$. Only golden, human-refined rubrics yield substantial agreement with human consensus ($\kappa = 0.75$, precision $= 0.80$, recall $= 0.94$ on the clear subset). The refinement process includes an explicit justification for every edit and cross-expert validation.
Prompt design best practices include:
- Structured instructions separating reasoning steps and labeling decisions
- Few-shot or chain-of-thought exemplars to scaffold LLM reasoning (per (Rahmani et al., 9 Aug 2024), few-shot + CoT maximizes human alignment and system-level Kendall's $\tau$ versus zero-shot prompting)
- Explicit rubrics covering minimal failure mechanisms, technical requirements, negative examples, and constraints (e.g., "do not modify test code," "use in-class initializer")
These strategies generalize to other domains (IR, Bash code validation, compiler testing), where agent-based prompts inject external evidence and tool outputs to contextualize judgments, boosting test- and error-detection rates (Sollenberger et al., 21 Aug 2024, Vo et al., 12 Jun 2025).
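As an illustration of few-shot chain-of-thought prompting for graded relevance, the sketch below builds a prompt on a four-point scale. The exemplar, scale wording, and layout are invented for this sketch; LLMJudge's actual prompts differ:

```python
# Illustrative four-point relevance scale (labels are this sketch's wording).
RELEVANCE_SCALE = {0: "irrelevant", 1: "related", 2: "highly relevant", 3: "perfectly relevant"}

# One invented few-shot exemplar: (query, document, gold score, CoT rationale).
FEW_SHOT = [
    ("best pizza dough recipe",
     "A step-by-step Neapolitan dough guide with hydration ratios.",
     3,
     "The document directly and completely answers the query."),
]

def build_relevance_prompt(query: str, doc: str) -> str:
    """Assemble a few-shot + CoT relevance prompt ending at 'Reasoning:',
    so the model produces its rationale before emitting a score."""
    parts = ["Grade the document's relevance to the query on a 0-3 scale:",
             *(f"  {k}: {v}" for k, v in sorted(RELEVANCE_SCALE.items()))]
    for q, d, score, reason in FEW_SHOT:
        parts += [f"Query: {q}", f"Document: {d}",
                  f"Reasoning: {reason}", f"Score: {score}"]
    parts += [f"Query: {query}", f"Document: {doc}", "Reasoning:"]
    return "\n".join(parts)
```

Ending the prompt at "Reasoning:" is what scaffolds the chain of thought: the exemplar demonstrates the reason-then-score pattern the model is expected to follow.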
3. Key Evaluation Metrics and Quantitative Benchmarks
LLM-judge performance is assessed via inter-rater agreement statistics, classification metrics, and system-level ranking fidelity. Notable measures:
- Cohen's kappa: $\kappa = \frac{p_o - p_e}{1 - p_e}$ ($p_o$ = observed agreement, $p_e$ = expected chance agreement)
- Precision: $P = \frac{TP}{TP + FP}$
- Recall: $R = \frac{TP}{TP + FN}$
- Accuracy: $\frac{TP + TN}{TP + FP + FN + TN}$
- Negative predictive value: $NPV = \frac{TN}{TN + FN}$
- F1-score: $F_1 = \frac{2PR}{P + R}$
- Normalized edit distance for rubric refinement workflows: $NED(a, b) = \frac{\mathrm{lev}(a, b)}{\max(|a|, |b|)}$
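These metrics follow directly from the binary confusion counts; a compact reference implementation of the standard definitions (not tied to any one paper's code):

```python
def judge_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Agreement and classification metrics from binary confusion counts."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                              # observed agreement
    # Expected chance agreement from the marginals (binary case).
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "kappa": (p_o - p_e) / (1 - p_e),
        "precision": precision,
        "recall": recall,
        "accuracy": p_o,
        "npv": tn / (tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }

def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by max length, in [0, 1]; used here to
    quantify how much a draft rubric was changed during refinement."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))                   # DP row for empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)  # substitution / match
                           ))
        prev = cur
    return prev[-1] / max(len(a), len(b))
```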
In (Shi et al., 14 Nov 2025), the LLM-judge achieved:
| Benchmark | Cohen’s $\kappa$ | Precision | Recall | Accuracy | F1-score |
|---|---|---|---|---|---|
| B_full (115) | 0.57 | 0.65 | 0.93 | 0.78 | 0.77 |
| B_clear (81) | 0.75 | 0.80 | 0.94 | 0.87 | 0.87 |
Similarly, IR tasks report system-level Kendall's $\tau$ for top models, nDCG@10, and detailed confusion matrices (Rahmani et al., 9 Aug 2024). Output-based LLM judges in SE tasks produce Pearson correlations for translation judgments that far surpass conventional metrics (ChrF++ $= 0.34$).
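System-ranking fidelity can be checked with a small Kendall's $\tau$ computation over two system orderings (the tau-a variant, ignoring ties, shown here for illustration; published results typically use a tie-aware variant):

```python
from itertools import combinations

def kendall_tau(x: list, y: list) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs.
    x and y are per-system scores or ranks under two evaluators."""
    assert len(x) == len(y)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

A $\tau$ near $1$ means the LLM-judged leaderboard orders systems almost identically to the human-judged one, which is the property IR studies care about even when per-document labels disagree.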
4. Failure Modes, Limitations, and Biases
Despite strong aggregate metrics, first-pass LLM-Judge deployments expose persistent error modes:
- Overlooked rubric requirements (e.g., scope constraints not enforced)
- Acceptance of patches with unnecessary, unrelated changes
- Misjudgment due to template ambiguity or underspecified criteria
- Binary labeling (VALID/INVALID) may oversimplify nuanced artifact evaluation
- Position and style bias: judges favor output order or familiar formats (Jiang et al., 14 Jul 2025)
- Domain and concept drift: evaluations limited to, e.g., sanitizer bugs in monorepos, with uncertain transfer across codebases, domains, or languages
Bias-mitigation techniques involve prompt shuffling (swap order, symmetry enforcement), template diversity, explicit rejection policies for unrelated changes, and multi-judge aggregation protocols.
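Order-swap debiasing with majority aggregation can be sketched as follows; `judge` is a hypothetical pairwise callable returning "FIRST" or "SECOND", and the vote count and protocol are illustrative:

```python
import random
from collections import Counter

def debiased_pairwise_judgment(judge, a: str, b: str, n_votes: int = 3, seed: int = 0) -> str:
    """Query a pairwise judge several times with randomized presentation order
    and aggregate by majority vote, so a position-biased judge's preference
    for the first slot cancels out across votes. Returns "A" or "B"."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_votes):
        if rng.random() < 0.5:
            # Present (a, b): "FIRST" means candidate A won.
            votes.append("A" if judge(a, b) == "FIRST" else "B")
        else:
            # Present (b, a): "FIRST" now means candidate B won.
            votes.append("B" if judge(b, a) == "FIRST" else "A")
    winner, _ = Counter(votes).most_common(1)[0]
    return winner
```

A judge with a genuine preference returns the same winner under both orders, while a purely position-biased judge degrades to a coin flip instead of a systematic error.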
5. Extensions, Generalization, and Active Learning
Future directions outlined in (Shi et al., 14 Nov 2025) and related studies include:
- Automating "minimal change" criteria (modularity, impact minimization)
- Rubric extension to non-functional attributes (security, performance, maintainability)
- Active learning frameworks where LLM judges signal low-confidence judgments for human oversight and rubric refinement
- Integration with test-based, output-based, and reference-less metrics (BFM, LRV) for code artifacts (Vo et al., 12 Jun 2025)
- Ensembles of multiple judge strategies or models (dynamic team selection in SE-Jury (Zhou et al., 27 May 2025)) to increase robustness and reduce single-point bias
- Quantitative judge calibration via GLMs for score alignment with human reference (Sahoo et al., 3 Jun 2025)
- Scaling to broader artifact types (multimodal, essay, design documentation) and supporting composite evaluation workflows
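The active-learning idea above, where the judge escalates uncertain cases, can be approximated with a simple self-consistency gate. The threshold and protocol are illustrative, not from the cited papers:

```python
from collections import Counter

def flag_for_human_review(labels: list[str], threshold: float = 0.8):
    """Self-consistency gate: given k labels from repeated judge samples,
    return (majority label, agreement fraction, escalate?). Cases whose
    agreement falls below `threshold` are routed to a human expert."""
    majority_label, count = Counter(labels).most_common(1)[0]
    confidence = count / len(labels)
    return majority_label, confidence, confidence < threshold
```

Escalated cases are exactly the ones most likely to expose rubric gaps, so the human labels they attract can feed back into rubric refinement.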
6. Representative Case Studies and Practical Applications
LLM-Judge First Pass systems have demonstrated practical viability in:
- Large-scale APR patch triage with consensus-level agreement, facilitating high-throughput invalid patch filtering and candidate prioritization (Shi et al., 14 Nov 2025)
- Automated test and verification of compiler implementations via coordinated prompt/evidence injection and three-stage pipelines (Sollenberger et al., 21 Aug 2024)
- Rapid construction of relevance-judged IR corpora for benchmarking, with nearly equivalent system ranking order to human labels at substantial cost savings (Rahmani et al., 9 Aug 2024)
- Reference-less evaluation of generated code in IT automation workflows, yielding higher refinement accuracy through agentic reflection (Vo et al., 12 Jun 2025)
Benchmarks such as CodeJudgeBench (Jiang et al., 14 Jul 2025) and SE-Jury (Zhou et al., 27 May 2025) further formalize cross-model comparisons and dynamic judge ensembling for robustness and reliability.
7. Summary and Forward Outlook
LLM-Judge First Pass frameworks, when scaffolded by carefully constructed and human-refined rubrics, achieve substantial agreement with expert consensus (Cohen’s $\kappa$ up to $0.75$), high recall (up to $0.94$), and competitive precision (up to $0.80$) in complex artifact evaluation tasks. These systems offer scalable, interpretable, and cost-efficient surrogate evaluation—but remain bounded by the quality of rubrics, domain adaptation, and persistent bias/failure modes. Ongoing advances in rubric automation, ensemble judging, active learning, and cross-domain calibration, informed by rigorous empirical benchmarks and meta-evaluation protocols, are essential to realize robust, generalizable first-pass LLM judges across scientific, engineering, and industrial applications (Shi et al., 14 Nov 2025, Rahmani et al., 9 Aug 2024, Sollenberger et al., 21 Aug 2024, Zhou et al., 27 May 2025).