LLM-as-Judge Stage: Automated Legal Evaluation
- The LLM-as-Judge stage is a paradigm that uses large language models to automatically evaluate legal responses against detailed rubric criteria.
- It integrates data ingestion, hybrid retrieval, context-based answer generation, and JSON-based scoring to assess legal submissions.
- Empirical results show that LLM judges may overestimate quality by emphasizing formatting and style over substantive legal accuracy.
The "LLM-as-Judge" stage refers to the use of LLMs to automatically evaluate outputs generated by candidate models, functioning as fully automated surrogates for human judges. This paradigm is operationalized by prompting an LLM with a rubric, grading criteria, and the candidate response, and then extracting structured scores, often in the form of point assignments along multiple evaluation dimensions. In high-stakes domains such as law, the LLM-as-Judge stage aims to replicate human evaluators in terms of scoring, feedback, and critique granularity, but its reliability, alignment, and susceptibility to critical errors are active centers of investigation.
1. Pipeline Architecture and Evaluation Workflow
In the legal qualification context studied by Karp et al., the LLM-as-Judge stage is embedded within a multistage pipeline for automated exam grading (Karp et al., 6 Nov 2025):
- Data Ingestion and Preprocessing: Three distinct legal textual corpora—full judicial decisions, analytical summaries, and statutory commentary—are ingested, deduplicated, chunked into 256-token units, indexed (full-text and dense vector), and time-stamped.
- Hybrid Retrieval: Queries are vectorized using a multilingual legal embedding model (sdadas/mmlw-retrieval-roberta-large). Retrieval uses Typesense for both keyword and vector search, over-retrieving candidates (5× the final top-k), followed by pruning, deduplication, and reranking by weighted cosine similarity and keyword overlap.
- Context Building and Answer Generation: Depending on experimental variant (closed-book, basic RAG, advanced RAG), models such as GPT-4.1, Claude 4 Sonnet, or Bielik-11B-v2.6-Instruct generate exam answers, optionally provided with up to 10 retrieved context chunks.
- LLM-as-Judge Stage: The exam submission (LLM-generated answer) is normalized as Markdown and passed, along with the official grading rubric and a system prompt, to a “judge” LLM (GPT-4o). The system prompt sets an expert adjudicator persona and enumerates eight grading criteria with explicit point maxima (see Table below).
| Criterion | Points (Max) |
|---|---|
| Judgment Construction | 20 |
| Legal Provisions | 30 |
| Factual Analysis | 30 |
| Formulation Style (Sentence) | 10 |
| Legal Language | 5 |
| Vocabulary | 2 |
| Clarity & Conciseness | 2 |
| Formulation Style (General) | 1 |
Judgment is returned as a strict JSON object, mapping each criterion to an integer score and a list of detected issues. The n8n orchestration platform mediates prompt delivery and JSON parsing, storing scores for batch analysis. No other symbolic rules, rule-checking, or meta-LLM processes are layered atop this core judge.
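To make the returned structure concrete, the sketch below shows one way the eight-criterion rubric and the judge's strict-JSON verdict could be validated before storage. The criterion keys and response fields are illustrative assumptions, not the exact schema used in the paper's n8n workflow.

```python
# Minimal sketch: validate the judge's strict-JSON verdict against the rubric maxima.
# Criterion names and the response structure are assumed, not taken from the paper.
import json

RUBRIC_MAXIMA = {
    "judgment_construction": 20,
    "legal_provisions": 30,
    "factual_analysis": 30,
    "formulation_style_sentence": 10,
    "legal_language": 5,
    "vocabulary": 2,
    "clarity_conciseness": 2,
    "formulation_style_general": 1,
}  # totals 100 points

def parse_verdict(raw_json: str) -> dict:
    """Parse the judge's JSON and validate each criterion score against its maximum."""
    verdict = json.loads(raw_json)
    scores = {}
    for criterion, maximum in RUBRIC_MAXIMA.items():
        entry = verdict[criterion]                 # e.g. {"score": 18, "issues": [...]}
        score = int(entry["score"])
        if not 0 <= score <= maximum:
            raise ValueError(f"{criterion}: score {score} outside 0..{maximum}")
        scores[criterion] = {"score": score, "issues": list(entry.get("issues", []))}
    scores["total"] = sum(v["score"] for v in scores.values())
    return scores
```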
2. Judging Methodology: Prompt Construction and Execution
The LLM-judge prompt comprises:
- Persona definition ("expert legal analyst and senior adjudicator"),
- Explicit criterion block (eight criteria above, with explanations and point maxima),
- Instruction for output format (JSON only, no extraneous text),
- Exam answer (in Markdown format, as produced by the answer-generating LLM).
Scoring hyperparameters are: default temperature (≈0 for determinism), max_tokens set high enough to cover all criteria, and no context beyond the rubric and the student answer.
Each submission is processed as:
- Send system/rubric prompt + student answer to GPT-4o.
- Receive JSON object.
- Extract each criterion score and issues.
- Append to results sheet (e.g., Google Sheet).
No additional automated grading rules, post-hoc analysis, or auxiliary model checks are performed inside the judge.
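As a concrete illustration of this loop, the following sketch assumes the OpenAI Python SDK and a local CSV in place of the paper's n8n/Google Sheets orchestration; the prompt wording and helper names are paraphrased, not the authors' exact implementation.

```python
# Sketch of the per-submission judging loop, assuming the OpenAI Python SDK.
# The system prompt is paraphrased; a local CSV stands in for the Google Sheet.
import csv
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an expert legal analyst and senior adjudicator. Grade the exam answer "
    "against the eight criteria and point maxima below. Respond with JSON only."
)

def judge_submission(rubric_block: str, answer_markdown: str) -> dict:
    """Send the rubric and exam answer to GPT-4o and return the parsed verdict."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,                              # deterministic scoring
        response_format={"type": "json_object"},    # enforce strict JSON output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + rubric_block},
            {"role": "user", "content": answer_markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)

def append_result(path: str, submission_id: str, verdict: dict) -> None:
    """Append one graded submission as a row in a local results sheet (CSV)."""
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        row = [submission_id] + [verdict[c]["score"] for c in sorted(verdict)]
        writer.writerow(row)
```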
3. Scoring Metrics, Evaluation Formulas, and Comparative Analysis
While GPT-4o produces per-criterion points, evaluation focuses on the signed point difference between the LLM judge's score and the human examiner's score for each model:

$$\Delta = \text{Score}_{\mathrm{LLM}} - \text{Score}_{\mathrm{Human}}$$

This $\Delta$ is reported separately for substantive (Legal Provisions + Factual Analysis), stylistic (all other criteria), and total scores. No formal statistical tests (e.g., t-tests, Wilcoxon) are applied to the grading discrepancies; magnitude and direction alone are considered.
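A minimal sketch of this discrepancy computation, assuming per-criterion score dictionaries keyed by illustrative criterion names:

```python
# Illustrative computation of the signed discrepancy (LLM score - human score),
# split into substantive, stylistic, and total components. Criterion keys are assumed.
SUBSTANTIVE = {"legal_provisions", "factual_analysis"}

def score_deltas(llm_scores: dict, human_scores: dict) -> dict:
    """Per-category signed differences between LLM-judge and human grading."""
    deltas = {c: llm_scores[c] - human_scores[c] for c in human_scores}
    return {
        "substantive": sum(d for c, d in deltas.items() if c in SUBSTANTIVE),
        "stylistic": sum(d for c, d in deltas.items() if c not in SUBSTANTIVE),
        "total": sum(deltas.values()),
    }
```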
In additional tasks (information extraction, RQ3), standard classification metrics are reported for context:
- Accuracy: $\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
- Precision: $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
- Recall: $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
- F-score: $F_1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
- Cohen's $\kappa$: $\kappa = \dfrac{p_o - p_e}{1 - p_e}$, with $p_o$ the observed agreement and $p_e$ the agreement expected by chance.
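These metrics concern the auxiliary information-extraction task rather than grading; for reference, a minimal sketch using scikit-learn's standard implementations on a toy example (not the paper's data):

```python
# Toy example of the RQ3 classification metrics using scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

y_true = [1, 0, 1, 1, 0, 1]   # gold labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (toy data)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
```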
4. Experimental Configuration and Grading Results
- Student models: GPT-4.1, Claude 4 Sonnet, Bielik-11B-v2.6-Instruct (each in closed-book, basic RAG, advanced RAG variants).
- Judge model: GPT-4o, system-prompted as above.
- Retrieval conditions: closed-book (no context), basic RAG (statutes only), advanced RAG (all corpora).
- Scoring: Only the written submissions are judged by GPT-4o. No corpus documents are provided to the judge during grading—only rubric and answer.
Key results (Karp et al., 6 Nov 2025, Table 9):
- Bielik outputs: Human = 8/100, LLM = 85/100
- Claude outputs: Human = 30/100, LLM = 90/100
- GPT-4.1 outputs: Human = 37/100, LLM = 88/100
The LLM judge (GPT-4o) consistently overestimates submission quality, especially on substantive legal reasoning, compressing the point range (85–90) relative to the human range (8–37).
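A quick arithmetic check of the gaps implied by these numbers (derived directly from the reported scores, not additional data):

```python
# Total-score gaps implied by Table 9 (human examiner vs. GPT-4o judge).
table9 = {
    "Bielik-11B-v2.6-Instruct": {"human": 8,  "llm": 85},
    "Claude 4 Sonnet":          {"human": 30, "llm": 90},
    "GPT-4.1":                  {"human": 37, "llm": 88},
}
for model, s in table9.items():
    print(f"{model}: delta_total = {s['llm'] - s['human']:+d} points")
# Bielik: +77, Claude: +60, GPT-4.1: +51; the judge compresses the 8-37 human range into 85-90.
```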
Qualitative analysis of GPT-4o's feedback revealed a tendency to penalize only surface-level features ("minor issues in section numbering," "verbose blocks"), failing to sanction fundamental legal defects (hallucinated statutes, reasoning errors) that human examiners highlighted.
5. Limitations, Failure Modes, and Lessons for Automated Legal Judging
Identified principal failure modes:
- Surface-level bias: LLM-judge overweighted formatting, structure, and verbosity, systematically neglecting accuracy of cited legal provisions or depth of factual/legal analysis.
- Over-leniency: Granted high marks even in presence of cardinal errors in legal reasoning or hallucinated legal citations.
- Inconsistent rubric adherence: Failed to penalize invention, omission, or misapplication of statutes even where the rubric explicitly mandated such penalties.
Recommendations for improved practice:
- Hybrid validation: Combine LLM-judge output with human spot-checks, especially for high-stakes or critical items.
- Ensemble judging: Aggregate judgments from multiple LLMs, e.g., via majority vote, to mitigate single-model surface biases.
- Expert-annotated “hard negatives”: Introduce failure patterns (e.g., hallucinations, miscitations) into judge prompt exemplars to sensitize models to domain-specific pathological errors.
- Structured auxiliary checks: Implement verifier agents that confirm every citation in the answer against the retrieved legal corpus before awarding points (see the sketch after this list).
- Human-in-the-loop sign-off: Mandate final, human adjudication for automated legal evaluations to ensure professional-grade reliability.
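As a sketch of the structured auxiliary check suggested above, the following verifier flags citations that never appear in the retrieved corpus; the citation regex and corpus interface are illustrative assumptions, not part of the paper's pipeline.

```python
# Sketch of a structured auxiliary check: verify that every statute cited in an answer
# actually appears in the retrieved legal corpus before any points are awarded.
# The citation pattern and corpus representation are illustrative assumptions.
import re
from typing import Iterable

CITATION_PATTERN = re.compile(r"art\.\s*\d+[a-z]?(?:\s*§\s*\d+)?\s+[A-Za-z.]+", re.IGNORECASE)

def extract_citations(answer_markdown: str) -> list[str]:
    """Pull statute-like citations (e.g. 'art. 415 k.c.') out of the answer text."""
    return [m.group(0).strip() for m in CITATION_PATTERN.finditer(answer_markdown)]

def verify_citations(answer_markdown: str, corpus_chunks: Iterable[str]) -> dict:
    """Mark citations absent from the retrieved corpus as likely hallucinations."""
    corpus_text = " ".join(corpus_chunks).lower()
    return {c: c.lower() in corpus_text for c in extract_citations(answer_markdown)}

# Usage: a grading workflow could withhold 'Legal Provisions' points for any answer
# whose verdict contains unverified (False) citations and route it to a human reviewer.
```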
6. Broader Implications for LLM-as-Judge in Law and Other Expert Domains
This legal qualification case study demonstrates critical weaknesses of current LLM-as-Judge systems in professional, high-stakes expert domains:
- Standard LLM-judges can substantially over-score submissions due to implicit bias toward form and style at the expense of deep, expert-level validation.
- The gap between human and LLM scores on legally substantive issues highlights inherent limits of current LLMs’ ability to reason over specialized domain knowledge or recognize subtle but decisive errors.
- The findings underscore the necessity for robust, hybridized evaluation workflows that combine the scalability of LLM-based assessment with human oversight and sophisticated tooling—especially in regulated fields where misjudgment has legal, professional, or societal consequences.
Without advanced safeguards (domain-specific fine-tuning, multi-model ensembles, citation verification, systematic human audit), automated LLM-judge scores misrepresent true legal correctness and pose material risks if deployed as substitutes for qualified human adjudication (Karp et al., 6 Nov 2025).