LeMAJ: LLM-as-Judge Framework
- LeMAJ is a reference-free evaluation framework for legal Q&A that decomposes LLM-generated answers into discrete legal data points.
- It mirrors attorney-like reasoning by classifying assertions into Correct, Incorrect, Irrelevant, and Missing to ensure nuanced performance measurement.
- The framework demonstrates strong alignment with human expert judgments, achieving higher correlation and accuracy compared to traditional metrics.
LeMAJ (“Legal LLM-as-a-Judge”) is a reference-free evaluation framework designed to assess the performance of LLMs on legal question-answering tasks. It addresses the challenges inherent in evaluating legal reasoning, particularly the need for explanations that reflect the granular, attorney-like approach adopted by human legal experts. Rather than relying on expensive and difficult-to-construct reference answers, LeMAJ decomposes LLM answers into discrete “Legal Data Points” (LDPs) and applies a classification and scoring protocol closely aligned with professional legal evaluation practice (Enguehard et al., 8 Oct 2025).
1. Formalism of the LeMAJ Framework
Given a legal document $D$, a question $Q$ about $D$, and an LLM-generated answer $A$, LeMAJ operates as follows:
- Decomposition: The answer $A$ is transformed into a set of self-contained assertions, or Legal Data Points:
$$\mathrm{LDP}(A) = \{\ell_1, \ell_2, \ldots, \ell_n\},$$
where each $\ell_i$ corresponds to a single legal assertion or clause citation.
- Classification: Each $\ell_i$ is assigned a label via a four-way classification function:
  - Correct: Factually accurate and relevant to $Q$.
  - Incorrect: Contains a factual error or hallucination.
  - Irrelevant: Factually accurate, but not pertinent to $Q$.
  - Missing: An assertion absent from $A$ that should have been present (critical omission).
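- Metrics: After tagging, performance is assessed using four scores computed from the per-class LDP counts: Correctness, Precision, Recall, and $F_1$ (the harmonic mean of Precision and Recall).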
This scheme separates factuality, relevance, and completeness within a unified protocol, enabling nuanced measurement of LLM output quality.
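As a minimal sketch of how tagged LDPs could be turned into scores, the Python below counts tags and derives the four metrics. The formulas for Correctness, Precision, and Recall in the code are assumptions made for illustration and are not taken from the paper; only $F_1$ as the harmonic mean of Precision and Recall is standard. The names `Tag`, `LDP`, and `score` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    CORRECT = "Correct"
    INCORRECT = "Incorrect"
    IRRELEVANT = "Irrelevant"
    MISSING = "Missing"


@dataclass(frozen=True)
class LDP:
    """One Legal Data Point: a single assertion plus its tag."""
    text: str
    tag: Tag


def _ratio(num: float, den: float) -> float:
    # Return 0.0 for an empty denominator instead of raising.
    return num / den if den else 0.0


def score(ldps: list[LDP]) -> dict[str, float]:
    counts = Counter(ldp.tag for ldp in ldps)
    c, inc = counts[Tag.CORRECT], counts[Tag.INCORRECT]
    irr, mis = counts[Tag.IRRELEVANT], counts[Tag.MISSING]

    # ASSUMPTION: the three ratios below are illustrative definitions only.
    correctness = _ratio(c + irr, c + irr + inc)  # share of stated LDPs that are factually accurate
    precision = _ratio(c, c + irr + inc)          # share of stated LDPs that are correct and relevant
    recall = _ratio(c, c + mis)                   # share of expected LDPs that were stated

    # Standard definition: harmonic mean of precision and recall.
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {"correctness": correctness, "precision": precision, "recall": recall, "f1": f1}


example = [
    LDP("The term is five years.", Tag.CORRECT),
    LDP("Renewal is automatic.", Tag.CORRECT),
    LDP("Notice must be in writing.", Tag.CORRECT),
    LDP("The governing law is French law.", Tag.INCORRECT),
    LDP("The parties are incorporated in Delaware.", Tag.IRRELEVANT),
    LDP("Termination requires 30 days notice.", Tag.MISSING),
]
result = score(example)
print(result)
```

Swapping in different ratio definitions only requires changing the three assumed lines; the counting step stays the same.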
2. Construction and Role of Legal Data Points (LDPs)
2.1 LDP Construction
LDPs are constructed via the following steps:
- An off-the-shelf LLM (e.g., Claude 3.5 Sonnet v2) is prompted to extract all discrete assertions from the answer $A$.
- Each extracted LDP is tagged using the four-class schema (Correct, Incorrect, Irrelevant, Missing).
2.2 Enabling Reference-Free Evaluation
LDPs allow LeMAJ to implement reference-free evaluation by:
- Emulating the evaluation strategy of legal experts, who decompose answers into atomic assertions for judgment of correctness, relevance, and omissions.
- Removing dependence on gold-standard reference answers, which are expensive for qualified human annotators to produce.
- Offering an explicit penalty for omissions (via Missing LDPs) without requiring human-authored ground truth.
This approach aligns LLM evaluation methodology with established legal expert practice and increases scalability.
3. Algorithmic Evaluation Workflow
LeMAJ’s reference-free evaluation methodology is structured as follows:
- Segmentation: Extract LDPs from the answer $A$ using an LLM segmenter.
- Classification: Tag each LDP with respect to the document $D$ and the question $Q$ (via LLM chain-of-thought prompts) as Correct, Incorrect, Irrelevant, or Missing.
- Metric Calculation: Compute cardinalities for each class:
  - $|\text{Correct}|$
  - $|\text{Incorrect}|$
  - $|\text{Irrelevant}|$
  - $|\text{Missing}|$
- Scoring: Derive Correctness, Precision, Recall, and $F_1$ as defined above.
This algorithm approximates the sequence of reasoning and annotation performed by professional lawyers in assessment tasks.
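The workflow above can be sketched as a small pipeline with pluggable steps. This is an illustration under stated assumptions, not the paper's implementation: the callables `segment`, `classify`, and `find_missing` are hypothetical stand-ins for LLM calls, the toy versions exist only so the example runs, and treating omission detection as a separate step is also an assumption.

```python
from collections import Counter
from typing import Callable

TAGS = ("Correct", "Incorrect", "Irrelevant", "Missing")


def evaluate_answer(
    document: str,
    question: str,
    answer: str,
    segment: Callable[[str], list[str]],
    classify: Callable[[str, str, str], str],
    find_missing: Callable[[str, str, list[str]], list[str]],
) -> dict[str, int]:
    """Segment the answer, tag each LDP, and count tags per class."""
    ldps = segment(answer)
    tags = [classify(ldp, document, question) for ldp in ldps]
    # ASSUMPTION: omissions come from a separate judge step, one tag per omission.
    tags += ["Missing"] * len(find_missing(document, question, ldps))
    counts = Counter(tags)
    return {tag: counts[tag] for tag in TAGS}


# Toy stand-ins (hypothetical) so the sketch runs without an LLM.
def toy_segment(answer: str) -> list[str]:
    # Naive sentence split; a real segmenter would be an LLM prompt.
    return [s.strip().rstrip(".") for s in answer.split(". ") if s.strip()]


def toy_classify(ldp: str, document: str, question: str) -> str:
    # Substring check as a crude proxy for factual support in the document.
    return "Correct" if ldp in document else "Incorrect"


def toy_find_missing(document: str, question: str, ldps: list[str]) -> list[str]:
    expected = "30 days notice"
    mentioned = any(expected in ldp for ldp in ldps)
    return [] if mentioned or expected not in document else [expected]


DOCUMENT = (
    "The agreement terminates on 31 December 2025. "
    "Either party may terminate with 30 days notice."
)
QUESTION = "How can the agreement end?"
ANSWER = "The agreement terminates on 31 December 2025. The governing law is French law."

counts = evaluate_answer(DOCUMENT, QUESTION, ANSWER, toy_segment, toy_classify, toy_find_missing)
print(counts)
```

Passing the steps in as callables keeps the counting logic testable without model access; real LLM-backed functions can replace the toy ones without changing `evaluate_answer`.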
4. Empirical Performance and Comparative Results
4.1 Baseline Methods
LeMAJ is evaluated against two categories of baseline methods:
- Reference-based (Non-LLM) Metrics: BLEU-1–4, ROUGE-1/2/L, BERTScore, BARTScore.
- Reference-free LLM-as-Judge Methods: DeepEval Answer Relevancy, DeepEval Faithfulness, DeepEval Correctness (G-Eval), DeepEval Hallucination.
4.2 Evaluation Datasets and Protocol
- Proprietary Dataset: 1,000 Q&A pairs from nine contract types.
- LegalBench Subset: 150 expert-curated Q&A plus 20 synthetic incorrect answers (170 total).
- LLM Models: Claude 3.5 Sonnet v1 for answer generation, v2 for evaluation.
- Human Ground Truth: Two lawyers provided gold-standard ratings for Correctness and Relevance on a 5-point scale, mapped to $[0, 1]$.
4.3 Quantitative Results
Relevance Correlation (Pearson’s $\rho$) and Bucketed Accuracy
| Method | ρ / p-value (Proprietary) | Bucketed Accuracy |
|---|---|---|
| BLEU-1 | 0.049 / 0.152 | 0.05 / 0.08 |
| ROUGE-2 | 0.133 / 8.7·10⁻⁵ | 0.03 / 0.09 |
| BERTScore | 0.174 / 2.5·10⁻⁷ | 0.02 / 0.05 |
| DeepEval: Answer Relevancy | 0.000 / 0.992 | 0.37 / 0.45 |
| LeMAJ ($F_1$) | 0.370 / 1.5·10⁻²⁹ | 0.50 / 0.35 |
Correctness Correlation (Pearson’s $\rho$) and Bucketed Accuracy
| Method | ρ / p-value (Proprietary) | Bucketed Accuracy |
|---|---|---|
| BLEU-4 | 0.090 / 7.8·10⁻³ | 0.02 / 0.08 |
| ROUGE-1 | 0.139 / 3.9·10⁻⁵ | 0.01 / 0.05 |
| DeepEval: Correctness | 0.077 / 2.4·10⁻² | 0.43 / 0.24 |
| DeepEval: Hallucination | 0.080 / 1.8·10⁻² | 0.04 / 0.14 |
| LeMAJ (Correctness) | 0.259 / 7.5·10⁻¹⁵ | 0.95 / 0.88 |
LeMAJ attains substantially higher correlation with human expert judgments and higher bucketed accuracy than both reference-based and LLM-as-Judge baselines, without recourse to reference answers (Enguehard et al., 8 Oct 2025).
5. Correlation with Human Judgments
LeMAJ’s scores show a strong, statistically significant concordance with expert legal annotators:
- Relevance (Pearson’s ρ):
  - Proprietary: 0.370 ($p = 1.5 \times 10^{-29}$)
  - LegalBench: 0.354
- Correctness (Pearson’s ρ):
  - Proprietary: 0.259 ($p = 7.5 \times 10^{-15}$)
  - LegalBench: 0.700
All reported $p$-values are far below conventional significance thresholds, indicating robust alignment of LeMAJ metrics with professional judgments.
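To make the correlation figures concrete, here is a minimal pure-Python sketch of Pearson's coefficient between judge scores and rescaled human ratings. The linear rescaling of the 5-point scale and all sample numbers are assumptions for illustration; they are not data from the paper. For p-values, a statistics library such as SciPy (`scipy.stats.pearsonr`) could be used instead.

```python
import math


def rescale_rating(rating: int, low: int = 1, high: int = 5) -> float:
    # ASSUMPTION: a linear map from the 5-point scale onto [0, 1].
    return (rating - low) / (high - low)


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up ratings and judge scores, purely illustrative.
human = [rescale_rating(r) for r in [5, 4, 2, 1, 3]]
judge = [0.9, 0.8, 0.3, 0.2, 0.4]
r = pearson(human, judge)
print(round(r, 3))
```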
6. Inter-Annotator Agreement Enhancement
Manual annotation and LeMAJ-guided annotation were compared for inter-annotator agreement (IAA), measured as Cohen’s $\kappa$ and reported as mean agreement percentages:
| Judgment | Manual IAA | LeMAJ-guided IAA | Δ (pp) |
|---|---|---|---|
| Correctness | 0.77 | 0.88 | +11 |
| Relevance | 0.53 | 0.54 | +1 |
The increase in Correctness IAA reflects that LDP-guided segmentation and standardized tagging reduce subjectivity and variability in factuality judgments. Minimal change in Relevance IAA points to the inherent subjectivity of legal relevance annotation.
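For reference, Cohen's $\kappa$ compares observed agreement $p_o$ with chance agreement $p_e$ as $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two annotators; the function name `cohens_kappa` and the sample labels are illustrative and not drawn from the study.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up correctness judgments from two annotators.
annotator_1 = ["Correct", "Correct", "Incorrect", "Incorrect"]
annotator_2 = ["Correct", "Correct", "Incorrect", "Correct"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```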
7. Open-Source LegalBench Subset and Reproducibility
7.1 Released Dataset
A subset of LegalBench has been made available, containing:
- 12 contract types, 150 Q&A pairs, and 20 synthetic incorrect answers.
- Each sample encodes: contract text $D$, question $Q$, LLM-generated answer $A$, an array of LDPs (each with text, character span, and tag), and human scores (correctness and relevance, scaled to $[0, 1]$).
7.2 Format and Licensing
The data is distributed as JSONL (one object per Q&A), under an MIT license. Public repository: https://github.com/robin-ai/LeMAJ-LegalBench-subset
7.3 Experimental Replication
- Load the JSONL file.
- Apply the LeMAJ_Evaluate protocol to each sample (document $D$, question $Q$, answer $A$); compare the extracted LDPs and tags to the released data.
- Compute Pearson’s $\rho$ and bucketed accuracy against human_score.
- Optionally, employ the provided UI for human re-annotation and inter-annotator agreement experiments.
The released codebase and evaluation harness enable exact replication of the reported results and facilitate extensibility to other tasks or evaluation modalities, such as integrating alternate reasoning schemas or multi-agent jury setups.
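The first replication steps can be sketched as follows. The JSONL field names (`question`, `answer`, `ldps`, `text`, `tag`) and the nested shape of `human_score` are hypothetical; the released files should be inspected for the actual schema. An in-memory sample stands in for the real file so the example is self-contained.

```python
import io
import json
from collections import Counter

# ASSUMPTION: illustrative records with hypothetical field names, not released data.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "question": "What is the term?",
        "answer": "The term is five years. Renewal is automatic.",
        "ldps": [
            {"text": "The term is five years", "tag": "Correct"},
            {"text": "Renewal is automatic", "tag": "Correct"},
        ],
        "human_score": {"correctness": 1.0, "relevance": 0.75},
    }),
    json.dumps({
        "question": "Who may terminate?",
        "answer": "Only the supplier may terminate.",
        "ldps": [
            {"text": "Only the supplier may terminate", "tag": "Incorrect"},
            {"text": "Either party may terminate", "tag": "Missing"},
        ],
        "human_score": {"correctness": 0.0, "relevance": 0.5},
    }),
])


def load_records(stream) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def tag_counts(record: dict) -> Counter:
    """Count LDP tags within a single record."""
    return Counter(ldp["tag"] for ldp in record["ldps"])


# With a real file: `with open(path, encoding="utf-8") as f: records = load_records(f)`
records = load_records(io.StringIO(SAMPLE_JSONL))
per_record_counts = [tag_counts(rec) for rec in records]
human_correctness = [rec["human_score"]["correctness"] for rec in records]
print(per_record_counts, human_correctness)
```

The per-record counts and human scores collected here are the inputs needed for the correlation and bucketed-accuracy comparisons.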
References
LeMAJ: “LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation” (Enguehard et al., 8 Oct 2025).