LeMAJ: LLM-as-Judge Framework

Updated 5 April 2026

LeMAJ is a reference-free evaluation framework for legal Q&A that decomposes LLM-generated answers into discrete legal data points.
It mirrors attorney-like reasoning by classifying assertions into Correct, Incorrect, Irrelevant, and Missing to ensure nuanced performance measurement.
The framework demonstrates strong alignment with human expert judgments, achieving higher correlation and accuracy compared to traditional metrics.

LeMAJ (“Legal LLM-as-a-Judge”) is a reference-free evaluation framework designed to assess the performance of LLMs on legal question-answering tasks. It addresses the challenges inherent in evaluating legal reasoning, particularly the need for explanations that reflect the granular, attorney-like approach adopted by human legal experts. Rather than relying on expensive and difficult-to-construct reference answers, LeMAJ decomposes LLM answers into discrete “Legal Data Points” (LDPs) and applies a classification and scoring protocol closely aligned with professional legal evaluation practice (Enguehard et al., 8 Oct 2025).

1. Formalism of the LeMAJ Framework

Given a legal document $D$, a question $Q$ about $D$, and an LLM-generated answer $A$, LeMAJ operates as follows:

Decomposition: The answer $A$ is transformed into a set of $n$ self-contained assertions, or Legal Data Points:

$A \longrightarrow \{p_1, p_2, \dots, p_n\}$

where each $p_i$ corresponds to a single legal assertion or clause citation.

Classification: Each $p_i$ is assigned a label via a four-way classification function:

$f : \{p_1, \dots, p_n\} \to \{\text{Correct}, \text{Incorrect}, \text{Irrelevant}, \text{Missing}\}$

Correct: Factually accurate and relevant to $Q$.
Incorrect: Contains a factual error or hallucination.
Irrelevant: Factually accurate, but not pertinent to $Q$.
Missing: An assertion absent from $A$ that should have been present (a critical omission).

Metrics: After tagging, performance is summarized by four scores computed from the number of LDPs in each class: Correctness, Precision, Recall, and $F_1$.

This scheme separates factuality, relevance, and completeness within a unified protocol, enabling nuanced measurement of LLM output quality.
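
The counting-and-scoring step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Tag` and `score` are hypothetical, and the formulas are assumed readings of the four metrics (Correctness as the share of asserted LDPs that are factually accurate, Precision as the share of accurate asserted LDPs that are relevant, Recall as the share of required LDPs that were provided, and $F_1$ as the harmonic mean of Precision and Recall).

```python
from collections import Counter
from enum import Enum


class Tag(Enum):
    """The four LDP labels used by the classification step."""
    CORRECT = "Correct"
    INCORRECT = "Incorrect"
    IRRELEVANT = "Irrelevant"
    MISSING = "Missing"


def _ratio(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def score(tags: list[Tag]) -> dict[str, float]:
    """Turn a list of LDP tags into the four scores.

    ASSUMPTION: the formulas below are illustrative readings of the
    metric names, not formulas confirmed by the source.
    """
    n = Counter(tags)
    c, inc = n[Tag.CORRECT], n[Tag.INCORRECT]
    irr, mis = n[Tag.IRRELEVANT], n[Tag.MISSING]

    correctness = _ratio(c + irr, c + irr + inc)  # factuality (assumed)
    precision = _ratio(c, c + irr)                # relevance (assumed)
    recall = _ratio(c, c + mis)                   # completeness (assumed)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "correctness": correctness,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


example = [Tag.CORRECT] * 3 + [Tag.INCORRECT, Tag.IRRELEVANT, Tag.MISSING]
print(score(example))
# {'correctness': 0.8, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Returning 0.0 for an empty denominator is a design choice made here so that an answer with no LDPs still yields a score; other conventions are possible.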

2. Construction and Role of Legal Data Points (LDPs)

2.1 LDP Construction

LDPs are constructed via the following steps:

An off-the-shelf LLM (e.g., Claude 3.5 Sonnet v2) is prompted to extract all discrete assertions from $A$.
Each extracted $p_i$ is tagged using the four-class schema (Correct, Incorrect, Irrelevant, Missing).
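
The two construction steps above can be illustrated with a small prompt builder and response parser. This is a hedged sketch only: the prompt wording, the JSON response shape, and the function names `build_prompt` and `parse_ldps` are all hypothetical and are not taken from the source, and no particular LLM client is assumed.

```python
import json

TAGS = ("Correct", "Incorrect", "Irrelevant", "Missing")


def build_prompt(document: str, question: str, answer: str) -> str:
    """Assemble a hypothetical instruction for LDP extraction and tagging."""
    return (
        "Split the Answer into self-contained legal assertions and tag each "
        f"one as {', '.join(TAGS)}. Add a Missing entry for every assertion "
        "the Answer should have contained but did not.\n"
        'Reply with a JSON list of {"text": ..., "tag": ...} objects.\n\n'
        f"Contract:\n{document}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n"
    )


def parse_ldps(raw: str) -> list[dict[str, str]]:
    """Validate the (assumed) JSON reply and return cleaned LDP records."""
    ldps = []
    for item in json.loads(raw):
        tag = item["tag"]
        if tag not in TAGS:
            raise ValueError(f"unknown tag: {tag!r}")
        ldps.append({"text": item["text"].strip(), "tag": tag})
    return ldps


# Usage with a hand-written reply standing in for a model response.
reply = '[{"text": " The term is two years. ", "tag": "Correct"}]'
print(parse_ldps(reply))
# [{'text': 'The term is two years.', 'tag': 'Correct'}]
```

Rejecting unknown tags early keeps malformed model output from silently distorting the downstream counts.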

2.2 Enabling Reference-Free Evaluation

LDPs allow LeMAJ to implement reference-free evaluation by:

Emulating the evaluation strategy of legal experts, who decompose answers into atomic assertions for judgment of correctness, relevance, and omissions.
Removing dependence on gold-standard reference answers, which are expensive for qualified human annotators to produce.
Offering an explicit penalty for omissions (via Missing LDPs) without requiring human-authored ground truth.

This approach aligns LLM evaluation methodology with established legal expert practice and increases scalability.

3. Algorithmic Evaluation Workflow

LeMAJ’s reference-free evaluation methodology is structured as follows:

Segmentation: Extract LDPs from $A$ using an LLM segmenter.
Classification: Tag each $p_i$ with respect to $D$ and $Q$ (via LLM chain-of-thought prompts) as Correct, Incorrect, Irrelevant, or Missing.
Metric Calculation: Compute cardinalities for each class:
- $|\text{Correct}|$
- $|\text{Incorrect}|$
- $|\text{Irrelevant}|$
- $|\text{Missing}|$
Scoring: Derive Correctness, Precision, Recall, and $F_1$ from these counts.

This algorithm approximates the sequence of reasoning and annotation performed by professional lawyers in assessment tasks.
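
The workflow above can be sketched as a small orchestration function with pluggable steps. This is an illustrative sketch, not the released harness: `evaluate`, `Segmenter`, and `Judge` are hypothetical names, the two toy functions merely stand in for LLM calls, and letting the judge append Missing entries is an assumption about how omissions are surfaced.

```python
from collections import Counter
from typing import Callable

# Hypothetical step signatures (assumptions, not from the source).
Segmenter = Callable[[str], list[str]]  # answer -> LDP texts
Judge = Callable[[list[str], str, str], list[tuple[str, str]]]
# Judge: (ldps, document, question) -> [(text, tag), ...], possibly with
# extra entries tagged "Missing" for omitted assertions.

CLASSES = ("Correct", "Incorrect", "Irrelevant", "Missing")


def evaluate(document: str, question: str, answer: str,
             segment: Segmenter, judge: Judge) -> Counter:
    """Run segmentation, classification, and counting for one answer."""
    ldps = segment(answer)
    counts = Counter({name: 0 for name in CLASSES})
    for _text, tag in judge(ldps, document, question):
        if tag not in CLASSES:
            raise ValueError(f"unknown tag: {tag!r}")
        counts[tag] += 1
    return counts


def toy_segment(answer: str) -> list[str]:
    """Toy stand-in for the LLM segmenter: split on full stops."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_judge(ldps: list[str], document: str,
              question: str) -> list[tuple[str, str]]:
    """Toy stand-in for the LLM judge: substring match plus one omission."""
    tagged = [(p, "Correct" if p in document else "Incorrect") for p in ldps]
    tagged.append(("Notice period is 30 days", "Missing"))
    return tagged


doc = "Term is two years. Notice period is 30 days."
counts = evaluate(doc, "What is the term?",
                  "Term is two years. Term is five years.",
                  toy_segment, toy_judge)
print(dict(counts))
# {'Correct': 1, 'Incorrect': 1, 'Irrelevant': 0, 'Missing': 1}
```

Passing the segmenter and judge as callables keeps the counting logic testable without any model access.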

4. Empirical Performance and Comparative Results

4.1 Baseline Methods

LeMAJ is evaluated against two categories of baseline methods:

Reference-based (Non-LLM) Metrics: BLEU-1–4, ROUGE-1/2/L, BERTScore, BARTScore.
Reference-free LLM-as-Judge Methods: DeepEval Answer Relevancy, DeepEval Faithfulness, DeepEval Correctness (G-Eval), DeepEval Hallucination.

4.2 Evaluation Datasets and Protocol

Proprietary Dataset: 1,000 Q&A pairs from nine contract types.
LegalBench Subset: 150 expert-curated Q&A plus 20 synthetic incorrect answers (170 total).
LLM Models: Claude 3.5 Sonnet v1 for answer generation, v2 for evaluation.
Human Ground Truth: Two lawyers provided gold-standard ratings for Correctness and Relevance on a 5-point scale, mapped to $[0, 1]$.

4.3 Quantitative Results

Relevance Correlation (Pearson’s $\rho$) and Bucketed Accuracy

Method	ρ / p-value (Proprietary)	Bucketed Accuracy (Proprietary / LegalBench)
BLEU-1	0.049 / 0.152	0.05 / 0.08
ROUGE-2	0.133 / 8.7·10⁻⁵	0.03 / 0.09
BERTScore	0.174 / 2.5·10⁻⁷	0.02 / 0.05
DeepEval: Answer Relevancy	0.000 / 0.992	0.37 / 0.45
LeMAJ	0.370 / 1.5·10⁻²⁹	0.50 / 0.35

Correctness Correlation (Pearson’s $\rho$) and Bucketed Accuracy

Method	ρ / p-value (Proprietary)	Bucketed Accuracy (Proprietary / LegalBench)
BLEU-4	0.090 / 7.8·10⁻³	0.02 / 0.08
ROUGE-1	0.139 / 3.9·10⁻⁵	0.01 / 0.05
DeepEval: Correctness	0.077 / 2.4·10⁻²	0.43 / 0.24
DeepEval: Hallucination	0.080 / 1.8·10⁻²	0.04 / 0.14
LeMAJ (Correctness)	0.259 / 7.5·10⁻¹⁵	0.95 / 0.88

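LeMAJ attains substantially higher correlation with human expert judgments than both reference-based and LLM-as-Judge baselines, without recourse to reference answers, and higher bucketed accuracy in nearly every comparison; the one exception in the tables above is the second relevance bucketed-accuracy column, where DeepEval Answer Relevancy reaches 0.45 against LeMAJ’s 0.35 (Enguehard et al., 8 Oct 2025).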

5. Correlation with Human Judgments

LeMAJ’s scores correlate positively and significantly with the ratings of expert legal annotators:

Relevance (Pearson’s ρ):
- Proprietary: 0.370 ($p = 1.5 \times 10^{-29}$)
- LegalBench: 0.354
Correctness (Pearson’s ρ):
- Proprietary: 0.259 ($p = 7.5 \times 10^{-15}$)
- LegalBench: 0.700

All $p$-values are far below conventional significance thresholds, indicating robust alignment of LeMAJ metrics with professional judgments.
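
For readers who want to recompute such correlations, the following is a minimal, dependency-free sketch of Pearson's coefficient between automatic scores and human ratings. The function name `pearson` and the sample values are hypothetical and do not come from the paper's data; the sketch does not compute a p-value, for which a statistics library such as SciPy (`scipy.stats.pearsonr`) would normally be used.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical metric scores and human ratings, both on a 0-to-1 scale.
metric_scores = [0.9, 0.4, 0.75, 1.0, 0.2]
human_ratings = [1.0, 0.5, 0.75, 1.0, 0.0]
print(round(pearson(metric_scores, human_ratings), 3))
```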

6. Inter-Annotator Agreement Enhancement

Manual annotation and LeMAJ-guided annotation were compared for inter-annotator agreement (IAA), measured as Cohen’s $\kappa$ and reported as mean agreement percentages:

Judgment	Manual IAA	LeMAJ-guided IAA	Δ (pp)
Correctness	0.77	0.88	+11
Relevance	0.53	0.54	+1

The increase in Correctness IAA reflects that LDP-guided segmentation and standardized tagging reduce subjectivity and variability in factuality judgments. Minimal change in Relevance IAA points to the inherent subjectivity of legal relevance annotation.
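
As a reference for how Cohen's $\kappa$ corrects raw agreement for chance, here is a minimal sketch for two annotators labelling the same items. The function name `cohens_kappa` and the example labels are hypothetical, and the sketch is not the evaluation code used in the study.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["Correct", "Correct", "Incorrect", "Incorrect"]
annotator_2 = ["Correct", "Correct", "Incorrect", "Correct"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In this example the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and so $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$.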

7. Open-Source LegalBench Subset and Reproducibility

7.1 Released Dataset

A subset of LegalBench has been made available, containing:

12 contract types, 150 Q&A pairs, and 20 synthetic incorrect answers.
Each sample encodes: contract text $D$, question $Q$, LLM-generated answer $A$, an array of LDPs (with text, character span, and tag), and human scores (correctness and relevance, scaled to $[0, 1]$).

7.2 Format and Licensing

The data is distributed as JSONL (one object per Q&A), under an MIT license. Public repository: https://github.com/robin-ai/LeMAJ-LegalBench-subset

7.3 Experimental Replication

Load the JSONL file.
Run the LeMAJ_Evaluate protocol on each $(D, Q, A)$ triple; compare the extracted LDPs and tags to the released data.
Compute Pearson’s $\rho$ and bucketed accuracy against human_score.
Optionally, employ the provided UI for human re-annotation and inter-annotator agreement experiments.
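
The first replication step can be sketched as below. This is a hedged illustration: the key names `ldps` and `tag` are assumptions about the released schema and should be checked against the actual file, and the helper names `load_jsonl` and `tag_counts` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file: one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def tag_counts(record: dict) -> Counter:
    """Count LDP tags in one record.

    ASSUMPTION: each record holds its LDPs under "ldps", and each LDP
    holds its label under "tag"; adjust to the released schema.
    """
    return Counter(ldp["tag"] for ldp in record.get("ldps", []))


def summarize(path: str) -> Counter:
    """Aggregate tag counts over every record in the file."""
    total: Counter = Counter()
    for record in load_jsonl(path):
        total.update(tag_counts(record))
    return total
```

A call such as `summarize("subset.jsonl")`, where the file name is a placeholder, would return the overall number of LDPs per tag, which can then be compared with counts produced by a fresh evaluation run.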

The released codebase and evaluation harness enable exact replication of the reported results and facilitate extensibility to other tasks or evaluation modalities, such as integrating alternate reasoning schemas or multi-agent jury setups.

References

LeMAJ: “LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation” (Enguehard et al., 8 Oct 2025).
