LeMAJ: LLM-as-Judge Framework

Updated 5 April 2026
  • LeMAJ is a reference-free evaluation framework for legal Q&A that decomposes LLM-generated answers into discrete legal data points.
  • It mirrors attorney-like reasoning by classifying assertions into Correct, Incorrect, Irrelevant, and Missing to ensure nuanced performance measurement.
  • The framework demonstrates strong alignment with human expert judgments, achieving higher correlation and accuracy compared to traditional metrics.

LeMAJ (“Legal LLM-as-a-Judge”) is a reference-free evaluation framework designed to assess the performance of LLMs on legal question-answering tasks. It addresses the challenges inherent in evaluating legal reasoning, particularly the need for explanations that reflect the granular, attorney-like approach adopted by human legal experts. Rather than relying on expensive and difficult-to-construct reference answers, LeMAJ decomposes LLM answers into discrete “Legal Data Points” (LDPs) and applies a classification and scoring protocol closely aligned with professional legal evaluation practice (Enguehard et al., 8 Oct 2025).

1. Formalism of the LeMAJ Framework

Given a legal document $D$, a question $Q$ about $D$, and an LLM-generated answer $A$, LeMAJ operates as follows:

  • Decomposition: The answer $A$ is transformed into a set of $n$ self-contained assertions, or Legal Data Points:

$$A \longrightarrow \{p_1, p_2, \dots, p_n\}$$

where each $p_i$ corresponds to a single legal assertion or clause citation.

  • Classification: Each $p_i$ is assigned a label via a four-way classification function:

$$f : \{p_1, \dots, p_n\} \to \{\text{Correct}, \text{Incorrect}, \text{Irrelevant}, \text{Missing}\}$$

    • Correct: Factually accurate and relevant to $Q$.
    • Incorrect: Contains a factual error or hallucination.
    • Irrelevant: Factually accurate, but not pertinent to $Q$.
    • Missing: An assertion absent from $A$ that should have been present (a critical omission).
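  • Metrics: After tagging, performance is summarized by four scores, Correctness, Precision, Recall, and $F_1$, each computed from the number of LDPs assigned to each class.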

This scheme separates factuality, relevance, and completeness within a unified protocol, enabling nuanced measurement of LLM output quality.
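
The tallying and scoring step can be sketched in Python as follows. The formulas inside `score` are illustrative assumptions, not the framework's published definitions: Correctness is taken as the share of correct points among factual judgments, Precision additionally penalizes irrelevant points, Recall penalizes missing points, and F1 is their harmonic mean. The names `Label`, `_ratio`, and `score` are hypothetical.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    """The four LDP classes named in the classification step."""
    CORRECT = "Correct"
    INCORRECT = "Incorrect"
    IRRELEVANT = "Irrelevant"
    MISSING = "Missing"


def _ratio(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def score(labels: list[Label]) -> dict[str, float]:
    """Turn per-LDP labels into four scores using ASSUMED formulas."""
    n = Counter(labels)
    c, i = n[Label.CORRECT], n[Label.INCORRECT]
    r, m = n[Label.IRRELEVANT], n[Label.MISSING]
    correctness = _ratio(c, c + i)      # assumption: factuality only
    precision = _ratio(c, c + i + r)    # assumption: irrelevant points also count against
    recall = _ratio(c, c + m)           # assumption: omissions count against
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "correctness": correctness,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Keeping the counts and the formulas separate means the assumed definitions can be swapped for the framework's own without touching the tagging code.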

2.1 LDP Construction

LDPs are constructed via the following steps:

  1. An off-the-shelf LLM (e.g., Claude 3.5 Sonnet v2) is prompted with $(D, Q, A)$ to extract all discrete assertions from $A$.
  2. Each extracted $p_i$ is tagged using the four-class schema (Correct, Incorrect, Irrelevant, Missing).
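
A minimal sketch of the extraction step, assuming the model is asked to return one assertion per numbered line. The prompt wording, the helpers `build_prompt`, `parse_ldps`, and `extract_ldps`, and the injected `llm` callable are all hypothetical; no real model API is called, and the prompts actually used by LeMAJ are not shown in this section.

```python
import re
from typing import Callable


def build_prompt(document: str, question: str, answer: str) -> str:
    """Assemble a segmentation prompt (hypothetical wording)."""
    return (
        "Split the ANSWER into self-contained legal assertions, "
        "one per numbered line.\n\n"
        f"DOCUMENT:\n{document}\n\n"
        f"QUESTION:\n{question}\n\n"
        f"ANSWER:\n{answer}\n"
    )


# Matches lines such as "1. text" or "2) text" and captures the text.
_NUMBERED = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")


def parse_ldps(raw: str) -> list[str]:
    """Keep only the numbered lines of the model output, without the numbering."""
    ldps = []
    for line in raw.splitlines():
        match = _NUMBERED.match(line)
        if match:
            ldps.append(match.group(1))
    return ldps


def extract_ldps(
    document: str, question: str, answer: str, llm: Callable[[str], str]
) -> list[str]:
    """Run the injected model on the prompt and parse its reply into LDP strings."""
    return parse_ldps(llm(build_prompt(document, question, answer)))
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider and makes the parser testable with a stub.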

2.2 Enabling Reference-Free Evaluation

LDPs allow LeMAJ to implement reference-free evaluation by:

  • Emulating the evaluation strategy of legal experts, who decompose answers into atomic assertions for judgment of correctness, relevance, and omissions.
  • Removing dependence on gold-standard reference answers, which are expensive for qualified human annotators to produce.
  • Offering an explicit penalty for omissions (via Missing LDPs) without requiring human-authored ground truth.

This approach aligns LLM evaluation methodology with established legal expert practice and increases scalability.

3. Algorithmic Evaluation Workflow

LeMAJ’s reference-free evaluation methodology is structured as follows:

  1. Segmentation: Extract LDPs from $A$ using an LLM segmenter.
  2. Classification: Tag each $p_i$ with respect to $D$ and $Q$ (via LLM chain-of-thought prompts) as Correct, Incorrect, Irrelevant, or Missing.
  3. Metric Calculation: Compute cardinalities for each class:
    • $|\text{Correct}|$
    • $|\text{Incorrect}|$
    • $|\text{Irrelevant}|$
    • $|\text{Missing}|$
  4. Scoring: Derive Correctness, Precision, Recall, and $F_1$ from these counts.

This algorithm approximates the sequence of reasoning and annotation performed by professional lawyers in assessment tasks.
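
The first three workflow steps can be sketched as a small orchestration function. This is an illustration under stated assumptions: `segment`, `classify`, and `find_missing` are hypothetical injected callables standing in for LLM calls, and producing Missing points through a separate `find_missing` step is an assumption of this sketch, since a missing assertion cannot be obtained by segmenting the answer itself.

```python
from collections import Counter
from typing import Callable

LABELS = ("Correct", "Incorrect", "Irrelevant", "Missing")


def evaluate(
    document: str,
    question: str,
    answer: str,
    segment: Callable[[str], list[str]],
    classify: Callable[[str, str, str], str],
    find_missing: Callable[[str, str, str], list[str]],
) -> Counter:
    """Segment the answer, tag each LDP, and return per-class counts."""
    counts = Counter({label: 0 for label in LABELS})
    for ldp in segment(answer):
        label = classify(ldp, document, question)
        if label not in ("Correct", "Incorrect", "Irrelevant"):
            raise ValueError(f"unexpected label for extracted LDP: {label!r}")
        counts[label] += 1
    # Assumption: omissions are detected by a separate pass over (A, D, Q).
    counts["Missing"] += len(find_missing(answer, document, question))
    return counts
```

The returned counts are the class cardinalities from step 3; any scoring function can then be applied to them.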

4. Empirical Performance and Comparative Results

4.1 Baseline Methods

LeMAJ is evaluated against two categories of baseline methods:

  • Reference-based (Non-LLM) Metrics: BLEU-1–4, ROUGE-1/2/L, BERTScore, BARTScore.
  • Reference-free LLM-as-Judge Methods: DeepEval Answer Relevancy, DeepEval Faithfulness, DeepEval Correctness (G-Eval), DeepEval Hallucination.

4.2 Evaluation Datasets and Protocol

  • Proprietary Dataset: 1,000 Q&A pairs from nine contract types.
  • LegalBench Subset: 150 expert-curated Q&A pairs plus 20 synthetic incorrect answers (170 total).
  • LLMs: Claude 3.5 Sonnet v1 for answer generation, Claude 3.5 Sonnet v2 for evaluation.
  • Human Ground Truth: Two lawyers provided gold-standard ratings for Correctness and Relevance on a 5-point scale, mapped to $[0, 1]$.

4.3 Quantitative Results

Relevance Correlation (Pearson’s ρ) and Bucketed Accuracy

| Method | ρ / p-value (Proprietary) | Bucketed Accuracy (Proprietary / LegalBench) |
| --- | --- | --- |
| BLEU-1 | 0.049 / 0.152 | 0.05 / 0.08 |
| ROUGE-2 | 0.133 / 8.7·10⁻⁵ | 0.03 / 0.09 |
| BERTScore | 0.174 / 2.5·10⁻⁷ | 0.02 / 0.05 |
| DeepEval: Answer Relevancy | 0.000 / 0.992 | 0.37 / 0.45 |
| LeMAJ ($F_1$) | 0.370 / 1.5·10⁻²⁹ | 0.50 / 0.35 |

Correctness Correlation (Pearson’s ρ) and Bucketed Accuracy

| Method | ρ / p-value (Proprietary) | Bucketed Accuracy (Proprietary / LegalBench) |
| --- | --- | --- |
| BLEU-4 | 0.090 / 7.8·10⁻³ | 0.02 / 0.08 |
| ROUGE-1 | 0.139 / 3.9·10⁻⁵ | 0.01 / 0.05 |
| DeepEval: Correctness | 0.077 / 2.4·10⁻² | 0.43 / 0.24 |
| DeepEval: Hallucination | 0.080 / 1.8·10⁻² | 0.04 / 0.14 |
| LeMAJ (Correctness) | 0.259 / 7.5·10⁻¹⁵ | 0.95 / 0.88 |

LeMAJ attains substantially higher correlation with human expert judgments than both reference-based and LLM-as-Judge baselines, and higher bucketed accuracy in every reported comparison except one relevance figure (0.35 against 0.45 for DeepEval Answer Relevancy), without recourse to reference answers (Enguehard et al., 8 Oct 2025).
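
For readers who want to recompute figures of this kind, the following sketch implements Pearson's correlation and one possible reading of bucketed accuracy. The bucketing rule is an assumption: both the metric and the human score are taken to lie in [0, 1], each is placed in one of `k` equal-width buckets, and accuracy is the fraction of items landing in the same bucket. The source does not spell out its bucketing scheme, so `bucket` and `bucketed_accuracy` are hypothetical helpers.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


def bucket(score: float, k: int = 5) -> int:
    """Index of the equal-width bucket on [0, 1] (ASSUMED scheme); 1.0 goes in the last one."""
    return min(int(score * k), k - 1)


def bucketed_accuracy(pred: list[float], human: list[float], k: int = 5) -> float:
    """Fraction of items whose predicted and human scores share a bucket."""
    if len(pred) != len(human) or not pred:
        raise ValueError("need two non-empty samples of equal length")
    hits = sum(bucket(p, k) == bucket(h, k) for p, h in zip(pred, human))
    return hits / len(pred)
```

Correlation rewards getting the ordering of answers right, while bucketed accuracy rewards landing near the human score in absolute terms, which is why the two columns in the tables can disagree.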

5. Correlation with Human Judgments

LeMAJ’s scores show a statistically significant positive correlation with expert legal annotators, ranging from modest (0.259) to strong (0.700):

  • Relevance (Pearson’s ρ):
    • Proprietary: 0.370 (p = 1.5·10⁻²⁹)
    • LegalBench: 0.354
  • Correctness (Pearson’s ρ):
    • Proprietary: 0.259 (p = 7.5·10⁻¹⁵)
    • LegalBench: 0.700

The reported p-values are far below conventional significance thresholds, indicating that the alignment of LeMAJ metrics with professional judgments is unlikely to be due to chance.

6. Inter-Annotator Agreement Enhancement

Manual annotation and LeMAJ-guided annotation were compared for inter-annotator agreement (IAA), measured as Cohen’s κ and reported as mean agreement percentages:

| Judgment | Manual IAA | LeMAJ-guided IAA | Δ (pp) |
| --- | --- | --- | --- |
| Correctness | 0.77 | 0.88 | +11 |
| Relevance | 0.53 | 0.54 | +1 |

The increase in Correctness IAA reflects that LDP-guided segmentation and standardized tagging reduce subjectivity and variability in factuality judgments. Minimal change in Relevance IAA points to the inherent subjectivity of legal relevance annotation.
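
As a reference for the agreement statistic, here is a short sketch of Cohen's κ for two annotators labeling the same items, using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohens_kappa` and the example labels are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["C", "C", "I", "I"], ["C", "C", "I", "C"])` has observed agreement 0.75 and chance agreement 0.5, giving κ = 0.5.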

7. Open-Source LegalBench Subset and Reproducibility

7.1 Released Dataset

A subset of LegalBench has been made available, containing:

  • 12 contract types, 150 Q&A pairs, and 20 synthetic incorrect answers.
  • Each sample encodes: contract text $D$, question $Q$, LLM-generated answer $A$, an array of LDPs (with text, character span, and tag), and human scores (correctness and relevance, on a $[0, 1]$ scale).

7.2 Format and Licensing

The data is distributed as JSONL (one object per Q&A), under an MIT license. Public repository: https://github.com/robin-ai/LeMAJ-LegalBench-subset

7.3 Experimental Replication

  • Load the JSONL file.
  • Use the LeMAJ_Evaluate protocol for each $(D, Q, A)$ triple; compare extracted LDPs and tags to the released data.
  • Compute Pearson’s ρ and bucketed accuracy against human_score.
  • Optionally, employ the provided UI for human re-annotation and inter-annotator agreement experiments.

The released codebase and evaluation harness enable exact replication of the reported results and facilitate extensibility to other tasks or evaluation modalities, such as integrating alternate reasoning schemas or multi-agent jury setups.
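
A minimal sketch of the first replication step, reading JSONL records and tallying LDP tags per sample. Apart from `human_score`, which is named above, the field names used here (`ldps`, `tag`) are assumptions about the file layout and should be checked against the released data; `read_jsonl` and `tag_counts` are hypothetical helpers.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def tag_counts(record: dict) -> Counter:
    """Count LDP tags in one record (ASSUMED keys: 'ldps' and 'tag')."""
    return Counter(ldp["tag"] for ldp in record.get("ldps", []))
```

Because `read_jsonl` accepts any iterable of lines, it can be fed an open file handle, for instance `with open(path, encoding="utf-8") as fh: records = list(read_jsonl(fh))`, where `path` points at the downloaded file.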

References

LeMAJ: “LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation” (Enguehard et al., 8 Oct 2025).
