TreeReview: Granular Evaluation in Legal QA
- TreeReview is a methodology that decomposes lengthy legal QA responses into discrete Legal Data Points (LDPs) for atomic, assertion-level evaluation.
- It systematically classifies LDPs into categories—correct, incorrect, irrelevant, and missing—to improve error attribution and audit transparency.
- By aligning automated scoring with expert review, TreeReview enhances correlation with human judgment and supports efficient legal QA triage.
TreeReview is a methodology designed for the granular, reference-free evaluation of LLM outputs in high-stakes legal question-answering. It centers on decomposing lengthy LLM-generated responses into discrete, self-contained units of information called Legal Data Points (LDPs) and applies systematic classification and scoring procedures that align with the review processes of expert practitioners. The approach addresses the limitations of both reference-based metrics—which rely on costly, manually generated gold standards—and monolithic “LLM-as-a-Judge” (LLM-Judge) methods, by enabling atomic, assertion-level auditability, fine-grained error attribution, and interpretable score computation without requiring reference answers (Enguehard et al., 8 Oct 2025).
1. Motivation and Evaluation Challenges in Legal QA
Evaluation of LLMs in the legal domain presents unique constraints, stemming from the domain’s requirements for precise factual accuracy, exhaustive coverage of legally material issues, and high interpretability for practitioners. Legal question-answering systems must avoid factual misstatements (to prevent legal malpractice), ensure no omissions of legally relevant points, and provide outputs that can be easily audited by human experts. Traditional reference-based metrics (e.g., ROUGE, BLEU, BERTScore) partly address factuality but depend on expensive, non-scalable gold answers authored by subject-matter experts. Generic LLM-as-a-Judge approaches, while promising in a reference-free regime, lack the granularity to detect errors of omission and tend to treat each answer as a single, undifferentiated output unit. This results in low reliability, especially in the absence of high-quality references (Enguehard et al., 8 Oct 2025).
2. Core Concepts: Legal Data Points and Scoring Formulation
The central concept in TreeReview is the Legal Data Point (LDP). An LDP is a self-contained assertion or unit of information extracted from an LLM’s response—for example, a single explicit obligation in a contract (“payment due within 45 days”) forms a distinct LDP. LDPs serve as the atomic basis for evaluation, enabling step-by-step, assertion-level assessment by both automated evaluators and human experts.
The evaluation framework distinguishes four categories for each LDP:
- <correct>: Factual and relevant assertions.
- <incorrect>: Hallucinations or factual errors.
- <irrelevant>: Factually correct but extraneous statements.
- <missing>: Critical assertions omitted by the LLM but added by the evaluator as necessary for completeness.
Quantitative metrics (Correctness, Precision, Recall, and F1) are computed from the per-category LDP counts: the number of <correct> LDPs, the number of <incorrect>, the number of <irrelevant>, and the number of <missing> LDPs [(Enguehard et al., 8 Oct 2025), Section 3.2].
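As a rough sketch of how such scores could be derived from the four counts, the Python below tallies LDP tags and computes Correctness, Precision, Recall, and F1. The formulas are assumptions chosen to match the descriptions in this article (Correctness penalizes incorrect LDPs, Precision penalizes irrelevant ones, Recall penalizes missing ones); they are not quoted from the paper, whose exact definitions may differ. The names `Tag`, `ldp_scores`, and `_ratio` are hypothetical.

```python
from collections import Counter
from enum import Enum


class Tag(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    IRRELEVANT = "irrelevant"
    MISSING = "missing"


def _ratio(num: float, den: float) -> float:
    # Convention (an assumption): an empty denominator yields 0.0.
    return num / den if den else 0.0


def ldp_scores(tags: list[Tag]) -> dict[str, float]:
    """Compute scores from LDP tags under ASSUMED metric definitions."""
    n = Counter(tags)
    c, inc = n[Tag.CORRECT], n[Tag.INCORRECT]
    irr, miss = n[Tag.IRRELEVANT], n[Tag.MISSING]

    # Assumed: share of LDPs present in the answer that are factually right
    # (irrelevant LDPs are factually correct, so they count in the numerator).
    correctness = _ratio(c + irr, c + inc + irr)
    # Assumed: share of factually right LDPs that are also relevant.
    precision = _ratio(c, c + irr)
    # Assumed: share of needed LDPs that the answer actually contains.
    recall = _ratio(c, c + miss)
    # Harmonic mean of precision and recall.
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "correctness": correctness,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


example_tags = (
    [Tag.CORRECT] * 6 + [Tag.INCORRECT] + [Tag.IRRELEVANT] * 2 + [Tag.MISSING] * 2
)
example_scores = ldp_scores(example_tags)
```

With 6 correct, 1 incorrect, 2 irrelevant, and 2 missing LDPs, these assumed definitions give a Correctness of 8/9 and Precision, Recall, and F1 of 0.75 each.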
3. Reference-Free Evaluation Workflow
TreeReview operationalizes reference-free evaluation through a four-stage pipeline:
- LLM-based Segmentation: The LLM receives a prompt to decompose the answer into atomic LDPs via assertion segmentation.
- LDP Classification: Each LDP is tagged as <correct>, <incorrect>, or <irrelevant>. Missing but critical assertions, absent from the original answer, are appended as <missing>.
- Score Computation: The metrics outlined above (Correctness, Precision, Recall, F1) are calculated from the LDP tags.
- Human Annotation (Optional): A dedicated UI displays LDP segmentation and tagging, allowing experts to validate, override, or confirm LLM classifications. This step has been shown to improve inter-annotator agreement [(Enguehard et al., 8 Oct 2025), Figure 1].
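The segmentation, classification, and optional human-override stages above can be sketched as a small Python function (score computation is left out here). This is an illustration only: the `segment`, `classify`, and `find_missing` callables stand in for LLM prompts whose wording and inputs are not given in this article, and every name and signature below is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

VALID_TAGS = {"correct", "incorrect", "irrelevant", "missing"}


@dataclass
class LDP:
    text: str
    tag: str


def review_answer(
    question: str,
    answer: str,
    segment: Callable[[str], list[str]],
    classify: Callable[[str, str], str],
    find_missing: Callable[[str, str], list[str]],
    overrides: Optional[dict[int, str]] = None,
) -> list[LDP]:
    # Segmentation and classification: tag each atomic assertion.
    ldps = [LDP(text, classify(question, text)) for text in segment(answer)]
    # Append critical points that the answer omitted.
    ldps += [LDP(text, "missing") for text in find_missing(question, answer)]
    # Optional human annotation: an expert overrides tags by LDP index.
    for index, tag in (overrides or {}).items():
        ldps[index].tag = tag
    for ldp in ldps:
        if ldp.tag not in VALID_TAGS:
            raise ValueError(f"unknown tag: {ldp.tag!r}")
    return ldps


# Toy stand-ins for the LLM calls (assumptions, for demonstration only).
def toy_segment(answer: str) -> list[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_classify(question: str, ldp_text: str) -> str:
    return "irrelevant" if "weather" in ldp_text else "incorrect"


def toy_find_missing(question: str, answer: str) -> list[str]:
    return ["Late payments accrue interest"]


result = review_answer(
    "When is payment due?",
    "Payment is due within 45 days. The weather was mild.",
    toy_segment,
    toy_classify,
    toy_find_missing,
    overrides={0: "correct"},  # the expert corrects the first tag
)
```

Passing the LLM-backed steps in as callables keeps the control flow testable without a model, and the override map mirrors the idea that experts validate or correct individual LDP tags rather than re-grading the whole answer.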
This framework obviates the need for direct references, instead reflecting the granular review practices adopted by legal professionals.
4. Datasets, Annotation, and Experimental Protocols
Empirical evaluation has primarily utilized a purpose-built proprietary dataset and a curated subset of LegalBench:
Proprietary Dataset
- Comprises 9 contract types (MSAs, SPAs, NDAs, leases, LPAs, etc.), 5 documents per type, ~1,000 QA pairs.
- Each question is authored by legal experts, with gold answers available for select experiments.
- Standard train/test splits were applied (e.g., Lease Agreements 60/40).
LegalBench Subset
- Drawn from Guha et al. (2023), includes a random subset of 12 contract categories, 170 QA examples (150 original, 20 partial/incorrect).
- Focuses on mid-to-high complexity scenarios (e.g., competitive restrictions).
- LLM answers were generated with Claude 3.5 v1, chosen for response diversity and to avoid answer bias.
- Human rating maps expert assessments of Correctness and Relevance onto a normalized five-point scale, later bucketed to {0, 0.25, 0.5, 0.75, 1} [(Enguehard et al., 8 Oct 2025), App B].
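The bucketing step can be illustrated with a short Python sketch. The nearest-bucket rule and the tie-breaking toward the lower bucket are assumptions, since the article does not state how scores are assigned to buckets; `BUCKETS` and `to_bucket` are hypothetical names.

```python
BUCKETS = (0.0, 0.25, 0.5, 0.75, 1.0)


def to_bucket(score: float) -> float:
    """Map a score in [0, 1] to the nearest of the five buckets.

    Assumption: ties go to the lower bucket, because min() returns the
    first minimal element of the ascending BUCKETS tuple.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return min(BUCKETS, key=lambda bucket: abs(bucket - score))


bucketed = [to_bucket(s) for s in (0.1, 0.5, 0.8, 0.9)]
```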
5. Comparative Analysis with Existing Evaluation Methods
TreeReview, as operationalized within LeMAJ, demonstrates substantive improvements over both reference-based and reference-free baseline metrics:
| Metric | Proprietary: Relevance | LegalBench: Relevance | Proprietary: Correctness | LegalBench: Correctness |
|---|---|---|---|---|
| LeMAJ (TreeReview) | 0.370 | 0.354 | 0.259 | 0.700 |
| Best Baseline | 0.174 (BERTScore) | 0.248 (BLEU-4/ROUGE-2) | 0.164 (BERTScore) | 0.203 (ROUGE-2) |
Notably, LeMAJ achieves up to 2–3× higher Pearson correlation with expert gold labels than any baseline. In bucketed accuracy, TreeReview delivers large improvements (e.g., Correctness: 0.95 [LeMAJ] vs. 0.43 [DeepEval-Correctness] for proprietary data; 0.88 vs. 0.08 on LegalBench) [(Enguehard et al., 8 Oct 2025), Tables 2–3].
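For readers who want to reproduce this kind of comparison on their own data, the two agreement measures can be sketched in plain Python. The Pearson formula is standard; the reading of "bucketed accuracy" as the share of examples whose automatic bucket equals the expert bucket is an assumption, and the function names are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between metric scores and expert labels."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def bucketed_accuracy(predicted: list[float], expert: list[float]) -> float:
    """Assumed definition: fraction of exact bucket matches."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum(p == e for p, e in zip(predicted, expert)) / len(expert)


corr = pearson([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
acc = bucketed_accuracy([1.0, 0.75, 0.5, 1.0], [1.0, 0.75, 0.25, 1.0])
```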
A key driver of these gains is the atomic, assertion-level granularity: direct penalization of hallucinations (Correctness), explicit identification of irrelevancies (Precision), and explicit recall of omitted but legally critical points (Recall). Existing monolithic metrics entirely miss omissions, a failure mode addressed directly in TreeReview.
6. Expert Alignment, Inter-Annotator Agreement, and Transparency
TreeReview’s LDP-based design results in metrics that are significantly more aligned with subject-matter expert judgments than previous methods. The system’s UI allows for transparent, audit-traceable annotation and improves inter-annotator agreement (IAA). For instance, Correctness IAA (Cohen’s κ) on 150 LegalBench QA pairs increases from 0.77 (manual) to 0.88, an absolute gain of 0.11, when using the TreeReview-enabled annotation interface. For Relevance, IAA remains low due to intrinsic subjectivity, but the LDP framework still provides granular visibility into evaluative discrepancies [(Enguehard et al., 8 Oct 2025), Table 4]. This suggests TreeReview is particularly effective for error localization and audit support in high-stakes legal contexts.
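Cohen's kappa, the agreement statistic used here, corrects raw agreement between two annotators for the agreement expected by chance. A minimal Python sketch of the standard two-annotator formula follows; the function name and the toy labels are illustrative and not taken from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

In this toy case the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.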
7. Applications, Open Resources, and Broader Implications
TreeReview, via the LeMAJ framework, supports various downstream use cases, including:
- Automated triage of legal QA responses: For example, thresholding at Correctness = 1 and Relevance ≥ 0.80–0.85 allows 30–50% of answers to be safely triaged without further human review, saving up to 50% of expert time within commercial workflows [(Enguehard et al., 8 Oct 2025), Section 5].
- Open-source community benchmarking: Annotations for the LegalBench subset (12 contracts, 150 QA pairs) have been released, promoting reproducibility and future research.
- Extension to other domains requiring assertion-level, compositional review: The LDP-centric approach may generalize to evaluation tasks in other regulated or detail-critical settings, though further investigation is warranted.
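The triage rule from the first item above can be sketched as a simple threshold filter. The default Relevance threshold of 0.80 is the lower end of the range quoted there; the function names and the tuple layout are hypothetical, and a production workflow would likely calibrate the threshold on held-out expert labels.

```python
def can_skip_review(
    correctness: float, relevance: float, relevance_threshold: float = 0.80
) -> bool:
    """True when an answer meets the triage rule: Correctness = 1 and
    Relevance at or above the chosen threshold."""
    return correctness == 1.0 and relevance >= relevance_threshold


def triage(
    scored: list[tuple[str, float, float]], relevance_threshold: float = 0.80
) -> tuple[list[str], list[str]]:
    """Split (answer_id, correctness, relevance) rows into auto-accepted
    answers and answers routed to an expert."""
    auto_accepted, needs_expert = [], []
    for answer_id, correctness, relevance in scored:
        if can_skip_review(correctness, relevance, relevance_threshold):
            auto_accepted.append(answer_id)
        else:
            needs_expert.append(answer_id)
    return auto_accepted, needs_expert


auto_accepted, needs_expert = triage(
    [("q1", 1.0, 0.90), ("q2", 1.0, 0.70), ("q3", 0.8, 0.95)]
)
```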
The TreeReview approach exemplifies how aligning automated evaluation procedures with practitioner-centric review protocols can deliver interpretable, reliable, and actionable performance assessment for complex domains such as law.