CLaw Benchmark: Evaluating Legal LLMs
- CLaw is a comprehensive testbed for legal-domain evaluation, assessing LLMs' recall of Chinese legal statutes and their case-based reasoning.
- The corpus includes 64,849 segmented entries from 306 national statutes, enhanced with temporal versioning for precise retrieval assessment.
- Empirical evaluations show top LLMs scoring up to 84.77 overall, highlighting both strengths in reasoning and challenges in statutory recall.
The name "CLaw Benchmark" has been attached to more than one evaluation effort on legal knowledge acquisition, statutory recall, and legal reasoning. This article focuses on CLaw as introduced in "CLaw: Benchmarking Chinese Legal Knowledge in LLMs - A Fine-grained Corpus and Reasoning Analysis" (Xu et al., 25 Sep 2025), while noting that the name also appeared in earlier robotics benchmarking work such as DeepClaw (Wan et al., 2020). The CLaw framework provides the first large-scale, fine-grained testbed for statutory knowledge and case-based reasoning in Chinese law, advancing the measurement and analysis of legal-domain LLMs in both retrieval and reasoning contexts.
1. Corpus Construction and Structure
CLaw’s corpus is built to provide exhaustive coverage of Chinese national statutes and their historical versions, segmented to maximize the granularity of statutory evaluation. The design is oriented toward rigorous, temporally-aware retrieval assessment and fine-grained legal reasoning.
- Statutory Coverage: The dataset comprises all 306 national statutes promulgated by Presidential Order in China. Each statute includes multiple revision timesteps, totaling 424 distinct versions.
- Hierarchical Segmentation: Statutes are parsed to the subparagraph level. The hierarchy is:
- law-name → version-date → article → paragraph → subparagraph.
- If an article lacks paragraphs/subparagraphs, it defaults to a one-block structure.
- Dataset Size: The finalized corpus includes 64,849 entries:
- Articles: 8,712
- Paragraphs: 19,343
- Subparagraphs: 36,794
- Temporal Awareness: Each statute entry is tagged with its revision date, allowing for temporally-correct evaluation (e.g., as of a particular promulgation).
The following table summarizes the corpus statistics:
| Statutes | Versions | Articles | Total Entries |
|---|---|---|---|
| 306 | 424 | 8,712 | 64,849 |
The explicit inclusion of versioning enables systematic study of LLMs’ capacity to track legal changes and engage in time-sensitive recall tasks (Xu et al., 25 Sep 2025).
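To make the segmentation and versioning concrete, the sketch below shows one way such an entry might be represented and queried as of a given date. The field names and the `resolve_provision` helper are illustrative assumptions, not the released corpus schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Illustrative schema only; field names are assumptions, not the released CLaw format.
@dataclass
class StatuteEntry:
    law_name: str          # e.g. the statute's official title
    version_date: date     # promulgation/revision date of this version
    article: int           # article number
    paragraph: Optional[int] = None     # None if the article has no paragraphs
    subparagraph: Optional[int] = None  # None if the paragraph has no subparagraphs
    text: str = ""

def resolve_provision(entries: list[StatuteEntry],
                      law_name: str,
                      article: int,
                      as_of: date) -> list[StatuteEntry]:
    """Return the requested article's entries from the latest version whose
    revision date does not postdate `as_of` (temporally-correct lookup)."""
    versions = {e.version_date for e in entries
                if e.law_name == law_name and e.version_date <= as_of}
    if not versions:
        return []
    effective = max(versions)
    return [e for e in entries
            if e.law_name == law_name
            and e.version_date == effective
            and e.article == article]
```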
2. Case-Based Reasoning Dataset
CLaw's second component tests legal reasoning via real-world Chinese Supreme People’s Court Guiding Cases.
- Case Set: 254 instances, each derived from the "Guiding Cases" corpus (2 cases removed due to lack of legal effect).
- Annotation Structure:
- Case Details
- Court Rulings (multiple judicial levels)
- Dispute Focus
- Judicial Reasoning
- Cited Legal Articles (down to the subparagraph)
- Prompt Engineering: Focus questions are rephrased (using GPT-4o) so that LLMs must explicitly identify the core legal issue and cite relevant statutes with temporal and structural precision.
- Evaluation: Two LLMs (Gemini-2.5-Pro, DeepSeek-R1) score system outputs on five axes: Rigor of Reasoning, Accuracy of Knowledge, Logical Structure, Clarity, and Conciseness, on a 0–20 scale. An overall score in [0,100] is computed per response.
The reasoning dataset is explicitly constructed to stress LLM capabilities in issue spotting, statutory application, analogy, and legal argument structure (Xu et al., 25 Sep 2025).
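A minimal sketch of the scoring scheme described above follows: it aggregates the five 0–20 rubric scores into an overall rating in [0, 100]. The unweighted sum is an assumption, since the paper's exact synthesis of rubric scores is not specified here.

```python
RUBRIC_AXES = ("reasoning", "knowledge", "structure", "clarity", "conciseness")

def overall_rating(scores: dict[str, float]) -> float:
    """Combine five 0-20 rubric scores into an overall rating in [0, 100].
    Unweighted summation is an assumption; CLaw may synthesize axes differently."""
    for axis in RUBRIC_AXES:
        if not 0 <= scores[axis] <= 20:
            raise ValueError(f"{axis} score must lie in [0, 20]")
    return sum(scores[axis] for axis in RUBRIC_AXES)

# Hypothetical per-axis scores for a single response.
example = {"reasoning": 15.0, "knowledge": 16.5, "structure": 18.0,
           "clarity": 17.0, "conciseness": 19.0}
print(overall_rating(example))  # 85.5
```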
3. Evaluation Protocols and Metrics
CLaw's evaluation setup quantifies both exact recall and semantic similarity, with separate protocols for retrieval and case reasoning.
3.1 Statutory ID Retrieval
- Hierarchical Accuracy: For a test set of $N$ questions, per-level accuracy is computed as
  $$\mathrm{Acc}_{\ell} = \frac{C_{\ell}}{N}, \qquad \ell \in \{\text{article},\ \text{paragraph},\ \text{subparagraph}\},$$
  where $C_{\text{article}}$, $C_{\text{paragraph}}$, and $C_{\text{subparagraph}}$ are the counts of fully correct matches at each level (a minimal code sketch follows).
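The sketch below computes this per-level accuracy; the `(article, paragraph, subparagraph)` tuple format and the cascaded matching (a level counts only if all coarser levels also match) are illustrative assumptions.

```python
def hierarchical_accuracy(predictions, references):
    """Per-level accuracy for statutory ID retrieval.

    Each prediction/reference is an (article, paragraph, subparagraph) tuple;
    a level is counted as correct only if every field up to and including
    that level matches exactly (cascaded matching is an assumption).
    """
    n = len(references)
    correct = {"article": 0, "paragraph": 0, "subparagraph": 0}
    for pred, gold in zip(predictions, references):
        if pred[0] == gold[0]:
            correct["article"] += 1
            if pred[1] == gold[1]:
                correct["paragraph"] += 1
                if pred[2] == gold[2]:
                    correct["subparagraph"] += 1
    return {level: c / n for level, c in correct.items()}

# Example: one fully correct answer, one correct only at the article level.
print(hierarchical_accuracy([(26, 1, 2), (10, 3, 1)],
                            [(26, 1, 2), (10, 2, 1)]))
# {'article': 1.0, 'paragraph': 0.5, 'subparagraph': 0.5}
```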
3.2 Content Retrieval
The metric suite includes the following (a minimal sketch of two of these metrics follows the list):
- ROUGE-N, ROUGE-L for n-gram and longest common subsequence overlap.
- BLEU (up to 4-gram) to assess n-gram-level fidelity.
- Levenshtein Edit Distance (normalized) for sequence similarity.
- BERTScore F1 to estimate semantic similarity.
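As an illustration, the sketch below implements normalized Levenshtein similarity and a ROUGE-1-style unigram F1 from scratch. Character-level tokenization is an assumption (reasonable for Chinese statutory text), and a production evaluation would typically rely on established packages for BLEU, ROUGE-L, and BERTScore.

```python
from collections import Counter

def normalized_levenshtein(pred: str, ref: str) -> float:
    """1 - edit_distance / max_len, so 1.0 means an exact match."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return 1.0 - prev[n] / max(m, n)

def rouge1_f1(pred: str, ref: str) -> float:
    """Unigram-overlap F1 over characters (character tokens assumed for Chinese)."""
    p, r = Counter(pred), Counter(ref)
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```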
3.3 Case Reasoning
- LLM-Judge Scores: Multi-dimensional rubric scored 0–20 on each of Reasoning, Knowledge, Structure, Clarity, and Conciseness.
- Overall Rating: Synthesis of the rubric scores, validated against expert human judgements via Pearson correlation, both between LLM and human ratings and between the experts themselves.
This multi-protocol structure enables direct measurement of both verbatim recall and deeper legal reasoning performance (Xu et al., 25 Sep 2025).
4. Empirical Performance and Failure Patterns
CLaw presents a comprehensive empirical analysis of ten contemporary LLMs on knowledge recall and reasoning.
- Statutory Recall:
- Accuracy by level: The best model (Doubao-1.5-Pro) performs strongest at the article level, with accuracy dropping sharply at the paragraph level and further still at the subparagraph level.
- Model Trends: DeepSeek-R1 surpasses others in fine-grained localization, highlighting variation in structure-aware retrieval.
- Failure Modes: Common errors include fabricating statutory text, incorrect version retrieval, and blanket refusals (“I cannot access…”).
- Content Recall:
- Doubao-1.5-Pro achieves the highest BERTScore F1 and a BLEU score nearly twice that of the runner-up; global models such as GPT-4o lag substantially on BLEU.
- Global models sometimes paraphrase statutes reasonably well, as reflected in their BERTScore, but do not reproduce them verbatim.
- Reasoning Scores:
- Gemini-2.5-Pro leads with 84.77, followed by DeepSeek-R1 at 83.30, with the highest scores on reasoning and knowledge.
- The Pearson correlation between recall accuracy and reasoning score is 0.61 (article), 0.70 (paragraph), and 0.68 (subparagraph): high retrieval accuracy correlates with, but does not guarantee, better reasoning.
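This correlation analysis amounts to a standard Pearson computation over per-model (recall accuracy, reasoning score) pairs; the sketch below uses hypothetical numbers, not the paper's per-model figures.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model article-level recall vs. overall reasoning score.
recall = [0.72, 0.55, 0.61, 0.48, 0.80]
reasoning = [84.8, 65.8, 70.2, 54.9, 83.3]
print(round(pearson_r(recall, reasoning), 2))
```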
The following table presents representative mean scores (Gemini-judge rating):
| Model | Overall | Reasoning | Knowledge | Structure | Clarity | Concise |
|---|---|---|---|---|---|---|
| o1 | 54.87 | 8.90 | 9.81 | 14.26 | 14.93 | 18.56 |
| GPT-4o | 65.81 | 11.37 | 13.55 | 16.45 | 17.30 | 19.72 |
| Gemini-2.5-Pro | 84.77 | 19.00 | 18.48 | 19.95 | 19.70 | 19.50 |
Accurate statutory recall is a significant but not sufficient condition for excellent legal reasoning (Xu et al., 25 Sep 2025).
5. Failure Analysis and Recommendations
Analysis of error patterns shows the principal failure modes for LLMs:
- Citations: Hallucination of provisions, outdated or mis-numbered statute references.
- Incomplete Application: Correct statutory retrieval without adequate linkage to case facts.
- Logical Structure: Superficial analogizing, lack of major/minor premise elaboration, and gaps in legal argumentation.
- Statutory Evolution: LLMs struggle with legislative changes, yielding outdated or mismatched responses where precise version tracking is essential.
Recommendations for improvement include:
- Supervised Fine-Tuning (SFT): Incorporate the full fine-grained CLaw corpus to enable granular structural mastery.
- Retrieval-Augmented Generation (RAG): Introduce citation verification and temporal control in prompt design and inference.
- Pretraining Objectives: Emphasize document structure encoding and dynamic version tracking in model objectives.
- Advanced Evaluation: Move beyond single-issue cases to drafting, predictive judgments, and complex disputation scenarios to measure real-world efficacy.
- Source Vetting in RAG: Improve statutory source filtering and credibility assessments to minimize hallucinations and version mismatches.
A plausible implication is that without robust retrieval and up-to-date statute modeling, even SOTA LLMs will remain unreliable in domains with critical temporal and document structure constraints (Xu et al., 25 Sep 2025).
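The retrieval-augmentation, temporal-control, and source-vetting recommendations above can be made concrete with a small sketch. Everything here (the candidate dictionary fields, the `effective_date` metadata, the allow-list of domains) is an illustrative assumption, not an interface defined by CLaw.

```python
from datetime import date

# Hypothetical allow-list of official statute sources; names are assumptions.
TRUSTED_SOURCES = {"npc.gov.cn", "flk.npc.gov.cn"}

def retrieve_provisions(candidates: list[dict], query_date: date, top_k: int = 5) -> list[dict]:
    """Filter retrieved statute passages by source credibility and temporal validity,
    keeping only versions in force on `query_date`, then rank by retriever score."""
    valid = [
        c for c in candidates
        if c["source_domain"] in TRUSTED_SOURCES                # source vetting
        and c["effective_date"] <= query_date                   # temporal control
        and (c.get("repealed_date") is None or c["repealed_date"] > query_date)
    ]
    return sorted(valid, key=lambda c: c["retriever_score"], reverse=True)[:top_k]

def build_prompt(question: str, provisions: list[dict]) -> str:
    """Assemble a prompt that demands explicit, dated citations the output can be checked against."""
    cited = "\n".join(
        f"[{p['law_name']} ({p['effective_date']}) Art. {p['article_id']}] {p['text']}"
        for p in provisions
    )
    return f"{cited}\n\nQuestion: {question}\nCite the exact article, paragraph, and version used."
```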
6. Impact, Limitations, and Future Directions
CLaw establishes a new empirical foundation for LLM research in legal reasoning, but exposes systemic weaknesses in general-purpose models:
- Impact: Provides a powerful open dataset (64,849 entries, 254 annotated cases) for evaluating statutory retrieval and legal reasoning, spurring further research in domain-specific LLM adaptation.
- Limitations: Statute recall remains substantially below legal practice requirements, especially at subparagraph fidelity. Systematic handling of statutory evolution, multi-issue reasoning, and nuanced drafting tasks is unaddressed.
- Future Work: Recommended directions include closed-loop RAG with version grounding, semi-automatic expansion to other legal systems, curriculum learning on complex legal arguments, and evaluation on predictive and multi-step legal analysis.
CLaw defines a rigorous, reproducible, and evolving standard for evaluating and improving LLMs within high-stakes, structure-rich professional domains such as law (Xu et al., 25 Sep 2025).