Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Published 26 Apr 2026 in cs.AI | (2604.23730v1)

Abstract: LLMs have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to the best of our knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free-text format. Our key contribution is the manual evaluation of LLM-generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.


Authors (3)

Summary

The paper introduces a novel dataset and expert evaluation protocol to assess LLM performance on complex open-ended legal reasoning tasks in a Japanese bar exam context.
It reveals that state-of-the-art LLMs achieve low expert scores—with only 9% reaching an 'Adequate' level—and exhibit high rates of hallucinated legal citations.
Methodological analysis shows that in-context learning augmented with statutory excerpts (FS+Law) reduces hallucinations but still falls short of the standard of rigorous legal reasoning the exam requires.

Expert Evaluation of LLMs on Open-Ended Legal Reasoning in the Japanese Bar Exam

Motivation and Background

LLMs have achieved high performance on legal NLP benchmarks, predominantly on tasks involving classification, entailment, or short-form retrieval. However, such benchmarks inadequately assess LLM capabilities in authentic legal workflows that require complex, open-ended, structured reasoning. In the Japanese jurisdiction, the writing component of the bar examination is especially rigorous: it demands multifaceted legal analysis, issue spotting, application of statutes and precedents, and structured argumentation, an evaluative context previously neglected in LLM research.

This study introduces the first dataset specifically constructed to scrutinize LLM-generated open-ended legal arguments on real Japanese bar exam questions. Unlike multiple-choice evaluations, the focus here is on narrative legal reasoning, with expert legal professionals providing granular manual assessments and hallucination checks. The work directly addresses the need for domain-grounded and realistic evaluation of LLMs in legal praxis.

Dataset Construction and Annotation Protocol

The dataset encompasses all subjects from the writing section of the Japanese National Bar Examination (2017–2023), including core areas such as Civil, Commercial, Penal, Constitutional, and Administrative Law, as well as procedural codes. Each record comprises the full text of the exam prompt, LLM-generated responses, and detailed statutory references. Statutory inputs were manually curated to control for legal knowledge scope and maintain alignment with the exam's temporal legal context.

Expert evaluation is provided by legal faculty, with each subject evaluated by a domain specialist. Assessments use both categorical grades (Excellent, Good, Adequate, Poor) and a continuous 0-100 scale, mirroring actual bar exam scoring. Crucially, the annotation includes qualitative comments highlighting errors in factual analysis, statutory referencing, and coherence, as well as a structured hallucination analysis that treats law articles and judicial precedents separately.
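A single annotated record, as described above, can be pictured roughly as follows. This is a minimal sketch: the class and field names (`Record`, `ExpertEvaluation`, `question_text`, and so on) are illustrative assumptions, and the released dataset may use a different schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The four categorical grades named in the text, from best to worst.
GRADES = ("Excellent", "Good", "Adequate", "Poor")


@dataclass
class ExpertEvaluation:
    grade: str          # one of GRADES
    score: int          # continuous scale, 0-100
    comments: str = ""  # free-text qualitative feedback from the expert

    def __post_init__(self) -> None:
        # Reject values outside the grading scheme described in the text.
        if self.grade not in GRADES:
            raise ValueError(f"unknown grade: {self.grade!r}")
        if not 0 <= self.score <= 100:
            raise ValueError("score must lie in [0, 100]")


@dataclass
class Record:
    subject: str                      # e.g. "Civil Law"
    year: int                         # exam year
    question_text: str                # full exam prompt
    statutes: List[str] = field(default_factory=list)  # curated statutory excerpts
    response: str = ""                # LLM-generated answer
    evaluation: Optional[ExpertEvaluation] = None      # expert assessment, if any
```

Validating the grade and score at construction time keeps malformed annotations from silently entering later aggregate statistics.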

Experimental Framework

Three state-of-the-art LLMs—OpenAI's GPT-4o, Anthropic's Claude 3 Opus, and OpenAI's o3—were evaluated under three prompting regimes:

Zero-Shot (ZS): Direct question input only.
Few-Shot (FS): In-context learning using sample questions and model answers.
FS+Law: FS setting augmented with gold statutory excerpts per question.
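The three regimes differ only in what is placed before the question. The sketch below shows one plausible way to assemble them; the function name `build_prompt`, the section labels, and the ordering of examples and statutes are assumptions for illustration, not the paper's actual prompt template.

```python
from typing import List, Sequence, Tuple


def build_prompt(
    question: str,
    setting: str,
    examples: Sequence[Tuple[str, str]] = (),
    statutes: Sequence[str] = (),
) -> str:
    """Assemble the user prompt for one of the settings ZS, FS, or FS+Law."""
    if setting not in ("ZS", "FS", "FS+Law"):
        raise ValueError(f"unknown setting: {setting!r}")

    parts: List[str] = []
    if setting in ("FS", "FS+Law"):
        # Few-shot: prepend sample questions together with model answers.
        for sample_q, sample_a in examples:
            parts.append(f"Sample question:\n{sample_q}\n\nModel answer:\n{sample_a}")
    if setting == "FS+Law":
        # FS+Law: additionally supply the gold statutory excerpts.
        parts.append("Relevant statutes:\n" + "\n".join(statutes))
    # Zero-shot uses the question alone; the other settings end with it.
    parts.append(f"Question:\n{question}")
    return "\n\n".join(parts)
```

Because each setting is a strict superset of the previous one, any difference in scores between settings can be attributed to the added examples or statutes rather than to a change in the question text.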

The experimental corpus comprises 189 answers: 21 questions drawn from three recent exam years, each answered by every model in every setting (21 × 3 × 3 = 189). System prompts explicitly positioned the models as Japanese legal experts and restricted all legal reasoning to the law applicable at the time of the exam.
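The corpus size follows directly from crossing questions, models, and settings. The short sketch below enumerates that grid; the question identifiers (`q01`, `q02`, ...) are placeholders, since the text does not give real ones.

```python
from itertools import product

MODELS = ("GPT-4o", "Claude 3 Opus", "o3")
SETTINGS = ("ZS", "FS", "FS+Law")
N_QUESTIONS = 21  # questions drawn from three recent exam years

# Placeholder question IDs; real identifiers are not given in the text.
question_ids = [f"q{i:02d}" for i in range(1, N_QUESTIONS + 1)]

# One run per (question, model, setting) combination.
runs = list(product(question_ids, MODELS, SETTINGS))
print(len(runs))  # 189
```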

Results: Performance Analysis

Across all subjects and settings, LLM outputs received an average expert score of 25.6/100, with only 9% of responses reaching an “Adequate” threshold and a mere 5.3% attaining “Good.” No LLM response was rated as “Excellent,” signifying a pronounced gap between LLM-generated arguments and the level required for Japanese legal qualification.
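To give a sense of scale, the percentages can be converted into approximate response counts. The sketch assumes both percentages are taken over all 189 answers, which the text does not state explicitly; the helper name `approx_count` is introduced here for illustration, and the counts are back-calculated rather than reported.

```python
TOTAL = 189  # answers in the experimental corpus


def approx_count(total: int, share: float) -> int:
    """Nearest whole number of responses corresponding to a reported share."""
    return round(total * share)


for label, share in (("Adequate", 0.09), ("Good", 0.053)):
    count = approx_count(TOTAL, share)
    print(f"{label}: about {count} of {TOTAL} ({count / TOTAL:.1%})")
```

Under that assumption the figures correspond to roughly 17 and 10 responses respectively, which round back to the reported 9% and 5.3%.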

FS+Law generally improved scores over ZS and FS, particularly for Claude 3 Opus, where including statutory articles raised the mean score from 30.4% (FS) to 37.1% (FS+Law). GPT-4o persistently underperformed, receiving uniformly “Poor” ratings, while o3 improved marginally with FS+Law but was otherwise stable across settings.

Expert comments emphasized that in-context learning (FS, FS+Law) improved structural compliance (e.g., fewer fragmented or bullet-form answers), but had limited effect on the models’ capacity for accurate fact-law mapping. Common criticisms included missing or incorrect statutory references, superficial or erroneous factual reasoning, and breakdowns in logical structure.

Hallucination Analysis

A rigorous, post-hoc manual examination cataloged every model-cited law article and precedent, labeling each as grounded or hallucinated. Key findings include:

Of 2,575 statutory and 441 precedent mentions, hallucination rates were 14.8% (laws) and 66.9% (precedents).
Hallucinations in legal citations are highly model- and prompt-dependent: o3 exhibited hallucination rates up to 34.7% (ZS), dropping to 22.8% (FS+Law), whereas Claude 3 Opus in FS+Law achieved the lowest rate at 5.1%.
Inclusion of statutory articles (FS+Law) consistently reduced hallucination rates across all models and subjects, with the most substantial effect for Claude 3 Opus.
Procedural law subjects, particularly the Code of Criminal Procedure, showed both lower overall reference rates and elevated hallucination risk, likely reflecting underrepresented or poorly modeled domains within LLM training data.
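The rates above are shares of citations labeled as hallucinated, computed separately for law articles and precedents. A minimal sketch of that computation follows; the tuple layout and the function name `hallucination_rates` are assumptions, and the toy labels are invented for illustration and are not the paper's annotations.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Each citation is (kind, is_hallucinated), where kind is "law" or "precedent".
Citation = Tuple[str, bool]


def hallucination_rates(citations: Iterable[Citation]) -> Dict[str, float]:
    """Return the share of hallucinated citations for each citation kind."""
    totals: Counter = Counter()
    hallucinated: Counter = Counter()
    for kind, is_hallucinated in citations:
        totals[kind] += 1
        if is_hallucinated:
            hallucinated[kind] += 1
    return {kind: hallucinated[kind] / totals[kind] for kind in totals}


# Toy labels for illustration only; these are not the paper's annotations.
toy = [
    ("law", False), ("law", False), ("law", False), ("law", True),
    ("precedent", True), ("precedent", True), ("precedent", False),
]
print(hallucination_rates(toy))
```

Applied to the reported totals, rates of 14.8% and 66.9% correspond to roughly 381 of 2,575 statutory mentions and 295 of 441 precedent mentions. These counts are back-calculated here and are not stated in the text.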

The high rate of hallucinated precedent references highlights deficient coverage of the Japanese legal corpus in LLM pretraining and underscores a critical risk for real-world legal adoption.

Implications and Prospective Directions

This study empirically demonstrates the severe disconnect between LLM legal NLP benchmark success and the requirements of genuine legal analysis in high-stakes, text-based evaluation. LLMs do not currently exhibit sufficient capacity for structured open-ended reasoning, reliable subject-specific citation, or robust application of complex facts to law—even under conditions of gold statute provision and strong prompts.

Practical implications are acute: automated legal reasoning systems relying on generic LLMs lack the reliability necessary for professional deployment, especially given the prevalence of hallucinated citations, misevaluation of complex scenarios, and brittle engagement with nuanced statutory frameworks.

Prospective developments must focus on:

Pretraining/Finetuning on national legal corpora: Improved coverage of Japanese (and other non-English) statutes and precedents
Retrieval-augmented generation (RAG): Tight integration with up-to-date and jurisdiction-specific legal databases to ground outputs in real sources
Domain-specific evaluation sets: Expansion of expert-annotated, open-ended datasets across diverse legal domains
Advanced prompt engineering and reasoning scaffolds: To foster deeper chain-of-thought performance, especially for multi-issue legal analysis

The challenges identified here plausibly extend to other civil law jurisdictions and underline the necessity of rigorous, expert-led evaluation to diagnose and mitigate LLM failure modes in domain-critical settings.

Conclusion

The dataset and expert evaluation protocol introduced in this study provide the first rigorous diagnostic probe of LLMs' ability to perform Japanese bar exam-level open-ended legal reasoning. Despite advances in LLM generative capabilities, current models remain inadequate for structured legal argumentation, plagued by systemic hallucinations and shallow engagement with complex fact patterns. Robust, grounded, and high-fidelity legal NLP will require substantial innovation in both legal data curation and model conditioning.
