LegalSemi: Malaysian Contract Law Benchmark
- LegalSemi is a benchmarking resource that automates IRAC legal analysis in Malaysian contract law using annotated scenarios and a graph-based knowledge base.
- It features 54 detailed scenarios covering 55 subtopics, with expert annotations ensuring high-quality legal reasoning.
- The framework boosts LLM performance in rule retrieval, issue identification, and application generation, demonstrating measurable gains via retrieval-augmented techniques.
LegalSemi is a benchmarking resource and semi-structured workflow for automating IRAC (Issue, Rule, Application, Conclusion) legal analysis in the context of Malaysian Contract Law. Developed for researchers working on legal natural language processing and reasoning, LegalSemi delivers a corpus of extensively annotated contract law scenarios and a graph-based knowledge base, facilitating both empirical benchmarking of LLMs and deployment of reasoning pipelines in legal practice (Kang et al., 2024, Kang et al., 2023).
1. Structure and Scope of the LegalSemi Benchmark
LegalSemi consists of 54 scenarios representing complex, real-world fact patterns in Malaysian contract law. These scenarios systematically cover contract formation topics: offer and acceptance, consideration, certainty (including promissory estoppel), capacity (minors and mental incapacity), and intention to create legal relations. The dataset subsumes 55 sub-topics (e.g., invitation to treat, counter-offers, voidable contracts), and individual scenarios average approximately 800 words, reflecting the granularity and ambiguity typical of legal exams and professional memos (Kang et al., 2024).
Expert annotation was conducted by four second-year law students and two junior lawyers, using a web-based tool to produce JSON and text representations. Each scenario is decomposed into:
- Issues: Stated as precise questions, 243 in total.
- Rules: 268 citations (44 unique sections) from the Contracts Act 1950 and 76 court cases.
- Applications: 607 chain-of-reasoning steps articulated in semi-structured IF…THEN patterns.
- Conclusions: Concise answers to the Issues, with no new legal principles introduced.
Quality assurance involved double annotation, an inter-annotator agreement check (κ > 0.75), and expert review (Kang et al., 2023).
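As a rough illustration of the annotation format and the agreement check described above, the sketch below models one annotated scenario as a record and computes a pairwise agreement score. The field names are hypothetical (the actual JSON schema is not given here), and Cohen's kappa is an assumption: the source reports κ without naming the variant.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class IracAnnotation:
    """Hypothetical record for one annotated scenario (field names are assumed)."""
    scenario_id: str
    issues: list[str] = field(default_factory=list)
    rules: list[str] = field(default_factory=list)
    applications: list[str] = field(default_factory=list)
    conclusions: list[str] = field(default_factory=list)


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data: rule labels assigned by two annotators to six application steps.
ANNOTATOR_A = ["s 2(d)", "s 10", "s 10", "s 64", "s 2(d)", "s 10"]
ANNOTATOR_B = ["s 2(d)", "s 10", "s 64", "s 64", "s 2(d)", "s 10"]

record = IracAnnotation(scenario_id="demo-01", issues=["Is there valid consideration?"])
print(json.dumps(asdict(record), indent=2))
print(round(cohen_kappa(ANNOTATOR_A, ANNOTATOR_B), 2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few sections dominate the citations.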
2. The Semi-Structured Knowledge Base (SKG): Design and Function
LegalSemi’s companion semi-structured knowledge base (SKG) is stored as a graph in Neo4j, with 3,114 nodes and 1,811 edges across eight node types:
- Chapter, Title, Section
- Interpretation (plain-language paraphrases from "Law for Business")
- Extended Content (cases and textbook notes)
- Main Concept, Subconcept, Sub-subconcept
Legal rules, statutory texts, plain-language interpretations, and annotated case law are linked in the graph through BELONGS_TO, HAS_TITLE, MENTIONS, CONCEPT_OF, and various hierarchical relations. For each scenario, legal concepts are mapped to MainConcept nodes, which enables SKG traversal to related sections and interpretations, supporting precise rule retrieval and traceability during IRAC annotation (Kang et al., 2024).
Application steps in the corpus reference SKG rule nodes, so the provenance of each inference is explicit. Interpretations, drawn from textbooks and, where present, from LLM-generated summaries, help bridge the gap between formal legalese and practical reasoning.
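The concept-to-rule traversal described in this section can be sketched with a small in-memory graph. This is only an illustration under stated assumptions: the node identifiers, the edge directions, and the pairing of relations with node types are invented for the example, and the real SKG is queried in Neo4j rather than through Python dictionaries.

```python
from collections import defaultdict

# Toy edges as (source, relation, target). All ids and directions are assumed.
EDGES = [
    ("sec:2(d)", "CONCEPT_OF", "concept:consideration"),
    ("sec:10", "CONCEPT_OF", "concept:capacity"),
    ("sec:64", "CONCEPT_OF", "concept:discharge"),
    ("interp:2(d)", "BELONGS_TO", "sec:2(d)"),
    ("interp:64", "BELONGS_TO", "sec:64"),
]


def build_incoming_index(edges):
    """Index edges by target so we can walk from a concept back to its sources."""
    incoming = defaultdict(list)
    for source, relation, target in edges:
        incoming[target].append((relation, source))
    return incoming


def rules_for_concept(concept, incoming):
    """Return {section id: [interpretation ids]} linked to a MainConcept node."""
    result = {}
    for relation, section in incoming.get(concept, []):
        if relation != "CONCEPT_OF":
            continue
        # Collect the plain-language interpretations attached to this section.
        result[section] = [
            source
            for rel, source in incoming.get(section, [])
            if rel == "BELONGS_TO"
        ]
    return result


INDEX = build_incoming_index(EDGES)
print(rules_for_concept("concept:consideration", INDEX))
print(rules_for_concept("concept:capacity", INDEX))
```

Because every returned section carries its own node id, an application step that cites it stays traceable back to the graph.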
3. IRAC Methodology in Malaysian Contract Law
IRAC structures case analysis in four components, mapped as follows (Kang et al., 2023):
- Issue: The legal question raised, framed under the Contracts Act but not mapped to a specific section.
- Rule: Citations of statutory or precedential sources (e.g., s 2(d) for consideration, s 10 for competency, s 64 for discharge by satisfaction, and relevant cases).
- Application: Sequential reasoning, often defeasible, implemented with conditional logic in a template:
  - IF {fact/concept} AND/OR {fact/concept} THEN {legal conclusion} [RuleCitation].
  - Conjunctions (AND, OR, HOWEVER) reflect alternate and non-monotonic inferences.
- Conclusion: Succinct outcome derived directly from the Application (e.g., “Bob and Jane cannot demand repayment of the whole debt” for a scenario involving s 2(d) and s 64).
This methodology is instantiated in both annotation guidelines and evaluation protocols for LLM outputs. The semi-structured representation enables clear mapping between scenario facts, legal rules, intermediate analyses, and outcomes (Kang et al., 2023).
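One way to make the IF…THEN template and its HOWEVER clause concrete is to treat each application step as a condition, a conclusion, a citation, and an optional defeater. The sketch below is an assumed encoding, not the corpus format: the class and function names are hypothetical, and the facts X, Y, Z, and E are abstract placeholders rather than statements of Malaysian law.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Facts = dict[str, bool]
Predicate = Callable[[Facts], bool]


@dataclass
class ApplicationStep:
    """One 'IF ... THEN ... [RuleCitation]' step with an optional HOWEVER clause."""
    condition: Predicate
    conclusion: str
    citation: str
    however: Optional[Predicate] = None  # defeater: withdraws the conclusion


def apply_steps(steps: list[ApplicationStep], facts: Facts) -> list[tuple[str, str]]:
    """Return (conclusion, citation) for every step that fires and is not defeated."""
    derived = []
    for step in steps:
        if not step.condition(facts):
            continue
        if step.however is not None and step.however(facts):
            continue  # the HOWEVER clause holds, so the default conclusion is dropped
        derived.append((step.conclusion, step.citation))
    return derived


# IF X AND (Y OR Z) THEN conclusion C [RuleCitation], HOWEVER not when E holds.
STEP = ApplicationStep(
    condition=lambda f: f["X"] and (f["Y"] or f["Z"]),
    conclusion="legal conclusion C",
    citation="RuleCitation",
    however=lambda f: f.get("E", False),
)

print(apply_steps([STEP], {"X": True, "Y": False, "Z": True}))
print(apply_steps([STEP], {"X": True, "Y": False, "Z": True, "E": True}))
```

Adding the fact E removes a conclusion that was previously derived, which is the non-monotonic behaviour the HOWEVER conjunction is meant to capture.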
4. Automated Analysis and Evaluation Using LLMs
LegalSemi provides for systematic evaluation of LLMs’ legal reasoning under both zero-shot and prompt-engineered conditions (Kang et al., 2023, Kang et al., 2024). The methodology encompasses:
- Legal Concept Identification: Most LLMs reach ~0.50 F₁ on top-level concepts but underperform on granular sub-concepts (F₁ < 0.20).
- Issue Identification: Supplying the relevant legal concepts as explicit context raises issue quality by 21.4% (GPT-3.5 turbo).
- Rule Retrieval: TF-IDF–based retrieval via the SKG boosts top-5 F₁ from 2.5% (pure text search) to 16.3%, especially when using lay-language interpretations.
- Application Generation: Application F₁ rises by 18.9 percentage points for GPT-3.5 turbo when issues and retrieved rules are provided.
- Conclusion Generation: Accuracy increases by 71.4% with explicit Application text as input (GPT-3.5 turbo).
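The rule-retrieval step and its top-k F₁ score can be sketched with a small, dependency-free TF-IDF ranker. Several things here are assumptions: the three lay-language snippets are invented toy paraphrases (not the Act's wording or the textbook's), the smoothed IDF formula is one common variant, and `f1_at_k` is a guess at the metric, since the exact top-5 F₁ computation is not specified here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return ({doc id: sparse TF-IDF vector}, idf table) for a dict of texts."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    # Smoothed IDF, so terms found in every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: count * idf[t] for t, count in Counter(toks).items()}
        for doc_id, toks in tokens.items()
    }
    return vectors, idf


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, docs, k=5):
    """Rank document ids by cosine similarity to the query and keep the top k."""
    vectors, idf = tfidf_vectors(docs)
    query_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    ranked = sorted(docs, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
    return ranked[:k]


def f1_at_k(retrieved, gold):
    """F1 between the retrieved ids and the gold rule ids (assumed definition)."""
    hits = len(set(retrieved) & set(gold))
    if hits == 0:
        return 0.0
    precision, recall = hits / len(retrieved), hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented plain-language snippets keyed by section id (toy data only).
DOCS = {
    "s 2(d)": "a promise is supported by consideration when something is given or done in return",
    "s 10": "an agreement is a contract when the parties are competent and consent freely",
    "s 64": "a promisee may accept a lesser sum in satisfaction of the whole debt",
}
QUERY = "Can the creditor still demand the whole debt after accepting a lesser sum?"

top = retrieve(QUERY, DOCS, k=2)
print(top, round(f1_at_k(top, {"s 64"}), 3))
```

Indexing lay-language interpretations instead of raw statutory text raises the lexical overlap with scenario wording, which is consistent with the gain reported for interpretation-based retrieval.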
Each automated IRAC stage is scored by a reference LLM against a rubric, and the resulting scores correlate strongly (ρ ≈ 0.86–0.91) with those of human examiners (Kang et al., 2024).
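To show how such a correlation between rubric scores and human marks could be checked, the sketch below computes a rank correlation in plain Python. It assumes ρ denotes Spearman's coefficient, which the source does not state, and the score lists are made-up toy values.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores for five model outputs (invented values).
LLM_SCORES = [1, 2, 3, 4, 5]
HUMAN_SCORES = [2, 1, 3, 5, 4]
print(round(spearman_rho(LLM_SCORES, HUMAN_SCORES), 2))
```

A rank-based coefficient only asks whether the automatic grader orders outputs the same way as human examiners, so it tolerates differences in how the two use the scoring scale.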
5. Error Analysis, Limitations, and Engineering Recommendations
Key limitations in LLM-driven IRAC analysis emerge in rule retrieval, logical chaining, and legal precision:
- In only 1 of 40 scenarios did the model-generated analysis contain fully correct rule citations.
- Application and conclusion steps are often over-generalized and not precisely tied to the relevant section or fact.
- LLM reasoning is mostly monotonic, struggling to model defeasible logic or nested logical constructs (e.g., IF X AND (Y OR Z)…).
- LLM outputs exhibit high fluency but lack legal rigor, sometimes prioritizing surface coherence or hallucinated interpretations (Kang et al., 2023, Kang et al., 2024).
Recommended improvements include:
- Enforcing semi-structured IRAC prompting (with explicit statute placeholders and logical operators).
- Retrieval-augmented generation (RAG) using SKG lookups.
- Chain-of-thought decomposition for multi-factor disputes.
- Human-in-the-loop quality gates (e.g., missing statutes trigger secondary review).
- Supervised fine-tuning on LegalSemi’s annotated corpus to internalize legal reasoning patterns.
- Expansion of the SKG to encompass more case law and interpretive content.
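Two of the recommendations above, semi-structured prompting and a human-in-the-loop quality gate, can be sketched together. Everything in this example is hypothetical: the prompt wording, the function names, and the citation pattern are illustrative assumptions, and the regular expression is deliberately crude (it only recognises citations written as "s 64" or "s 2(d)").

```python
import re

# Matches citations such as "s 64" or "s 2(d)" and captures the section id.
SECTION_PATTERN = re.compile(r"\bs\s?(\d+[A-Z]?(?:\([0-9a-z]+\))*)")


def build_prompt(scenario: str, rule_ids: list[str]) -> str:
    """Assemble a semi-structured IRAC prompt with explicit statute placeholders."""
    rules = "\n".join(f"- s {rule_id}" for rule_id in rule_ids)
    return (
        "Analyse the scenario using IRAC.\n"
        "Issue: state each legal question.\n"
        "Rule: cite only from the retrieved sections listed below.\n"
        "Application: one step per line, in the form\n"
        "  IF <fact> AND/OR <fact> THEN <legal conclusion> [s <section>]\n"
        "Conclusion: answer each issue without introducing new rules.\n\n"
        f"Retrieved sections:\n{rules}\n\n"
        f"Scenario:\n{scenario}\n"
    )


def cited_sections(text: str) -> set[str]:
    return set(SECTION_PATTERN.findall(text))


def needs_review(llm_output: str, rule_ids: list[str]) -> bool:
    """Flag an output for secondary review if it cites no statute at all,
    or cites a section that was not among the retrieved rules."""
    cited = cited_sections(llm_output)
    return not cited or not cited <= set(rule_ids)


RETRIEVED = ["2(d)", "64"]
print(build_prompt("(scenario text goes here)", RETRIEVED))
print(needs_review("IF X AND Y THEN conclusion C [s 64]", RETRIEVED))
print(needs_review("The claim succeeds.", RETRIEVED))
```

A gate like this checks only that citations are present and come from the retrieved set. It does not check that the cited section actually supports the conclusion, which is why the flagged cases still go to a human reviewer.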
6. Practical Applications and Research Impact
LegalSemi and the SKG underpin a dual pipeline: legal education (teaching IRAC structuring and statutory mapping in law schools) and professional augmentation (QA-assistants for legal drafting). Empirically, the resources facilitate benchmarking and ablation studies of LLMs, exposing current bottlenecks in concept identification, rule retrieval, and logical coherence.
A plausible implication is that neuro-symbolic integration (embedding SKG query outputs into LLM attention or RAG architectures) may further narrow the gap between LLM fluency and statutory precision, potentially leading to hybrid rule engines for conditional legal reasoning.
In sum, LegalSemi provides an empirical, extensible resource for benchmarking AI systems on contract law and supports more transparent automation of statutory legal reasoning (Kang et al., 2023, Kang et al., 2024).