LegalOne: Chinese Legal Reasoning AI
- LegalOne models are compact foundation models tailored for the Chinese legal domain, integrating structured training for factual grounding and logical rigor.
- The models use a three-phase pipeline—pretraining, mid-training with Plasticity-Adjusted Sampling, and supervised fine-tuning via LEAD—to refine legal reasoning.
- A curriculum reinforcement learning approach and the LegalKit evaluation framework empower LegalOne to match or outperform larger LLMs in legal tasks.
LegalOne refers to a family of foundation models developed for precise, reliable legal reasoning within the Chinese legal domain, achieving state-of-the-art performance across a range of legal tasks. These models—LegalOne-1.7B, LegalOne-4B, and LegalOne-8B—are distinguished by a comprehensive, multi-phase training pipeline that addresses the challenges of domain adaptation, explicit structured reasoning, and staged reinforcement learning. LegalOne sets a benchmark for compact legal LLMs in performance and interpretability, with a particular emphasis on factual grounding and logical rigor across judicial scenarios. The models and the LegalKit evaluation framework are publicly available for research and application in high-stakes judicial contexts (Li et al., 31 Jan 2026).
1. Model Family, Architecture, and Data
LegalOne is initialized from the open-source Qwen3-Base checkpoint, retaining the decoder-only transformer backbone with a 32,000-token context window, Rotary Positional Embeddings (RoPE, base=1e6), FlashAttention, and RMSNorm. Three model variants exist:
| Model | Parameters | Layers | Hidden Size |
|---|---|---|---|
| LegalOne-1.7B | 1.7B | ~12 | ~2048 |
| LegalOne-4B | 4B | ~24 | ~3072 |
| LegalOne-8B | 8B | ~32 | ~4096 |
Optimization uses AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1, grad_clip=1.0) with a three-phase learning rate schedule: rapid warmup to ηₘₐₓ=3×10⁻⁴ over 2K steps, a stable plateau, then linear decay to 3×10⁻⁶ over the final 10B tokens.
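The three-phase schedule can be sketched as a step-indexed function; only the warmup length and the two learning-rate endpoints come from the paper, while the plateau fraction (`stable_frac`) and total step count here are illustrative assumptions:

```python
def lr_at(step, total_steps, warmup_steps=2_000, stable_frac=0.5,
          lr_max=3e-4, lr_min=3e-6):
    """Three-phase schedule: linear warmup -> stable plateau -> linear decay.

    Only warmup_steps, lr_max, and lr_min are from the paper;
    stable_frac and total_steps are illustrative assumptions.
    """
    stable_end = warmup_steps + int(stable_frac * total_steps)
    if step < warmup_steps:                      # phase 1: rapid warmup
        return lr_max * step / warmup_steps
    if step < stable_end:                        # phase 2: stability
        return lr_max
    # phase 3: linear decay toward lr_min over the remaining steps
    frac = (step - stable_end) / max(1, total_steps - stable_end)
    return lr_max + (lr_min - lr_max) * min(1.0, frac)
```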
The pretraining and mid-training corpus consists of ∼100B tokens, partitioned 40% general-domain (FineWeb-Edu, FinePDFs, FineWiki, SkyPile-150B, IndustryCorpus, OpenNewsArchive, MathPile, Wanjuan, BaiduBaike-5.63M) and 60% legal-domain (recent court judgments, statutes, law reviews, textbooks, commentaries, and consultations). Data undergoes OCR, normalization, field verification, MinHash deduplication, length filtering, and mixed-language noise removal.
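The MinHash deduplication step can be illustrated with a minimal pure-Python sketch over character shingles; the shingle size, number of hash functions, and use of MD5 here are illustrative assumptions, not the paper's configuration:

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle=5):
    """MinHash signature over character shingles (illustrative parameters)."""
    shingles = {text[i:i + shingle]
                for i in range(max(1, len(text) - shingle + 1))}
    sig = []
    for seed in range(num_hashes):
        # each seed simulates an independent hash function
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima approximates Jaccard similarity,
    so near-duplicate documents can be dropped above a threshold."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In practice a production pipeline would bucket signatures with locality-sensitive hashing rather than comparing all pairs.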
Synthetic data generation includes:
- Reasoning data with structured chains of thought (CoTs) from Qwen3-235B-A22B-Thinking, leveraging the Feynman Technique.
- Paraphrasing for style diversity using the Kimi-K2 paradigm, filtered for semantic consistency.
A perplexity-based filter (PPL≤30 under Qwen3-4B-base) retains only high-quality Chinese legal texts, with a minor fraction of high-quality English general texts.
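The PPL gate reduces to a simple computation once per-token log-probabilities from the scoring model are available; the helper below sketches this, with the threshold taken from the paper and everything else generic:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log-likelihood) over a token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def keep_sample(token_logprobs, ppl_threshold=30.0):
    """Retain a document only if the scoring model finds it predictable,
    mirroring the paper's PPL<=30 gate under Qwen3-4B-base."""
    return perplexity(token_logprobs) <= ppl_threshold
```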
2. Mid-Training: Plasticity-Adjusted Sampling (PAS)
Plasticity-Adjusted Sampling (PAS) addresses the optimization instability that arises when augmenting a pre-trained model with domain-specific data ("re-warmup" phase at higher learning rates). The approach uses sample-level perplexity (PPL) to partition training data into anchor (low-PPL) and adaptation (higher-PPL) buckets.
At each training step $t$, the plasticity coefficient $\lambda(t) \in [0, 1]$ modulates sampling: initially, the scheduler biases toward anchor data (a numerical damper minimizing the influence of high-curvature regions); as training progresses and $\lambda(t) \to 0$, the sampling distribution relaxes to the target weights $w_k$.
For a corpus divided into buckets $B_1, \dots, B_K$, the sampling probability at time $t$ takes the form (reconstructed here from the surrounding description)

$$p_k(t) = (1 - \lambda(t))\, w_k + \lambda(t)\, \mathbb{1}[k = k^{\star}],$$

where $\lambda(t)$ controls anchor emphasis and $\mathbb{1}[k = k^{\star}]$ is the indicator for the lowest-PPL bucket. Pseudocode in the paper formalizes batch selection and update dynamics.
This approach stabilizes the learning trajectory, avoiding catastrophic forgetting and enabling effective knowledge transfer from generic to legal-domain distributions.
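The PAS sampling rule can be sketched as an interpolation between all-anchor sampling and the target mixture; this is a reconstruction consistent with the description above (anchor bias early, target weights late), and the linear λ schedule is an illustrative assumption:

```python
def pas_probabilities(target_weights, anchor_idx, lam):
    """Plasticity-adjusted sampling: interpolate between all-anchor sampling
    (lam=1, early training) and the target mixture (lam=0, late training).

    A reconstruction from the paper's description; the exact
    parameterization in the paper may differ.
    """
    return [(1.0 - lam) * w + (lam if k == anchor_idx else 0.0)
            for k, w in enumerate(target_weights)]

def lam_schedule(step, total_steps):
    """Illustrative linear decay of the plasticity coefficient."""
    return max(0.0, 1.0 - step / total_steps)
```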
3. Supervised Fine-Tuning: Legal Agentic CoT Distillation (LEAD)
Legal Agentic CoT Distillation (LEAD) is a structured approach to supervised fine-tuning (SFT) that extracts explicit, multi-step legal reasoning from authentic judicial documents.
The LEAD workflow includes:
- Prompt Collection: Structuring documents into Fact, Reasoning, and Decision; linking statutes to external knowledge bases; stratified sampling for case diversity; filtering by completeness and complexity (≥3/5 via LLM scoring).
- Question Generation:
  - Structural distillation (Fact→Reasoning, etc.)
  - Multi-perspective simulation (LLM role-play as litigant, attorney, judge)
  - Real-world user query mining from consultation platforms
- Agentic CoT Synthesis: Decomposing legal tasks into a state machine (e.g., Fact Finding, Issue Identification, Rule Retrieval, Rule Deduction, Conclusion Derivation), each step managed by agent protocols and templates integrating professional SOPs and knowledge base retrieval.
- Trajectory Refinement: Internalizing knowledge (removing explicit citations, restructuring reasoning as intrinsic understanding), and merging fragmented traces into globally ordered, non-redundant chains of thought.
- Quality Control: Heuristic dropping of incomplete/code-mixed samples; LLM-as-judge evaluation over six axes (reasoning, consistency, alignment, conciseness, linguistic, overall) with minimum subscore thresholds.
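The agentic CoT synthesis step can be sketched as an ordered state machine; the stage names follow the paper's example, while the handler interface below is a hypothetical abstraction standing in for the agent templates and knowledge-base retrieval:

```python
from enum import Enum, auto

class Stage(Enum):
    FACT_FINDING = auto()
    ISSUE_IDENTIFICATION = auto()
    RULE_RETRIEVAL = auto()
    RULE_DEDUCTION = auto()
    CONCLUSION_DERIVATION = auto()

PIPELINE = list(Stage)  # stages run in fixed legal-reasoning order

def synthesize_cot(case, handlers):
    """Run each stage's handler over the evolving trace.

    `handlers` maps Stage -> callable(case, trace) -> str; in the paper each
    stage is driven by agent protocols, SOP templates, and knowledge-base
    retrieval, which this sketch abstracts away.
    """
    trace = []
    for stage in PIPELINE:
        trace.append((stage.name, handlers[stage](case, trace)))
    return trace
```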
The distillation loss is token-level cross-entropy over the refined trajectories, $\mathcal{L}_{\text{SFT}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, x)$. Training parameters include max LR=2×10⁻⁵, 2 epochs, batch size=32, a 32K-token context, and ≈500K total SFT samples.
4. Curriculum Reinforcement Learning and Reward Structuring
LegalOne’s RL curriculum comprises five task types across three capability stages:
- Memorization: Statute completion (article number → statutory text, ROUGE-L reward)
- Understanding: Statute application (find all relevant provisions, soft F₁ via ROUGE >0.5 match)
- Reasoning:
  - Case Analysis (16-choice MCQs, accuracy reward)
  - Judgment Prediction (civil/criminal/admin, reward=1 for correct quantitative value within ±10%)
  - Legal Document Generation (drafting, rubric-trained evaluator reward)
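The verifiable rewards above reduce to simple scoring functions; the sketch below uses a simplified whitespace-token ROUGE-L (the paper's tokenization and exact reward shaping are not specified here):

```python
def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 over whitespace tokens (illustrative)."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # LCS dynamic program
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == cand[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

def judgment_reward(predicted, target, tol=0.10):
    """Reward 1.0 iff the predicted quantity is within +/-10% of target."""
    return 1.0 if abs(predicted - target) <= tol * abs(target) else 0.0

def mcq_reward(predicted_choice, gold_choice):
    """Accuracy reward for the 16-choice case-analysis MCQs."""
    return 1.0 if predicted_choice == gold_choice else 0.0
```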
Curriculum RL is implemented via DAPO with a token-level baseline: for each trajectory $i$ of length $L_i$ and sequence reward $R_i$, every token in sample $i$ is assigned the advantage

$$\hat{A}_i = R_i - b, \qquad b = \frac{\sum_j L_j R_j}{\sum_j L_j},$$

with the baseline $b$ reconstructed here as the token-count-weighted mean reward. The token-level baseline is shown to be unbiased and to exhibit strictly lower variance than the sequence-level alternative, mitigating length-reward correlation noise in gradient estimation.
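The token-level baseline (reconstructed here as a token-count-weighted mean of sequence rewards; the paper's exact formula may differ) can be sketched as:

```python
def token_level_advantages(lengths, rewards):
    """Each token in trajectory i gets advantage R_i - b, where baseline b
    weights every trajectory's reward by its token count (a reconstruction
    of the token-level baseline described for LegalOne's DAPO stage)."""
    total_tokens = sum(lengths)
    baseline = sum(L * R for L, R in zip(lengths, rewards)) / total_tokens
    # one advantage per trajectory, broadcast to all of its tokens
    return [R - baseline for R in rewards]
```

Weighting by token count makes the token-weighted advantages sum to zero across a batch, which is what removes the length-reward correlation from the gradient estimate.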
This progressive RL regime enables the model to evolve from rote pattern matching to advanced, autonomous legal reasoning.
5. Experimental Results and Ablation Analyses
LegalOne is evaluated with LegalKit, incorporating benchmarks such as JEC-QA (National Unified Legal Exam, knowledge-based and case analysis) and LexEval (taxonomy: Memorization, Understanding, Logical Inference, Discrimination, Generation, Ethics). Generation is scored by Qwen3-235B-A22B as the LLM judge.
Selected quantitative results:
| Model | Params | LexEval Avg | JEC-QA Avg |
|---|---|---|---|
| GPT-4o-mini | 40B | 52.91 | 28.49 |
| GPT-4o | 80B | 62.32 | 42.23 |
| Qwen3-235B-A22B | 235B | 69.47 | 62.07 |
| Kimi-K2-1T | 1000B | 72.81 | 55.34 |
| LegalOne-1.7B | 1.7B | 60.56 | 41.24 |
| LegalOne-4B | 4B | 67.59 | 55.68 |
| LegalOne-8B | 8B | 71.19 | 62.36 |
LegalOne-8B matches or outperforms massive general LLMs on case analysis (JEC-QA, 61.52) and ranks second in LexEval Discrimination and Ethics, demonstrating superior knowledge density and efficiency for its scale. General-domain evaluations (GSM8K, Math500, C-Eval) confirm that logical reasoning is preserved or improved post-RL.
Ablation studies reveal that mid-training with PAS is critical to robust domain adaptation (especially for 1.7B), SFT alone is detrimental for small models, and curriculum RL delivers pronounced gains in larger models, suggesting emergent scaling effects. Qualitative analysis shows LegalOne-8B’s reasoning mimics professional legal logic (e.g., employing the "Three-Tier Theory of Crime").
6. Limitations and Prospective Enhancements
Known limitations include:
- Capacity constraints in ~1.7B models, forcing a tradeoff between general knowledge and legal expertise
- Coverage bias toward recently published material; older or niche case law underrepresented
- Strictly Chinese-centric design, with no explicit cross-jurisdictional reasoning
- Risk of artifacts in synthetic reasoning/paraphrase data if generation bias persists
Future directions prioritized by the research include:
- Extending data coverage to IP, antitrust, international law, and multilingual corpora
- Real-time legal database integration for authoritative citation grounding
- Refinement of RL rewards to capture logical consistency within chains of thought
- Human-in-the-loop auditing for fairness/bias oversight in deployment
- Investigation of modular architectures with specialist adapters or retrieval components for greater interpretability and computational efficiency
7. Significance and Impact in Legal AI
LegalOne establishes a reproducible, interpretable framework for deploying foundation models in professional legal settings. The integration of PAS, LEAD, and multi-stage curriculum RL enables not only robust fact and statute memorization, but also authentic, multi-step judicial reasoning. LegalKit’s open evaluation empowers further research and benchmarking. A plausible implication is the catalysis of reliable, domain-sensitive LLMs for other high-stakes domains, provided similar methodological rigor is maintained (Li et al., 31 Jan 2026).