CAIL2018 Legal Judgment Prediction Challenge
- The CAIL2018 Competition is the first large-scale benchmark for automated legal judgment prediction in China, built on over 2.6 million criminal case documents.
- It formalizes legal judgment prediction into three subtasks—law article prediction, charge identification, and prison-term estimation—with standardized evaluation metrics.
- Participants applied advanced NLP and machine learning techniques, addressing challenges like label imbalance and complex prison-term prediction.
The CAIL2018 Competition, short for the Chinese AI and Law Challenge, established the first large-scale, publicly available benchmark for automated legal judgment prediction targeted at the Chinese judicial system. At its center is a dataset of over 2.6 million single-defendant criminal case documents published by the Supreme People’s Court of China. Designed to mirror the operational realities of the court system, CAIL2018 formalizes the legal judgment prediction (LJP) task as three subtasks: predicting the relevant law articles, predicting the associated criminal charges, and estimating the length of the prison term, all from natural-language fact descriptions. The competition catalyzed significant progress in the field by providing data at scale, rich annotation, and a rigorous, standardized evaluation framework, drawing participation from over 600 teams and more than 1,100 individual contestants (Xiao et al., 2018; Zhong et al., 2018).
1. Dataset Construction and Preprocessing
The CAIL2018 dataset is derived from 5,730,302 documents sourced from China Judgments Online, each structured in an XML-like format and spanning several judicial document types (judgments, verdicts, conciliation statements, decision letters, and notices). The dataset construction pipeline is as follows:
- Selection: Only criminal "judgment" documents were retained, with fact descriptions and judgment results extracted via regular-expression matching.
- Filtering Multi-Defendant Cases: All cases involving more than one defendant were excluded due to the additional complexity of multi-defendant reasoning.
- Label Frequency Thresholding: Law articles or charges with fewer than 30 occurrences, and 102 general criminal law articles not tied to specific charges, were removed to address label sparsity.
- Text Segmentation: THULAC was employed for word segmentation, facilitating both TF–IDF vectorization for classical models and tokenization for neural encoders (a minimal sketch of this step follows below).
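As a concrete illustration of this preprocessing step, the following is a minimal sketch (not the organizers' pipeline), assuming the `thulac` and `scikit-learn` packages; the fact string and vocabulary cap are toy assumptions:

```python
# Sketch: THULAC segmentation followed by TF-IDF vectorization,
# mirroring the classical-model preprocessing described above.
import thulac
from sklearn.feature_extraction.text import TfidfVectorizer

seg = thulac.thulac(seg_only=True)  # word segmentation only, no POS tagging

facts = ["2016年3月，被告人张某在某市盗窃财物，价值人民币五千元。"]  # toy fact description
segmented = [seg.cut(f, text=True) for f in facts]  # space-separated token string

vectorizer = TfidfVectorizer(max_features=5000)  # vocabulary cap (assumed setting)
X = vectorizer.fit_transform(segmented)          # sparse TF-IDF features for SVM-style models
```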
Post-filtering, the final dataset contains exactly 2,676,075 cases, each annotated with:
- the applicable law articles, drawn from 183 distinct articles of the Criminal Law of China
- the applicable criminal charges, drawn from 202 distinct charge types
- the exact prison term in months
Severe label imbalance is inherent: the ten most frequent charges account for over 79% of all cases, while the ten least frequent cover only 0.12%, presenting a pronounced long-tail distribution (Xiao et al., 2018; Zhong et al., 2018).
2. Task Formulation
CAIL2018 formalizes LJP as three subproblems, each mapping from the fact description to distinct judicial outcomes:
- Law-Article Prediction: Multi-label classification: given the fact description $x$, predict the binary vector $y^{\mathrm{law}} \in \{0,1\}^{183}$, where $y^{\mathrm{law}}_i = 1$ indicates that the $i$-th article applies.
- Charge Prediction: Multi-label classification: given $x$, predict $y^{\mathrm{charge}} \in \{0,1\}^{202}$. In practice, most cases carry only one charge, functionally reducing the problem to single-label classification in most settings.
- Prison-Term Estimation: Regression or classification: given $x$, predict $\hat{t}$ (months of imprisonment), matching the ground-truth term $t$.
Formally, for the classification subtasks the objective is to minimize cross-entropy or a comparable loss; for prison terms, the system minimizes a regression loss on the log scale,

$$
d = \left| \log(\hat{t} + 1) - \log(t + 1) \right|,
$$

which also underlies the piecewise score defined in the evaluation metrics (Zhong et al., 2018).
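Written out in code, the log-scale discrepancy is straightforward to use as a training loss; below is a minimal PyTorch sketch with hypothetical values, not competition code:

```python
import torch

def log_term_distance(pred_months: torch.Tensor, true_months: torch.Tensor) -> torch.Tensor:
    """d = |log(t_hat + 1) - log(t + 1)|: the log-scale discrepancy that serves
    both as a regression loss and as the quantity scored piecewise at evaluation."""
    return (torch.log(pred_months + 1.0) - torch.log(true_months + 1.0)).abs()

# Toy usage: mean log-scale error over two predictions.
pred = torch.tensor([10.0, 36.0])  # predicted months (hypothetical)
true = torch.tensor([12.0, 24.0])  # ground-truth months (hypothetical)
loss = log_term_distance(pred, true).mean()
```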
3. Evaluation Metrics
Distinct metrics are used for the classification and regression subtasks:
- Law Articles and Charges (classification):
  - Accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{y}_i = y_i]$, the fraction of exactly correct predictions.
  - Macro-Precision (MP), Macro-Recall (MR), and Macro-F₁: means over all classes, sensitive to class imbalance.
  - Micro-F₁: aggregates TP, FP, and FN globally before computing $F_1 = \frac{2PR}{P+R}$.
  - Combined Score: $S = 100 \times \frac{F_1^{\mathrm{macro}} + F_1^{\mathrm{micro}}}{2}$ for leaderboard ranking (Zhong et al., 2018).
- Prison Term (regression/classification):
  - Piecewise Score: For case $i$, define $d_i = \left|\log(\hat{t}_i + 1) - \log(t_i + 1)\right|$, the log-scale difference between predicted and true months. The per-case score $s_i$ is stepwise:

$$
s_i =
\begin{cases}
1.0 & d_i \le 0.2 \\
0.8 & 0.2 < d_i \le 0.4 \\
0.6 & 0.4 < d_i \le 0.6 \\
0.4 & 0.6 < d_i \le 0.8 \\
0.2 & 0.8 < d_i \le 1.0 \\
0 & d_i > 1.0
\end{cases}
$$

with the mean over all cases (scaled by 100 on the leaderboard) as the final metric.
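A compact implementation of both metrics, assuming the thresholds and combined-score formula as reconstructed above (a sketch, not the official evaluator):

```python
import numpy as np
from sklearn.metrics import f1_score

def classification_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Combined leaderboard score: mean of macro- and micro-F1, scaled to [0, 100]."""
    macro = f1_score(y_true, y_pred, average="macro")
    micro = f1_score(y_true, y_pred, average="micro")
    return 100.0 * (macro + micro) / 2.0

def term_score(pred_months: np.ndarray, true_months: np.ndarray) -> float:
    """Piecewise prison-term score: per-case credit from the log-difference, averaged."""
    d = np.abs(np.log(pred_months + 1.0) - np.log(true_months + 1.0))
    thresholds = np.array([0.2, 0.4, 0.6, 0.8, 1.0])    # step boundaries on d
    credits = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.0])  # credit awarded in each band
    per_case = credits[np.searchsorted(thresholds, d)]  # first band with threshold >= d
    return 100.0 * per_case.mean()
```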
A key observation is that while micro-averaged F₁ approaches 0.96 on the classification subtasks, macro metrics remain significantly lower, confirming the performance gap between head and tail labels (Zhong et al., 2018).
4. Baseline Models and Competition Solutions
Three baselines were established by Xiao et al. (2018):
| Model | Law Acc (%) | Law MP (%) | Law MR (%) | Charges Acc (%) | Charges MP (%) | Charges MR (%) | Terms Acc (%) | Terms MP (%) | Terms MR (%) |
|---|---|---|---|---|---|---|---|---|---|
| FastText | 93.3 | 45.8 | 38.1 | 94.3 | 50.9 | 39.7 | 74.6 | 48.0 | 24.5 |
| TF–IDF+SVM | 92.9 | 71.8 | 52.4 | 94.0 | 73.9 | 56.2 | 75.4 | 75.4 | 46.1 |
| CNN | 97.6 | 37.4 | 21.8 | 97.6 | 37.0 | 21.4 | 78.2 | 45.5 | 36.1 |
All models were trained on 1,710,856 cases, with 965,219 held out for evaluation. While the CNN achieves the highest accuracy, its macro-precision and macro-recall are depressed by the dominance of frequent classes. The gap highlights that most of the accuracy accrues from common labels, not the long tail (Xiao et al., 2018).
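For reference, the TF–IDF+SVM baseline can be approximated in a few lines of scikit-learn; the snippet below is a sketch with toy, pre-segmented inputs, not the authors' code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, pre-segmented fact descriptions and charge labels (hypothetical).
train_facts = ["盗窃 财物 价值 五千 元", "故意 伤害 他人 致 轻伤"]
train_charges = ["theft", "intentional_injury"]

model = make_pipeline(
    TfidfVectorizer(),  # word-level TF-IDF over segmented text
    LinearSVC(),        # one-vs-rest linear SVM over charge labels
)
model.fit(train_facts, train_charges)
print(model.predict(["抢劫 他人 财物"]))
```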
Top-performing competition teams adopted neural architectures, augmented with legal feature extraction, advanced embeddings, and ensemble methods. Strategies included:
- Text-CNN, BiLSTM/GRU with attention, Hierarchical Attention Networks (HAN), RCNN, and DPCNN encoders
- Multi-task and joint models encoding dependencies between law articles, charges, and terms
- Data-imbalance handling via oversampling/undersampling and focal loss, $\mathrm{FL}(p_t) = -(1-p_t)^{\gamma}\log(p_t)$ (see the sketch after this list)
- Word embeddings pre-trained on legal corpora; contextual representations (ELMo/BERT)
- Extraction of explicit structured features (e.g., defendant age, crime amount) and named-entity patterns
- Ensemble techniques (majority or weighted voting)
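As an illustration of the imbalance-handling item above, here is a minimal multi-class focal-loss sketch in PyTorch (γ = 2 is an assumed hyperparameter, and this is not any team's actual implementation):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t): down-weights easy, high-confidence
    examples so rare charges and articles contribute more gradient."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p_t of the true class
    pt = log_pt.exp()
    return (-((1.0 - pt) ** gamma) * log_pt).mean()

# Toy usage: 2 examples over 3 classes.
logits = torch.randn(2, 3)
targets = torch.tensor([0, 2])
loss = focal_loss(logits, targets)
```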
Prominent teams such as "nevermore," "jiachx," and "xlzhang" achieved micro-F₁ up to 0.96 and macro-F₁ up to ~0.84 for charges, with prison term scores peaking at 77.7 (Zhong et al., 2018). The leading approaches consistently leveraged advanced embeddings, explicit legal attributes, and dependency modeling.
5. Key Challenges and Error Analysis
Two core challenges pervade LJP in CAIL2018:
- Label Imbalance: The distribution is extreme, with head labels capturing nearly all cases. Macro metrics demonstrate that even sophisticated loss designs (e.g., focal loss) leave the long tail of rare articles and charges essentially unpredictable: many rare classes register per-class F₁ below 0.30.
- Complexity of Prison-Term Prediction: Term estimation lags behind charge and law-article prediction. Top teams observed that factors such as defendant background, remorse, plea bargaining, and discretionary aspects of sentencing are not expressible solely from the fact description, limiting attainable performance.
Error patterns reveal that:
- Distinguishing closely related charges (e.g., robbery vs. theft) is difficult without extracting additional discriminative attributes (e.g., whether violence was used).
- Label confusion persists among classes with overlapping factual patterns.
- For prison-term estimation, the lack of fine-grained features and of reasoning over judicial precedents and statutory nuances constrains accuracy (Zhong et al., 2018).
6. Impact and Future Research Directions
CAIL2018’s introduction of a massive, richly annotated legal dataset catalyzed rapid methodological progress in Chinese legal AI. It established standard benchmarks and methodologies and identified persistent bottlenecks in LJP:
- Unified neural modeling frameworks now dominate the field, achieving near-saturation on head labels but not on the long tail.
- Integrating hierarchical legal knowledge, modeling label dependencies via DAGs, and leveraging attribute extraction improve discriminative ability.
The task’s complexity has motivated research into:
- Imbalance-aware training (resampling, meta-learning)
- Pre-trained LLMs (BERT, XLNet) specifically tuned for legal domains
- Graph neural networks over statute or case graphs
- Interpretation mechanisms to align fact segments with legal outcomes
- Few-shot and zero-shot generalization for emerging legal categories
- Incorporation of richer metadata (judge IDs, court hierarchies) and temporal splits for robust evaluation
A plausible implication is that progress on LJP tasks such as those posed by CAIL2018 is tightly coupled to advances in both representation learning for long-tail categorization and structured, interpretable reasoning over legal facts and statutes (Zhong et al., 2018).
7. Significance within Legal AI
By presenting a realistic, challenging benchmark at unprecedented scale, CAIL2018 has become a foundational resource for the legal judgment prediction community. Its framework, metrics, and curated dataset have set the methodological standard, helping ensure that advances reflect actual courtroom conditions and catalyzing further research into interpretability, fairness, and cross-domain generalization of legal AI (Xiao et al., 2018; Zhong et al., 2018).