MADON: Judicial Argument Mining Dataset
- MADON is a curated dataset of 272 Czech apex-court decisions, featuring expert-labeled paragraph-level argument types and holistic formalism labels.
- It employs a dual-level annotation schema that differentiates formalistic and non-formalistic reasoning, ensuring detailed legal argument analysis.
- State-of-the-art NLP pipelines using ModernBERT and Llama 3.1 8B demonstrate high classification accuracy and enhanced replicability for judicial studies.
The MADON (Mining Arguments in Decisions of Nations) dataset is a curated resource of annotated legal documents from the Czech Supreme Court and Supreme Administrative Court, designed to enable the systematic and replicable study of judicial reasoning and formalism through advanced NLP methodologies. It provides fine-grained, expert-labeled paragraph-level argumentation types and decision-level holistic formalism labels, making it a cornerstone for research on computational legal studies, argument mining, and comparative judicial philosophy (Koref et al., 12 Dec 2025).
1. Data Composition and Annotation Schema
The dataset comprises 272 apex-court decisions: 182 from the Supreme Court and 90 from the Supreme Administrative Court, encompassing 9,183 paragraphs. Of these, 1,237 paragraphs contain at least one argument annotation, totaling 1,913 argument instances. Annotation follows a dual-level schema: detailed paragraph-level argument types and a binary, decision-level holistic formalism label.
Eight argument types are delineated, grouped as follows:
- Formalistic (text-bound) argument types:
- LIN (Linguistic Interpretation): Focus on statutory text, grammar, and unambiguous diction.
- SI (Systemic Interpretation): Placement within the legal system, including lex specialis/posterior and conformity with constitutional or EU law.
- CL (Case Law): Concrete citations of domestic or European judgments.
- D (Doctrine): References to scholarly works and commentaries.
- Non-formalistic (purpose- and principle-driven) argument types:
- HI (Historical Interpretation): Legislative history and explanatory memoranda.
- PL (Principles of Law & Values): Appeals to foundational principles, such as good faith and legal certainty.
- TI (Teleological Interpretation): Purpose-oriented reasoning, analogies, aim-based interpretation.
- PC (Practical Consequences): Consequentialist considerations of legal interpretations.
Each decision also receives a holistic formalism label, established via expert review against five tenets of Central and Eastern European (CEE) formalism: 1) prevalence of text-bound arguments, 2) narrow focus on local rule, 3) exclusion of external sources, 4) procedural dismissals, 5) scarce reasoning.
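The dual-level schema above can be encoded as plain data structures. This is an illustrative sketch only: the label codes and groupings follow the description above, but the dict layout and helper function are hypothetical, not part of the released dataset.

```python
# Illustrative encoding of the MADON argument-type schema.
ARGUMENT_TYPES = {
    # formalistic (text-bound)
    "LIN": "Linguistic Interpretation",
    "SI":  "Systemic Interpretation",
    "CL":  "Case Law",
    "D":   "Doctrine",
    # non-formalistic (purpose- and principle-driven)
    "HI":  "Historical Interpretation",
    "PL":  "Principles of Law & Values",
    "TI":  "Teleological Interpretation",
    "PC":  "Practical Consequences",
}

FORMALISTIC = {"LIN", "SI", "CL", "D"}

def group_of(code: str) -> str:
    """Map an argument-type code to its reasoning group."""
    if code not in ARGUMENT_TYPES:
        raise KeyError(f"unknown argument type: {code}")
    return "formalistic" if code in FORMALISTIC else "non-formalistic"
```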
2. Data Collection, Sampling, and Preprocessing
Stratified sampling from a corpus of 300,000 Czech court decisions (1997–2024) ensures coverage across the civil (122), criminal (58), and administrative (90) domains, as well as balancing procedural and on-merits decisions and spanning the period 2003–2023. Text extraction enforces anonymization, and segmentation respects original court paragraph structures. Tokenization is tailored for models supporting sequences up to 32,000 tokens, and pilot-study decisions are excluded from pretraining corpora to prevent data leakage.
Dataset partitions are stratified by both court and holistic label into training (70%), validation (20%), and test (10%) splits. The resulting split configuration is summarized below; note the pronounced class imbalance (87% of paragraphs contain no argument, while CL accounts for 37.4% of argument instances, TI+PL for 32.3%, and LIN for only 5.7%):
| Court / Label | Train | Val. | Test | Total |
|---|---|---|---|---|
| Formalistic | 112 | 32 | 17 | 161 |
| Non-Formalistic | 77 | 22 | 12 | 111 |
| Supreme Court | 127 | 36 | 19 | 182 |
| Supreme Admin. Court | 62 | 18 | 10 | 90 |
| Total | 189 | 54 | 29 | 272 |
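The stratified partitioning can be sketched as follows. This is a minimal stdlib-only illustration of the general technique (splitting within each court-by-label stratum so every partition preserves the strata proportions); the function name and toy data are assumptions, not the paper's actual code.

```python
import random
from collections import defaultdict

def stratified_split(items, strata_key, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split items into train/val/test while preserving the proportion
    of each stratum (here: court x holistic label) in every partition."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for item in items:
        buckets[strata_key(item)].append(item)
    splits = ([], [], [])
    for stratum in sorted(buckets):
        group = buckets[stratum]
        rng.shuffle(group)
        n = len(group)
        n_train = round(n * fractions[0])
        n_val = round(n * fractions[1])
        splits[0].extend(group[:n_train])
        splits[1].extend(group[n_train:n_train + n_val])
        splits[2].extend(group[n_train + n_val:])
    return splits

# Toy usage with (court, label) strata mimicking the MADON partitioning.
decisions = [{"court": c, "label": l, "id": i}
             for i, (c, l) in enumerate(
                 [("SC", "F")] * 10 + [("SC", "NF")] * 10
                 + [("SAC", "F")] * 10 + [("SAC", "NF")] * 10)]
train, val, test = stratified_split(
    decisions, strata_key=lambda d: (d["court"], d["label"]))
```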
3. Annotation Protocol, Quality Control, and Agreement
Annotation was performed by four Charles University law students, supervised by a PhD researcher in legal argumentation. Guideline development synthesized theoretical frameworks (Alexy 2010, MacCormick 2016, Walton et al. 2021) and local legal-methodology literature, iteratively refined across 200 pilot decisions.
The process comprised:
- One-week introduction, five weeks of training (60+ decisions), six weeks of primary annotation (272 decisions), and a finalization week.
- Weekly consistency reviews.
- Use of the INCEpTION annotation platform over 1,000+ annotator hours.
Quality is evidenced by substantial inter-annotator agreement: Cohen’s κ = 0.65 for the binary holistic label, and Krippendorff’s unitized α_u ranging from 0.94–0.95 (case law), 0.90–0.94 (doctrine), 0.65–0.76 (principles), 0.68–0.80 (historical), and 0.63–0.65 (teleological). Lower-agreement categories were subject to arbiter review and forced-agreement rounds.
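As a reference for the agreement statistic used above, Cohen’s κ corrects observed agreement between two annotators for chance agreement. The following is a minimal sketch of the standard definition, not the paper’s evaluation code:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels:
    observed agreement p_o corrected for chance agreement p_e."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)
```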
4. Model Training Methodology and Task Evaluation
To adapt NLP pipelines to Czech legal language, both ModernBERT (encoder, 395M parameters) and Llama 3.1 8B (decoder) models underwent continued pretraining (CPT) on 300,000 anonymized decisions:
- Llama 3.1 8B: 1 epoch, batch size 16, max length 32,000 tokens, Unsloth optimization.
- ModernBERT: Two-phase CPT: custom BPE tokenizer (mask 30%, batch 32, max length 8,192), followed by standard MLM (mask 15%, batch 8).
For the argument-type multilabel task, Binary Cross-Entropy (BCE) loss was compared to Asymmetric Loss (ASL) to handle class imbalance. Per label, with predicted probability p and binary target y, ASL is

L_ASL = −y · (1 − p)^{γ+} · log(p) − (1 − y) · (p_m)^{γ−} · log(1 − p_m),  where p_m = max(p − m, 0),

with hyperparameters γ+ (positive focusing), γ− (negative focusing), and probability margin m, which together down-weight easy negatives.
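A minimal NumPy sketch of ASL follows. The default hyperparameter values below are those commonly used in the original ASL paper (Ben-Baruch et al.), not necessarily the values chosen for MADON, which are not given in this summary:

```python
import numpy as np

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric Loss for multilabel classification.
    p: predicted probabilities in (0, 1); y: binary targets.
    Negatives are probability-shifted by `margin` and focused by
    `gamma_neg`, so easy negatives contribute almost nothing."""
    eps = 1e-8
    p_m = np.clip(p - margin, 0.0, 1.0)  # shifted probability for negatives
    loss_pos = y * (1 - p) ** gamma_pos * np.log(p + eps)
    loss_neg = (1 - y) * p_m ** gamma_neg * np.log(1 - p_m + eps)
    return float(-(loss_pos + loss_neg).mean())
```

With 87% of paragraphs carrying no argument, this asymmetry keeps the abundant, confidently-rejected negatives from dominating the gradient.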
Task structure:
- Task 1: Argument-presence detection (binary paragraph classification).
- Task 2: Argument-type multilabel paragraph classification.
- Task 3: Decision-level formalism classification.
Three model families were used: Llama 3.1 8B (full fine-tuning and LoRA PEFT), ModernBERT, and a feature-based MLP. The best-performing Task 3 pipeline chains them: ModernBERT filters out non-argumentative paragraphs, the ASL-trained Llama 3.1 classifies argument types, and an MLP predicts the holistic formalism label from the extracted argument-type frequencies.
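The three-stage structure can be sketched as plain function composition. Every stage below is a toy stand-in for the corresponding trained model (the heuristics, function names, and 0.5 threshold are all illustrative assumptions); only the filter → classify → aggregate → predict control flow reflects the described pipeline:

```python
def filter_paragraphs(paragraphs):
    """Stage 1 (ModernBERT stand-in): keep paragraphs predicted
    to contain at least one argument."""
    return [p for p in paragraphs if "arg" in p]  # toy heuristic

def classify_arguments(paragraph):
    """Stage 2 (Llama stand-in): return the set of argument-type codes."""
    return {"CL"} if "case" in paragraph else {"TI"}  # toy heuristic

def predict_formalism(type_frequencies):
    """Stage 3 (MLP stand-in): threshold on the share of text-bound types."""
    formalistic = sum(type_frequencies.get(t, 0) for t in ("LIN", "SI", "CL", "D"))
    total = sum(type_frequencies.values()) or 1
    return "formalistic" if formalistic / total >= 0.5 else "non-formalistic"

def pipeline(decision_paragraphs):
    """Chain the three stages over one decision's paragraphs."""
    freqs = {}
    for p in filter_paragraphs(decision_paragraphs):
        for t in classify_arguments(p):
            freqs[t] = freqs.get(t, 0) + 1
    return predict_formalism(freqs)
```

Because stage 1 discards most non-argumentative paragraphs before the expensive decoder runs, this layout also explains the pipeline's reported efficiency gains.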
5. Results and Performance Benchmarks
Performance is reported using macro-F1 scores. For a task with C classes,

macro-F1 = (1/C) · Σ_{c=1}^{C} F1_c,  where F1_c = 2 · P_c · R_c / (P_c + R_c)

with per-class precision P_c and recall R_c. For multilabel Task 2, the positive-class and negative-class F1 scores of each label are averaged, F1_label = (F1_pos + F1_neg) / 2, before the macro average is taken over labels.
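The macro average weights every class equally regardless of support, which matters here given the heavy class imbalance. A minimal from-scratch sketch of the standard definition (not the paper's evaluation code):

```python
def f1(tp, fp, fn):
    """Standard F1 from counts; 0 when undefined (no predictions, no positives)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)
```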
Key results are as follows:
| Task / Model | Macro-F1 |
|---|---|
| Argument-Presence Detection | |
| ModernBERT-large-Czech-Legal CPT | 82.6% (best) |
| Llama 3.1 8B Base | 79.5% |
| Baselines (majority/random) | 46.6% / 42.0% |
| Argument-Type Classification | |
| Llama 3.1 8B Base + ASL | 77.5% (best) |
| ModernBERT CPT + ASL | 71.6% |
| Baselines | 49.4% / 35.5% |
| Holistic Formalism Classification | |
| Multi-stage pipeline (ModernBERT→Llama→MLP) | 83.2% (best) |
| ModernBERT-large-Czech-Legal CPT | 82.2% |
| End-to-end Llama 3.1 8B Base | 75.9% |
| Baselines | 36.9% / 41.4% |
| Oracle (Gold-feature MLP) | 91.7% |
The three-stage pipeline demonstrates enhanced performance, computational efficiency (by filtering ~85% of non-argumentative text), and increased explainability.
6. Empirical Findings and Theoretical Significance
The empirical analysis refutes the prevailing narrative of persistent, text-bound formalism in CEE apex courts. Key findings include:
- From 2003–2011, both apex courts exhibited nearly equal levels of formalistic reasoning (~60%).
- Post-2011, the Supreme Administrative Court evidences a pronounced shift, with over 70% non-formalistic decisions, while the Supreme Court's proportion remains stable.
- Case law (CL) citations dominate (37.4%), followed by TI+PL (32.3%). LIN arguments are rare (5.7%).
- Czech judicial reasoning evidences precedent-orientation and purposive reasoning, challenging stereotypes of rigid legal formalism in the region.
A plausible implication is that computational legal argument mining, as instantiated by MADON, can robustly distinguish between judicial philosophies, and that prior comparative-legal typologies require revision in light of empirical evidence.
7. Availability, Licensing, and Replicability
The MADON dataset, accompanying 300,000-document CPT corpus, comprehensive guidelines (PDF and summary), model checkpoints, and training pipelines are publicly available at https://github.com/trusthlt/madon/. The dataset is licensed under CC BY 4.0; code under MIT/Apache 2.0. The methodology—including the annotation protocol, CPT on domain-specific corpora, the use of ASL loss for class imbalance, and the three-stage classification pipeline—is explicitly designed for replicability across languages and jurisdictions. This design facilitates comparative studies in computational legal analysis, including applications to other European and US court systems.
For full details, including per-label F1 metrics and appendices, see (Koref et al., 12 Dec 2025).