Arguinas: Dataset for Argument Reconstruction
- Arguinas is a dataset of natural-language arguments annotated with explicit premise–conclusion reconstructions for critical thinking tasks.
- It employs a six-stage GAAR pipeline integrating LLM judgment, formal verification via a SAT solver, and faithfulness tests to ensure logical rigor.
- The corpus, featuring 2,850 instances from diverse sources, benchmarks models in argument mining, reasoning, and legal inference.
Arguinas is a large-scale, high-quality dataset of natural-language arguments annotated with explicit, logically structured premise–conclusion reconstructions. It was introduced as the core artifact for training and evaluating models on argument reconstruction and related critical thinking tasks. The dataset is generated by the GAAR (Generalized Automatic Argument Reconstruction) engine—a symbolic–neural system that guarantees both logical validity and faithful representation of the original arguments—enabling robust benchmarking and research on reasoning, argument mining, and LLM-based critical thinking (Ryu et al., 18 Mar 2026).
1. Methodological Foundations: The GAAR Pipeline
The Arguinas dataset was synthesized via GAAR, a six-stage, iterative reconstruction pipeline. Outputs are checked and refined through a combination of LLM judgment (Claude Sonnet 4.5), symbolic reasoning (SAT-based formal verification), and LLM-based faithfulness tests. The six stages are:
- Fallacy Detection: The LLM identifies formal or informal fallacies (e.g., affirming the consequent, hasty generalization), influencing subsequent processing steps. If a formal fallacy is detected, validity checking is bypassed.
- Initial Argument Reconstruction: The LLM is provided with the argument, formal reasoning types (deduction, induction, abduction, analogy), and fine-grained Walton schemes. It emits a set of premises P₁, …, Pₙ (explicit and implicit) and a single conclusion C.
- Formalization to First-Order Logic: Each premise and the conclusion are translated into FOL, with a mapping (“Keys” table) from NL elements to logical predicates and constants.
- Validity Checking & Premise Pruning: A SAT solver tests if the FOL premises entail the FOL conclusion. If inferential validity fails—and no fallacy was detected—feedback is given for LLM resynthesis. The SAT solver also identifies minimal supporting premise subsets, pruning unused premises.
- Streamlining (NL Back-Translation): The set of minimal supporting FOL formulas is back-translated to a streamlined, logically ordered NL reconstruction.
- Faithfulness Judgment: A separate LLM judge applies three faithfulness criteria—accuracy, completeness, parsimony. On any failure, targeted feedback is delivered; otherwise, the reconstruction is accepted.
This pipeline ensures each instance is deductively valid (where appropriate), faithful to the source, and logically minimal.
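The validity-checking and premise-pruning stage can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the source: formulas are propositional rather than first-order, a brute-force truth-table search stands in for the SAT solver, and the `KEYS` table, the premise `P3`, and all function names are hypothetical. The idea it shows is the standard one: the premises entail the conclusion exactly when "premises AND NOT conclusion" is unsatisfiable, and pruning looks for the smallest premise subset that still entails the conclusion.

```python
from itertools import combinations, product

# Hypothetical "Keys" table: propositional atoms -> natural-language meaning.
KEYS = {
    "F": "Criminals fear dying.",
    "D": "The death penalty deters crime.",
    "X": "Executions are expensive.",  # invented, deliberately irrelevant
}

# Each formula is a function from a truth assignment (dict) to bool.
PREMISES = {
    "P1": lambda v: v["F"],                  # F
    "P2": lambda v: (not v["F"]) or v["D"],  # F -> D
    "P3": lambda v: v["X"],                  # X (not needed for the inference)
}


def conclusion(v):
    return v["D"]                            # D


def entails(premises, goal, atoms):
    """True iff no assignment satisfies all premises while falsifying goal."""
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(p(v) for p in premises) and not goal(v):
            return False  # counterexample found, so the argument is invalid
    return True


def minimal_support(premises, goal, atoms):
    """Smallest subset of named premises that still entails goal, else None."""
    names = sorted(premises)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if entails([premises[n] for n in subset], goal, atoms):
                return list(subset)
    return None


ATOMS = sorted(KEYS)
is_valid = entails(list(PREMISES.values()), conclusion, ATOMS)
support = minimal_support(PREMISES, conclusion, ATOMS)
print(is_valid, support)
```

In this toy case the argument is valid and the irrelevant premise `P3` is pruned, leaving `P1` and `P2`. Exhaustive search is exponential in the number of atoms and premises, so it is only suitable for illustration; a real solver would be needed at corpus scale.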
2. Dataset Structure and Formatting
Each Arguinas entry comprises:
- (Optional) Title / Debate Topic
- (Optional) Context / Background
- Raw Argument: One or more NL paragraphs.
- Reconstructed Argument:
  - Numbered premises P₁, P₂, …, with explicit/implicit status indicated.
  - A single conclusion C.
Internally, each premise and the conclusion are also formalized in FOL, but public releases contain only the NL forms (formulas can be regenerated if needed). The structure is illustrated in the following abstract format:
Raw argument: The death penalty deters crime because criminals fear dying...

P₁ (explicit): Criminals fear dying.
P₂ (implicit): If criminals fear dying, then they are less likely to commit crimes if the death penalty exists.
C: The death penalty deters crime.
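The entry layout described in this section could be modelled in code along the following lines. This is a sketch only: the class and field names (`ArgumentEntry`, `Premise`, `implicit_share`) are hypothetical and do not reflect a published schema; only the listed components (optional title and context, raw argument, premises with explicit/implicit status, single conclusion) come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Premise:
    text: str
    implicit: bool  # True if the premise is not stated in the raw argument


@dataclass
class ArgumentEntry:
    raw_argument: str
    premises: List[Premise]
    conclusion: str
    title: Optional[str] = None    # optional debate topic
    context: Optional[str] = None  # optional background

    def implicit_share(self) -> float:
        """Percentage of premises marked implicit (0.0 if there are none)."""
        if not self.premises:
            return 0.0
        n_implicit = sum(p.implicit for p in self.premises)
        return 100.0 * n_implicit / len(self.premises)


entry = ArgumentEntry(
    raw_argument="The death penalty deters crime because criminals fear dying...",
    premises=[
        Premise("Criminals fear dying.", implicit=False),
        Premise(
            "If criminals fear dying, then they are less likely to commit "
            "crimes if the death penalty exists.",
            implicit=True,
        ),
    ],
    conclusion="The death penalty deters crime.",
)
print(entry.implicit_share())  # 50.0
```

A per-entry implicit share of this kind is one way the "% Implicit Premises" statistic could be derived, although the source does not state how that column was computed.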
3. Data Sources and Corpus Statistics
Arguinas covers 2,850 argument–reconstruction pairs drawn from seven diverse origins:
| Source Name | Instances | Avg Words/Arg | Avg # Premises | % Implicit Premises |
|---|---|---|---|---|
| procon.org | 282 | 178.4 ± 92.5 | 6.45 ± 2.98 | 45.5 ± 19.4 |
| Pros-and-cons 1950 | 119 | 43.0 ± 8.8 | 5.59 ± 1.81 | 47.5 ± 17.9 |
| Pros-and-cons 2010 | 373 | 85.3 ± 25.7 | 5.80 ± 2.21 | 51.9 ± 18.0 |
| NYT Room for Debate | 297 | 393.5 ± 86.3 | 8.11 ± 3.38 | 41.8 ± 15.5 |
| Anthropic-Persuasion | 287 | 253.9 ± 35.8 | 8.52 ± 3.17 | 40.6 ± 17.7 |
| Synthetic non-fallacious | 1,318 | 330.7 ± 194.1 | 9.47 ± 4.34 | 37.2 ± 14.8 |
| Synthetic fallacious | 174 | 259.4 ± 135.3 | 7.23 ± 3.63 | 33.6 ± 21.0 |
| Total | 2,850 | 266.7 ± 179.6 | 8.09 ± 3.90 | 41.3 ± 17.3 |
The dataset is split into 2,565 train (90%) and 285 test (10%) instances.
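The corpus arithmetic can be checked with a short sketch. The per-source counts are taken from the table above; the splitting procedure itself (a seeded random shuffle, the `split_indices` helper, and the seed value) is an assumption for illustration, since the source does not say how the split was drawn or whether it was stratified by source.

```python
import random

# Instance counts per source, as listed in the table.
SOURCE_COUNTS = {
    "procon.org": 282,
    "Pros-and-cons 1950": 119,
    "Pros-and-cons 2010": 373,
    "NYT Room for Debate": 297,
    "Anthropic-Persuasion": 287,
    "Synthetic non-fallacious": 1318,
    "Synthetic fallacious": 174,
}

total = sum(SOURCE_COUNTS.values())


def split_indices(n, train_fraction=0.9, seed=0):
    """Shuffle 0..n-1 deterministically and cut at the train fraction."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # hypothetical seed, not from the source
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(total)
print(total, len(train_idx), len(test_idx))  # 2850 2565 285
```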
4. Annotation Schema and Quality Metrics
Annotation in Arguinas is entirely automated via GAAR. Nevertheless, two external validation procedures are reported:
- NL→FOL translation accuracy: 99.04% (207/209) achieved in expert spot-checks, indicating that logical formulations faithfully matched their NL counterparts.
- Faithfulness judgment reliability: A human expert and the LLM judge were compared on the three faithfulness criteria. The reported Cohen's κ and an accuracy of 89.5% (ties excluded) indicate substantial alignment but not perfect equivalence.
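For readers unfamiliar with the two agreement measures, the sketch below shows how they are conventionally computed. The translation-accuracy figure uses the counts given above (207 of 209). The `human` and `judge` label lists are invented toy data, not the study's annotations, and the `cohens_kappa` helper is a hypothetical name; it implements the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Translation accuracy from the reported spot-check counts.
translation_accuracy = 100.0 * 207 / 209
print(round(translation_accuracy, 2))  # 99.04

# Toy labels for illustration only (not data from the study).
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.583
```

In the toy data the raters agree on 8 of 10 items (p_o = 0.8) while chance agreement is 0.52, which is why κ is noticeably lower than raw accuracy.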
Each entry is systematically labeled with:
- Source identifier.
- Author identity type (human editor, journalist, educator, GPT-5 family model).
- Argument type according to GAAR’s taxonomy (deduction, induction, abduction, analogy, or one of 60 Walton schemes).
- Fallacy presence/type (formal, informal, none).
These allow for source stratification and targeted experimentation.
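Source stratification over these labels might look like the following sketch. The record layout and key names (`source`, `author_type`, `argument_type`, `fallacy`) are hypothetical stand-ins for the four labels listed above, and the four records are invented examples rather than corpus data.

```python
from collections import Counter, defaultdict

# Invented example records; key names are assumptions, not a published schema.
entries = [
    {"source": "procon.org", "author_type": "human editor",
     "argument_type": "deduction", "fallacy": "none"},
    {"source": "NYT Room for Debate", "author_type": "journalist",
     "argument_type": "analogy", "fallacy": "none"},
    {"source": "Synthetic fallacious", "author_type": "GPT-5 family model",
     "argument_type": "deduction", "fallacy": "formal"},
    {"source": "Synthetic fallacious", "author_type": "GPT-5 family model",
     "argument_type": "induction", "fallacy": "informal"},
]


def stratify(records, key):
    """Group records by the value stored under the given label key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


by_source = stratify(entries, "source")
fallacy_counts = Counter(e["fallacy"] for e in entries)
non_fallacious = [e for e in entries if e["fallacy"] == "none"]

print({source: len(group) for source, group in by_source.items()})
print(fallacy_counts)
```

Grouping by a label first and sampling within each group is one common way to build source-balanced or fallacy-only subsets for targeted experiments.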
5. Usage Scenarios and Example Instances
Arguinas is intended for:
- Training and evaluation of models specifically on argument reconstruction, a crucial subtask in critical thinking and logical reasoning.
- Downstream tasks such as argument evaluation, legal reasoning, and logical inference benchmarks.
- Probing LLMs for faithfulness, structured reasoning, and ability to distinguish valid from fallacious inferences.
Examples from the corpus demonstrate its broad domain coverage and structured annotation, including both explicit and implicit premise identification.
6. Distinctive Features and Research Relevance
Arguinas is distinguished from prior datasets by:
- Logical rigor: Every (non-fallacious) reconstruction is SAT-verified for validity.
- Faithfulness: Tri-criteria assessment (accuracy, completeness, parsimony) enforces close adherence to argument meaning.
- Scale and domain-generality: The corpus is multi-source, cross-topic, and incorporates both naturally-occurring and synthetic (including fallacious) material.
- Structural richness: Explicit marking of premise status and fine-grained type annotation.
- Automated provenance: Each instance’s full synthesis metadata is preserved for reproducibility and selective data augmentation.
A plausible implication is that Arguinas provides a strong testbed for the development and assessment of models that not only analyze argumentative text but must also reconstruct, formalize, and reason over complex combinations of implicit and explicit premises. Experimental results confirm that models trained on argument reconstruction—especially with Arguinas supervision—outperform models without such training on a range of critical thinking tasks (Ryu et al., 18 Mar 2026).