Arguinas: Dataset for Argument Reconstruction

Updated 25 March 2026

Arguinas is a dataset of natural-language arguments annotated with explicit premise–conclusion reconstructions for critical thinking tasks.
It employs a six-stage GAAR pipeline integrating LLM judgment, formal verification via a SAT solver, and faithfulness tests to ensure logical rigor.
The corpus comprises 2,850 instances from seven sources and supports benchmarking of models on argument mining, reasoning, and legal inference.

Arguinas is a large-scale, high-quality dataset of natural-language arguments annotated with explicit, logically structured premise–conclusion reconstructions. It was introduced as the core artifact for training and evaluating models on argument reconstruction and related critical thinking tasks. The dataset is generated by the GAAR (Generalized Automatic Argument Reconstruction) engine—a symbolic–neural system that guarantees both logical validity and faithful representation of the original arguments—enabling robust benchmarking and research on reasoning, argument mining, and LLM-based critical thinking (Ryu et al., 18 Mar 2026).

1. Methodological Foundations: The GAAR Pipeline

The Arguinas dataset was synthesized via GAAR, a six-stage, iterative reconstruction pipeline. Outputs are scrutinized and refined through a combination of LLM judgment (Claude Sonnet 4.5), symbolic reasoning (SAT-based formal verification), and LLM-based faithfulness tests. The six stages are:

Fallacy Detection: The LLM identifies formal or informal fallacies (e.g., affirming the consequent, hasty generalization), influencing subsequent processing steps. If a formal fallacy is detected, validity checking is bypassed.
Initial Argument Reconstruction: The LLM is provided with the argument, formal reasoning types (deduction, induction, abduction, analogy), and fine-grained Walton schemes. It emits a set of premises $P_1, \ldots, P_k$ (explicit and implicit) and a single conclusion $C$.
Formalization to First-Order Logic: Each premise and the conclusion are translated into FOL, with a mapping (“Keys” table) from NL elements to logical predicates and constants.
Validity Checking & Premise Pruning: A SAT solver tests if the FOL premises entail the FOL conclusion. If inferential validity fails—and no fallacy was detected—feedback is given for LLM resynthesis. The SAT solver also identifies minimal supporting premise subsets, pruning unused premises.
Streamlining (NL Back-Translation): The set of minimal supporting FOL formulas is back-translated to a streamlined, logically ordered NL reconstruction.
Faithfulness Judgment: A separate LLM judge applies three faithfulness criteria—accuracy, completeness, parsimony. On any failure, targeted feedback is delivered; otherwise, the reconstruction is accepted.

This pipeline ensures each instance is deductively valid (where appropriate), faithful to the source, and logically minimal.
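
The validity check and premise pruning in stage four can be illustrated with a small sketch. It assumes a propositional simplification and a brute-force truth-table search in place of the FOL formalization and SAT solver that GAAR actually uses; the function names `entails` and `minimal_support` and the third, irrelevant premise are hypothetical and serve only as illustration.

```python
from itertools import combinations, product

def entails(premises, conclusion, atoms):
    """Return True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counter-model: premises true, conclusion false
    return True

def minimal_support(premises, conclusion, atoms):
    """Return indices of a smallest premise subset that still entails the conclusion, or None."""
    for size in range(len(premises) + 1):
        for idx in combinations(range(len(premises)), size):
            if entails([premises[i] for i in idx], conclusion, atoms):
                return idx
    return None

# Toy encoding (assumed, not taken from the dataset).
atoms = ["fear", "deters", "costly"]
premises = [
    lambda e: e["fear"],                       # P1: criminals fear dying
    lambda e: (not e["fear"]) or e["deters"],  # P2: fear -> deterrence
    lambda e: e["costly"],                     # P3: unrelated premise, should be pruned
]
conclusion = lambda e: e["deters"]

valid = entails(premises, conclusion, atoms)
support = minimal_support(premises, conclusion, atoms)
print(valid, support)
```

Here the full premise set entails the conclusion, and the search keeps only the premises that are actually needed, which mirrors the pruning of unused premises. Exhaustive enumeration is exponential in the number of atoms and premises, so it is only suitable for toy inputs.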

2. Dataset Structure and Formatting

Each Arguinas entry comprises:

(Optional) Title / Debate Topic
(Optional) Context / Background
Raw Argument: One or more NL paragraphs.
Reconstructed Argument:
- Numbered premises $P_1$, $P_2$, …, with explicit/implicit status indicated.
- A single conclusion $C$.

Internally, each premise and the conclusion are also formalized in FOL, but public releases contain only the NL forms (formulas can be regenerated if needed). The structure is illustrated in the following abstract format:

Raw Argument: The death penalty deters crime because criminals fear dying...

Reconstructed Argument:
P₁ (explicit): Criminals fear dying.
P₂ (implicit): If criminals fear dying, then they are less likely to commit crimes if the death penalty exists.
C: The death penalty deters crime.
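
For readers who want to handle entries programmatically, the layout above could be modeled roughly as follows. This is a sketch under assumptions: the class names `Premise` and `ArguinasEntry`, the field names, and the `render` helper are hypothetical and do not reflect the released file schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Premise:
    text: str
    implicit: bool  # True if the premise is not stated in the raw argument

@dataclass
class ArguinasEntry:
    raw_argument: str
    premises: List[Premise]
    conclusion: str
    title: Optional[str] = None    # optional title / debate topic
    context: Optional[str] = None  # optional background

    def render(self) -> str:
        """Format the reconstruction as numbered premises followed by the conclusion."""
        lines = []
        for i, p in enumerate(self.premises, start=1):
            status = "implicit" if p.implicit else "explicit"
            lines.append(f"P{i} ({status}): {p.text}")
        lines.append(f"C: {self.conclusion}")
        return "\n".join(lines)

entry = ArguinasEntry(
    raw_argument="The death penalty deters crime because criminals fear dying...",
    premises=[
        Premise("Criminals fear dying.", implicit=False),
        Premise(
            "If criminals fear dying, then they are less likely to commit "
            "crimes if the death penalty exists.",
            implicit=True,
        ),
    ],
    conclusion="The death penalty deters crime.",
)
print(entry.render())
```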

3. Data Sources and Corpus Statistics

Arguinas covers 2,850 argument–reconstruction pairs drawn from seven diverse origins:

Source Name	Instances	Avg Words/Arg	Avg # Premises	% Implicit Premises
procon.org	282	178.4 ± 92.5	6.45 ± 2.98	45.5 ± 19.4
Pros-and-cons 1950	119	43.0 ± 8.8	5.59 ± 1.81	47.5 ± 17.9
Pros-and-cons 2010	373	85.3 ± 25.7	5.80 ± 2.21	51.9 ± 18.0
NYT Room for Debate	297	393.5 ± 86.3	8.11 ± 3.38	41.8 ± 15.5
Anthropic-Persuasion	287	253.9 ± 35.8	8.52 ± 3.17	40.6 ± 17.7
Synthetic non-fallacious	1,318	330.7 ± 194.1	9.47 ± 4.34	37.2 ± 14.8
Synthetic fallacious	174	259.4 ± 135.3	7.23 ± 3.63	33.6 ± 21.0
Total	2,850	266.7 ± 179.6	8.09 ± 3.90	41.3 ± 17.3

The dataset is split into 2,565 train (90%) and 285 test (10%) instances.
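
The instance counts and the split sizes can be cross-checked with a few lines of arithmetic. The counts below are copied from the table above; the variable names are illustrative only.

```python
# Per-source instance counts, copied from the corpus statistics table.
counts = {
    "procon.org": 282,
    "Pros-and-cons 1950": 119,
    "Pros-and-cons 2010": 373,
    "NYT Room for Debate": 297,
    "Anthropic-Persuasion": 287,
    "Synthetic non-fallacious": 1318,
    "Synthetic fallacious": 174,
}

total = sum(counts.values())
test_size = total // 10          # 10% held out for testing
train_size = total - test_size   # remaining 90% for training

# Share of the corpus that is synthetic (fallacious plus non-fallacious).
synthetic = counts["Synthetic non-fallacious"] + counts["Synthetic fallacious"]
synthetic_share = synthetic / total

print(total, train_size, test_size, round(synthetic_share, 3))
```

The counts sum to 2,850 and reproduce the 2,565/285 split; slightly more than half of the corpus (1,492 instances) is synthetic.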

4. Annotation Schema and Quality Metrics

Annotation in Arguinas is entirely automated via GAAR. Nevertheless, two external validation procedures are reported:

NL→FOL translation accuracy: 99.04% (207/209) achieved in expert spot-checks, indicating that logical formulations faithfully matched their NL counterparts.
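Faithfulness judgment reliability: Comparing a human expert with the LLM judge on the three faithfulness criteria yields Cohen’s $\kappa = 0.5361$ and an accuracy of 89.5% (ties excluded). On the commonly used Landis and Koch scale, a $\kappa$ in this range indicates moderate rather than substantial agreement, so the LLM judge tracks the expert closely but is not equivalent to one.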
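
As a reminder of how the agreement statistic is defined, Cohen's kappa compares the observed agreement $p_o$ with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two raters; the label lists are made-up toy data, not the study's annotations, and the function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy labels (1 = reconstruction judged faithful, 0 = not faithful).
human = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
judge = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
kappa = cohens_kappa(human, judge)
print(round(kappa, 4))
```

In the toy data the raters agree on 8 of 10 items, yet kappa is much lower than 0.8 because both raters label most items as faithful, which makes chance agreement high. The same effect explains how a high raw accuracy can coexist with a considerably lower kappa.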

Each entry is systematically labeled with:

Source identifier.
Author identity type (human editor, journalist, educator, GPT-5 family model).
Argument type according to GAAR’s taxonomy (deduction, induction, abduction, analogy, or one of 60 Walton schemes).
Fallacy presence/type (formal, informal, none).

These allow for source stratification and targeted experimentation.
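
A stratified analysis over these labels might look like the sketch below. The record keys (`source`, `author_type`, `argument_type`, `fallacy`) and the four records are invented for illustration and are not taken from the released data.

```python
from collections import defaultdict

# Invented toy records; key names and values are assumptions, not the real schema.
records = [
    {"source": "procon.org", "author_type": "human editor",
     "argument_type": "deduction", "fallacy": "none"},
    {"source": "Synthetic fallacious", "author_type": "GPT-5 family model",
     "argument_type": "deduction", "fallacy": "formal"},
    {"source": "NYT Room for Debate", "author_type": "journalist",
     "argument_type": "analogy", "fallacy": "none"},
    {"source": "Synthetic fallacious", "author_type": "GPT-5 family model",
     "argument_type": "induction", "fallacy": "informal"},
]

def stratify(items, key):
    """Group records by the value of one metadata label."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return dict(groups)

by_source = stratify(records, "source")
by_fallacy = stratify(records, "fallacy")

# Example of targeted selection: keep only non-fallacious arguments.
non_fallacious = [r for r in records if r["fallacy"] == "none"]

for source, group in by_source.items():
    print(source, len(group))
```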

5. Usage Scenarios and Example Instances

Arguinas is optimized for:

Training and evaluation of models specifically on argument reconstruction, a crucial subtask in critical thinking and logical reasoning.
Downstream tasks such as argument evaluation, legal reasoning, and logical inference benchmarks.
Probing LLMs for faithfulness, structured reasoning, and ability to distinguish valid from fallacious inferences.

Examples from the corpus demonstrate its broad domain coverage and structured annotation, including both explicit and implicit premise identification.

6. Distinctive Features and Research Relevance

Arguinas is distinguished from prior datasets by:

Logical rigor: Every (non-fallacious) reconstruction is SAT-verified for validity.
Faithfulness: Tri-criteria assessment (accuracy, completeness, parsimony) enforces close adherence to argument meaning.
Scale and domain-generality: The corpus is multi-source, cross-topic, and incorporates both naturally-occurring and synthetic (including fallacious) material.
Structural richness: Explicit marking of premise status and fine-grained type annotation.
Automated provenance: Each instance’s full synthesis metadata is preserved for reproducibility and selective data augmentation.

A plausible implication is that Arguinas provides a strong testbed for developing and assessing models that must not only analyze argumentative text but also reconstruct, formalize, and reason over combinations of implicit and explicit premises. The reported experiments indicate that models trained on argument reconstruction, especially with Arguinas supervision, outperform models without such training on a range of critical thinking tasks (Ryu et al., 18 Mar 2026).

References (1)

Ryu et al. (2026). Argument Reconstruction as Supervision for Critical Thinking in LLMs.
