ARCHE Bench: Scientific Reasoning Benchmark
- ARCHE Bench is a benchmark dataset that quantifies LLMs’ ability to reconstruct formal scientific reasoning chains via graph-based Reasoning Logic Trees.
- It employs a structured annotation protocol that assigns each inference step to one of Peirce’s paradigms and enforces valid edge-pairing constraints on logical relations.
- Evaluation metrics like Entity Coverage and Reasoning Edge Accuracy reveal trade-offs between content completeness and logical rigor in LLM performance.
ARCHE Bench is a benchmark dataset and evaluation suite developed for quantifying the ability of LLMs to extract explicit, paradigm-grounded chains of scientific reasoning from scholarly texts. Unlike conventional chain-of-thought tasks, ARCHE focuses on reconstructing the underlying formal logical structure implicit in scientific argumentation, based on Peirce's paradigms of deduction, induction, and abduction. The benchmark operates over published literature, assembling a logically categorized directed acyclic graph—termed a Reasoning Logic Tree (RLT)—from Introduction passages and cited abstracts. This contextually anchored, graph-based approach allows rigorous assessment of both content completeness and step-by-step logical validity in scientific reasoning by LLMs (Li et al., 16 Nov 2025).
1. Rationale and Conceptual Framework
The motivation behind ARCHE Bench stems from the observation that LLMs often produce fluent yet structurally informal chains of reasoning that lack explicit grounding in scientific inference paradigms. Surface-level chain-of-thought narratives may obscure whether a model truly understands and can operate within formal logic categories. ARCHE Bench addresses this by tasking models with reconstructing the full latent reasoning chain as an RLT, ensuring each inferential step is (a) grounded in the source text, (b) unambiguously assigned to a formal paradigm, and (c) arranged in a tree structure with precisely labeled edge types. This framework targets the core challenge in scientific reasoning: mapping complex arguments to a canonical set of transparent, typologically-constrained inference steps.
2. Benchmark Construction and Dataset Statistics
ARCHE Bench is derived from 70 peer-reviewed, open-access articles from Nature Communications, covering Physical Sciences (35) and Biological Sciences (35), each selected to maximize scientific rigor and domain balance. Key statistics are:
| Statistic | Value | Comments |
|---|---|---|
| Number of articles | 70 (35 per domain) | Physical/Biological |
| Introduction sentences (total) | 2,164 | Avg. 30.9/article |
| Cited references | 1,891 | Avg. 27.0/article |
| Extracted viewpoints | 38,739 | Avg. 77.4/article |
For each article, viewpoints—atomic semantic units of fact or inference—are systematically extracted from Introduction sentences using GPT-4o prompt templates. Abstracts of cited references are retrieved via the Semantic Scholar API, and their viewpoints similarly extracted. The dataset does not implement train/validation/test splits; instead, evaluation occurs in a zero-shot mode, with every article processed independently.
3. Annotation Protocol and Reasoning Logic Tree Formalism
Each viewpoint is annotated with a three-integer source coordinate:
- (x, 0, 0): the x-th Introduction sentence;
- (x, y, 0): the y-th viewpoint extracted from sentence x;
- (x, y, z) with z > 0: a viewpoint extracted from the z-th reference cited by sentence x;
- (0, 0, 0): implicit knowledge or intermediate nodes.

The resulting node categories are Introduction sentence, Introduction viewpoint, Reference viewpoint, and Implicit knowledge.
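The coordinate scheme maps directly onto four node categories. The following is an illustrative sketch of that mapping, not the authors' code; the function name and category strings are hypothetical:

```python
# Hypothetical sketch: classify an ARCHE source coordinate (x, y, z)
# into one of the four node categories described above.

def classify_coordinate(coord):
    """Map a three-integer source coordinate to its node category."""
    x, y, z = coord
    if (x, y, z) == (0, 0, 0):
        return "implicit_knowledge"      # implicit or intermediate node
    if z > 0:
        return "reference_viewpoint"     # from the z-th reference cited by sentence x
    if y > 0:
        return "introduction_viewpoint"  # y-th viewpoint of sentence x
    return "introduction_sentence"       # the x-th Introduction sentence

print(classify_coordinate((3, 0, 0)))   # introduction_sentence
print(classify_coordinate((3, 2, 0)))   # introduction_viewpoint
print(classify_coordinate((3, 2, 1)))   # reference_viewpoint
print(classify_coordinate((0, 0, 0)))   # implicit_knowledge
```

Because the three fields are checked in order of specificity (implicit, then reference, then viewpoint), each coordinate resolves to exactly one category.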
The RLT is represented as a single-rooted directed acyclic graph in DOT format, featuring:
- Nodes: Each labeled by its source coordinate and a succinct transcription.
- Edges: Each edge labeled by one of six fine-grained types mapped to Peirce’s three paradigms:
  - Deduction: deduction-rule (DR), deduction-case (DC)
  - Induction: induction-case (ICa), induction-common (ICo)
  - Abduction: abduction-phenomenon (AP), abduction-knowledge (AK)
Every inference step must obey strict edge-pairing constraints:
- Deductive step: DR + DC → conclusion
- Inductive step: ICa + ICo → generalization
- Abductive step: AP + AK → hypothesis
This explicit formalism enforces transparent paradigm allegiance for every node and edge, facilitating reproducible and interpretable reasoning chain extraction.
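The edge-pairing constraints can be checked mechanically. Below is an illustrative sketch, not the official ARCHE validator: it verifies that every node with incoming edges receives exactly one valid premise pair.

```python
# Illustrative sketch (not the official ARCHE validator): check that each
# non-leaf node's incoming edges form one of the three valid premise pairs.

VALID_PAIRS = {
    frozenset({"DR", "DC"}),    # deduction: rule + case -> conclusion
    frozenset({"ICa", "ICo"}),  # induction: case + common -> generalization
    frozenset({"AP", "AK"}),    # abduction: phenomenon + knowledge -> hypothesis
}

def check_edge_pairing(edges):
    """edges: list of (source, target, label) triples.
    Returns the targets whose incoming edges violate the pairing rules."""
    incoming = {}
    for src, dst, label in edges:
        incoming.setdefault(dst, []).append(label)
    violations = []
    for node, labels in incoming.items():
        if len(labels) != 2 or frozenset(labels) not in VALID_PAIRS:
            violations.append(node)
    return violations

edges = [
    ("rule", "concl", "DR"), ("case", "concl", "DC"),  # valid deductive step
    ("obs", "hyp", "AP"), ("know", "hyp", "DC"),       # invalid label pairing
]
print(check_edge_pairing(edges))  # ['hyp']
```

Using `frozenset` makes the check order-insensitive while still rejecting duplicated labels (e.g. two DR premises), since a duplicated pair collapses to a singleton set not present in `VALID_PAIRS`.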
4. Evaluation Metrics and Protocols
ARCHE Bench utilizes two logic-aware metrics:
- Entity Coverage (EC): Quantifies the fraction of core gold-standard entities present in the valid nodes of a model’s RLT.
- Reasoning Edge Accuracy (REA): Measures the proportion of inference steps where the conclusion correctly follows from premises within the declared paradigm, validated by a three-model majority vote.
Structural violations (incorrect edge-pairing, multi-root graphs, etc.) lower the REA score, while omission of core entities reduces EC.
5. LLM Performance and Failure Modes
Ten leading LLMs were evaluated on ARCHE Bench under zero-shot conditions. Representative results:
| Model | REA (%) | EC (%) |
|---|---|---|
| Gemini-2.5-Pro | 39.5 | 56.7 |
| o3 | 35.6 | 60.5 |
| Grok-3 | 33.1 | 53.8 |
- Average EC: 51.4% (median 66.7%), indicating broad but incomplete entity capture.
- Average REA: 28.3% (median 25%), signifying low logical validity under strict inference paradigm constraints.
No model produced a fully complete and structurally valid reasoning chain. There is a Pareto trade-off frontier between EC and REA: models with high content coverage tended to underperform in logical rigor, and vice versa.
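The Pareto trade-off can be made concrete with the reported scores. The sketch below (illustrative, not from the paper) checks which of the tabulated models are Pareto-optimal, i.e. not beaten on both REA and EC by any other model:

```python
# Sketch: compute the Pareto frontier over (REA, EC) for the scores
# reported in the table above. A model is on the frontier if no other
# model matches or beats it on both metrics simultaneously.

scores = {
    "Gemini-2.5-Pro": (39.5, 56.7),  # (REA %, EC %)
    "o3":             (35.6, 60.5),
    "Grok-3":         (33.1, 53.8),
}

def pareto_front(scores):
    front = []
    for name, (rea, ec) in scores.items():
        dominated = any(
            r >= rea and e >= ec and (r, e) != (rea, ec)
            for other, (r, e) in scores.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(scores))  # ['Gemini-2.5-Pro', 'o3']
```

Among the three tabulated models, Gemini-2.5-Pro leads on REA and o3 leads on EC, so neither dominates the other, while Grok-3 is dominated on both metrics; this is the coverage-versus-validity frontier the benchmark exposes.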
Prominent failure modes include:
- Structural Violations: Breach of edge-pairing or root constraints.
- Format Errors: Malformed DOT syntax, incorrect coordinate formatting.
- Paradigm Misclassification: Mislabeled inference types.
- Logical Invalidity: Invalid conclusions despite correct labels.
- Omissions: Non-coverage of key entities or viewpoints.
6. Scientific and Methodological Significance
ARCHE Bench exposes a critical gap between the surface-level fluency of current LLMs and the structural rigor required for robust scientific reasoning. Models repeatedly violate paradigmatic constraints and produce incomplete or invalid logic trees, suggesting shallow extraction strategies not grounded in typological formalism. The revealed trade-off between coverage and validity highlights methodological limits in existing architectures and training protocols.
A plausible implication is that paradigm-awareness and structured supervision—potentially via pre-training or instruction-tuning that explicitly incorporates the deductive, inductive, and abductive paradigms—may be necessary for LLMs to meet the demands of rigorous scientific argumentation. The precise annotation scheme, evaluation metrics, and public reproducibility of ARCHE Bench establish a foundation for future research in paradigm-guided reasoning extraction, transparent scientific inference modeling, and structured argumentation in LLMs (Li et al., 16 Nov 2025).
7. Broader Impacts and Future Directions
ARCHE Bench provides a reproducible and extensible resource for assessing the paradigm-grounded reasoning capabilities of LLMs in scientific domains. Its strict formalism enables targeted improvement of model architectures, annotation guidelines, and reasoning chain extraction protocols. Further work may entail integration of additional domains, development of more sophisticated graph validation metrics, and exploration of advanced pre-training strategies.
The results suggest a pressing need for next-generation LLMs capable of automatically aligning their generated reasoning chains to explicit scientific paradigms, meeting the transparency, accuracy, and completeness standards demanded by scholarly argumentation. Future directions include paradigm-guided pre-training, fine-grained instruction tuning, and multi-model ensembles for robust logic validation.