SciReas: Scientific Reasoning Benchmark
- SciReas is a comprehensive benchmark suite that integrates ten existing public benchmarks to evaluate complex scientific reasoning in large language models.
- The SciReas-PRO subset isolates reasoning-intensive, multi-step tasks, sharpening performance differences between models, including when in-context knowledge is injected.
- The KRUX framework systematically separates knowledge recall from deductive reasoning; injecting extracted knowledge into context yields accuracy improvements on the order of 10% or more, even for reasoning-tuned models.
SciReas is a comprehensive scientific reasoning benchmark and probing framework explicitly designed to analyze the interplay between knowledge retrieval and reasoning in LLMs on complex, domain-intensive scientific tasks. Developed to address the lack of a unified evaluation for scientific problem solving, it consolidates a diverse range of existing benchmarks and introduces systematic methodologies for probing and separating knowledge recall from deductive reasoning. The suite not only enables reproducible, cross-domain assessment of scientific reasoning but also introduces new standards for the evaluation and training of robust scientific LLMs.
1. Benchmark Suite Design and Scope
The SciReas suite integrates ten major public benchmarks covering multiple scientific domains, including physics, chemistry, biology, medicine, materials science, mathematics, computer science, and engineering. It includes formats such as multiple-choice, fill-in-the-blank, structured response, and procedural/inference-driven questions. The primary curation criterion is that each instance must require both deep scientific knowledge and non-trivial reasoning (i.e., more than rote memorization).
Benchmarks comprising SciReas:
| Component | Example Benchmarks Included |
| --- | --- |
| Science reasoning | GPQA, MMLU-Pro, SuperGPQA, LabBench, OlympiadBench, SciBench, SciRIFF, UGPhysics |
| General science QA | SciEval, SciKnowEval |
This consolidation enables researchers to evaluate LLMs holistically, eliminating the fragmentation caused by the previous practice of evaluating each benchmark in isolation.
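As an illustration only, a consolidated suite of this kind can be organized as a registry of component benchmarks plus a single evaluation loop. The sketch below uses hypothetical loader and scorer callables rather than the authors' released tooling; the benchmark groupings follow the table above.

```python
# Minimal sketch of a consolidated benchmark registry and evaluation loop.
# The loader/scorer callables are hypothetical placeholders, not released code.
from typing import Callable, Dict, List

SCIREAS_COMPONENTS: Dict[str, List[str]] = {
    "science_reasoning": [
        "GPQA", "MMLU-Pro", "SuperGPQA", "LabBench",
        "OlympiadBench", "SciBench", "SciRIFF", "UGPhysics",
    ],
    "general_science_qa": ["SciEval", "SciKnowEval"],
}

def evaluate_suite(model: Callable[[str], str],
                   load_benchmark: Callable[[str], List[dict]],
                   score: Callable[[str, dict], float]) -> Dict[str, float]:
    """Run one model over every component benchmark and report mean accuracy."""
    results = {}
    for component, benchmarks in SCIREAS_COMPONENTS.items():
        scores = []
        for name in benchmarks:
            for item in load_benchmark(name):   # item: {"question": ..., "answer": ...}
                prediction = model(item["question"])
                scores.append(score(prediction, item))
        results[component] = sum(scores) / max(len(scores), 1)
    return results
```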
2. SciReas-PRO: Reasoning-Intensive Subset
To isolate more demanding scientific reasoning, SciReas-PRO is curated as a compact but challenging subset. Its construction is based on both manual and quantitative filtering:
- Selection is limited to cases where the solution requires multi-hop deduction and cannot be obtained by simple fact retrieval.
- Empirically, selection is guided by observed difficulty under variable reasoning budgets (e.g., increasing the model's chain-of-thought step limit): problems retained are those for which success still depends on extended multi-step reasoning, even when all relevant information is available in context.
- SciReas-PRO contains roughly 8% as many examples as the full suite but is shown to better stratify distinctions between weak and strong scientific reasoners.
This strict filtering aligns the subset with goals of benchmarking models’ true reasoning capacity rather than general knowledge memorization.
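A minimal sketch of the budget-based filtering idea described above follows; the budget values and the solve oracle are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of difficulty filtering under variable reasoning budgets: keep items
# that fail at a small chain-of-thought budget but become solvable at a large
# one, i.e. items whose solution genuinely needs extended multi-step reasoning.
# Budgets and the solve oracle are illustrative assumptions.
from typing import Callable, List

def select_reasoning_intensive(
    items: List[dict],
    solve: Callable[[dict, int], bool],   # solve(item, max_reasoning_tokens) -> correct?
    low_budget: int = 1024,
    high_budget: int = 16384,
) -> List[dict]:
    selected = []
    for item in items:
        easy = solve(item, low_budget)     # solvable with little deliberation
        hard = solve(item, high_budget)    # solvable with extended reasoning
        if not easy and hard:
            selected.append(item)          # retained: success depends on extended reasoning
    return selected
```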
3. KRUX: Probing Knowledge and Reasoning Separation
A central methodological contribution is the KRUX framework (“Knowledge & Reasoning Utilization eXams”), which operationalizes the separation of knowledge recall from reasoning capacity:
- Knowledge Ingredient Extraction: For each reasoning trace (chain-of-thought, CoT), KRUX algorithmically identifies “knowledge ingredients” (KIs)—atomic statements of fact, definition, or principle that are answer-agnostic.
- In-Context Injection: These KIs are injected into model prompts.
- Ablation/Comparison: Model performance is compared across three regimes: (i) solving with only model-internal (parametric) knowledge; (ii) solving with in-context KIs added; (iii) solving with both reasoning fine-tuning and in-context KIs.
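To make the first step above concrete, the following is a minimal sketch of prompting an LLM to enumerate answer-agnostic knowledge ingredients from a reasoning trace; the prompt wording and the call_llm helper are illustrative assumptions, not the KRUX implementation.

```python
# Sketch of knowledge-ingredient (KI) extraction from a chain-of-thought.
# `call_llm` is a hypothetical text-completion helper; the prompt wording is
# illustrative, not the exact KRUX extraction prompt.
from typing import Callable, List

KI_EXTRACTION_PROMPT = """Below is a model's reasoning trace for a science question.
List every atomic fact, definition, or principle the trace relies on.
Each item must stand alone and must NOT mention or imply the final answer.

Reasoning trace:
{cot}

Knowledge ingredients (one per line):"""

def extract_knowledge_ingredients(cot: str, call_llm: Callable[[str], str]) -> List[str]:
    response = call_llm(KI_EXTRACTION_PROMPT.format(cot=cot))
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]
```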
The probing procedure is formalized as three research questions:
- RQ1 (Base Model + KI): Does the base model’s performance improve when KIs are included in context?
- RQ2 (Reasoning-tuned Model + KI): Does a reasoning-fine-tuned model still benefit from extra KIs?
- RQ3 (KI Source Differentials): Does reasoning-fine-tuning change the utility or clarity of extracted KIs themselves?
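A compact sketch of the three-regime comparison appears below; the prompt format, model handles, and scoring helper are assumptions for illustration rather than the paper's evaluation harness.

```python
# Sketch of the KRUX ablation: compare accuracy without KIs, with KIs injected
# in context, and with KIs given to a reasoning-tuned model. Model handles and
# the scoring helper are hypothetical placeholders.
from typing import Callable, Dict, List

def with_kis(question: str, kis: List[str]) -> str:
    facts = "\n".join(f"- {k}" for k in kis)
    return f"Relevant facts:\n{facts}\n\nQuestion: {question}"

def krux_ablation(
    items: List[dict],                      # each: {"question", "answer", "kis"}
    base_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
    correct: Callable[[str, dict], bool],
) -> Dict[str, float]:
    regimes = {
        "base": lambda it: base_model(it["question"]),
        "base+KI": lambda it: base_model(with_kis(it["question"], it["kis"])),
        "reasoning+KI": lambda it: reasoning_model(with_kis(it["question"], it["kis"])),
    }
    return {
        name: sum(correct(run(it), it) for it in items) / len(items)
        for name, run in regimes.items()
    }
```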
Empirical tests show that even strong reasoning-tuned models benefit substantially from external KIs, with improvements on the order of 10% accuracy or more, indicating that bottlenecks in scientific problem-solving often stem from failures of knowledge retrieval rather than from inference per se.
4. Experimental Findings and Ablations
The systematic KRUX probing, together with SciReas(-PRO), yields several crucial findings:
- Knowledge Retrieval Bottleneck: In several subdomains, base models sometimes exceed reasoning-fine-tuned models on raw accuracy when provided with relevant KIs, indicating that access to the right knowledge is a primary failure mode.
- Reasoning Fine-Tuning & KI Synergy: Enhanced models (trained on math/STEM chains-of-thought) benefit more from KI injection and, further, produce higher-quality, more relevant KIs in their own reasoning traces.
- Verbalized Reasoning Enhancement: Fine-tuning on extended chains-of-thought (math-specialized or STEM-specific synthetic tasks, e.g., SYNTHETIC-1-Math and SYNTHETIC-1-STEM) boosts not only problem-solving but also the model's propensity to surface and articulate latent, parameter-stored facts crucial for task completion.
- Differentiation Capacity: The PRO subset proves effective at discriminating between model variants that are essentially indistinguishable on easier tasks, thus serving as a more granular diagnostic tool for advanced LLMs.
5. Comparison to Contemporary Approaches and Data Recipes
SciReas and its associated data recipes are contrasted with recent long chain-of-thought supervised fine-tuning efforts (e.g., General-Reasoner, Llama-Nemotron, OpenR):
- Rather than focusing solely on generic math or puzzle reasoning, the SciReas training recipe fuses abstract mathematical synthetic data with STEM-domain-specific chains-of-thought, enhancing real-world scientific reasoning transfer.
- Empirical comparisons (using Qwen-BOTH and SCILIT01) show that the Math+STEM-tuned 8B model matches or outperforms competing methods on standard scientific reasoning evaluations, despite its moderate parameter scale.
- This suggests that data composition (domain specificity) may be as critical as increased model size or token budget for progress on domain-specific scientific reasoning.
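For illustration, here is a minimal sketch of fusing abstract math and STEM chain-of-thought data into a single supervised fine-tuning mixture; the dataset variables and the 50/50 ratio are assumptions, not the released recipe.

```python
# Sketch of building a Math+STEM long chain-of-thought SFT mixture.
# The input lists and the 50/50 ratio are illustrative assumptions; the actual
# recipe and proportions come from the paper's released artifacts.
import random
from typing import List

def build_sft_mixture(math_cot: List[dict], stem_cot: List[dict],
                      total: int = 100_000, math_fraction: float = 0.5,
                      seed: int = 0) -> List[dict]:
    rng = random.Random(seed)
    n_math = int(total * math_fraction)
    mixture = (rng.sample(math_cot, min(n_math, len(math_cot)))
               + rng.sample(stem_cot, min(total - n_math, len(stem_cot))))
    rng.shuffle(mixture)
    return mixture   # each item: {"prompt": ..., "long_cot_response": ...}
```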
6. SCILIT01: Baseline Model for Scientific Reasoning
The Qwen3-8B-based SCILIT01 model is released as a reference scientific reasoning baseline:
- It is fine-tuned on the Math+STEM mixture using the SciReas data methodology.
- It achieves significant improvement over base Qwen3-8B in standard and PRO regimes, with additional gains possible via “thinking mode” (higher inference-time token budget).
- SCILIT01’s release serves as a reproducible starting point for further work on science-focused LLM evaluation, training, and probing.
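A minimal usage sketch for comparing standard and thinking-mode inference follows, assuming the released checkpoint inherits Qwen3's enable_thinking chat-template switch in Hugging Face transformers; the model identifier below is a placeholder, not a confirmed repository name.

```python
# Sketch of querying the released baseline with and without "thinking mode".
# Assumes the checkpoint inherits Qwen3's `enable_thinking` chat-template switch;
# the model identifier is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SCILIT01"  # placeholder; substitute the released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = "Why does increasing temperature usually increase reaction rate?"
messages = [{"role": "user", "content": question}]

for thinking, max_new in [(False, 1024), (True, 8192)]:   # larger budget for thinking mode
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=thinking,                          # Qwen3-style switch
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new)
    print(f"thinking={thinking}:",
          tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```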
7. Implications and Future Directions
SciReas and the KRUX methodology set new standards for the empirical and methodological analysis of scientific reasoning in LLMs:
- They demonstrate that deficits in scientific problem-solving are often rooted as much in knowledge retrieval as in the structure of reasoning.
- Future advances may be realized by targeting improved in-context knowledge identification, more targeted reasoning fine-tuning, or hybrid architectures that more efficiently combine parameterized and external knowledge.
- The open release of both benchmark and baseline models provides the infrastructure for reproducible, transparent, and cross-domain comparison in scientific LLM research.
These collective insights define a rigorous empirical and practical foundation for advancing automated scientific reasoning systems (Li et al., 26 Aug 2025).