- The paper introduces the SCIREAS benchmark suite and the KRUX probing framework, which disentangles knowledge retrieval from reasoning in scientific tasks.
- The authors show that increasing the reasoning budget yields substantial performance gains, with the gap between low and high reasoning-effort settings reaching 12.22 points for GPT-5 on SCIREAS-PRO.
- The study highlights that reasoning fine-tuning and external knowledge injection enhance model performance, informing robust training and deployment strategies.
Demystifying Scientific Problem-Solving in LLMs: Probing Knowledge and Reasoning
Introduction
This paper addresses the challenge of scientific problem-solving in LLMs, focusing on the interplay between domain knowledge and complex reasoning. The authors introduce SCIREAS, a unified benchmark suite for scientific reasoning, and SCIREAS-PRO, a reasoning-intensive subset. They further propose KRUX, a probing framework to disentangle the roles of knowledge and reasoning in scientific tasks. Through controlled experiments and analytic frameworks, the paper provides empirical insights into the bottlenecks and synergies of knowledge retrieval and reasoning in LLMs, with implications for model training, evaluation, and deployment in scientific domains.
Benchmark Construction: SCIREAS and SCIREAS-PRO
The fragmentation of existing scientific benchmarks, each narrowly focused on specific domains or formats, motivates the construction of SCIREAS. SCIREAS merges ten prominent benchmarks (GPQA, MMLU-Pro, LabBench, OlympiadBench, SciBench, SciRIFF, UGPhysics, SciEval, SciKnowEval, and SuperGPQA) under a standardized evaluation harness, covering physics, chemistry, biology, medicine, materials science, mathematics, computer science, and engineering. The suite includes multiple-choice, fill-in-the-blank, structured, and procedural questions, with manual curation to ensure each instance requires both deep domain knowledge and multi-step reasoning.
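The paper's harness interface is not reproduced here, but a minimal sketch of the kind of shared record such a harness implies might look like the following; all field and key names are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SciReasInstance:
    """One normalized evaluation item. Field names are illustrative, not the paper's schema."""
    benchmark: str                  # source benchmark, e.g. "GPQA" or "OlympiadBench"
    domain: str                     # e.g. "physics", "chemistry", "medicine"
    fmt: str                        # "multiple_choice", "fill_in_blank", "structured", "procedural"
    question: str
    answer: str
    choices: Optional[List[str]] = None  # populated only for multiple-choice items

def normalize_gpqa(raw: dict) -> SciReasInstance:
    """Map one raw GPQA-style record into the shared schema (key names hypothetical)."""
    return SciReasInstance(
        benchmark="GPQA",
        domain=raw.get("subject", "unknown"),
        fmt="multiple_choice",
        question=raw["question"],
        answer=raw["correct_answer"],
        choices=raw["options"],
    )
```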
SCIREAS-PRO is derived by filtering for instances that are answered correctly only under high inference-time compute (i.e., a larger budget of "thinking tokens"), isolating tasks that demand genuine reasoning beyond knowledge recall. The selection is validated through cross-model agreement and human and LLM-as-judge assessments, confirming that SCIREAS-PRO instances are reasoning-intensive.
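As a rough illustration of this selection rule (not the paper's exact procedure), the filter can be read as: keep an item only if several models solve it with a high thinking-token budget but fail with a low one. The `model.answer` interface, budget values, and agreement threshold below are assumptions.

```python
def is_reasoning_intensive(item, models, low_budget=1024, high_budget=8192, min_agree=2):
    """Keep an item if at least `min_agree` models solve it only under the high
    thinking-token budget. Budgets, threshold, and the model interface are illustrative."""
    votes = 0
    for model in models:
        correct_low = model.answer(item.question, max_thinking_tokens=low_budget) == item.answer
        correct_high = model.answer(item.question, max_thinking_tokens=high_budget) == item.answer
        if correct_high and not correct_low:
            votes += 1
    return votes >= min_agree

# scireas_pro = [item for item in scireas if is_reasoning_intensive(item, frontier_models)]
```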
Model Evaluation and Reasoning Budget Analysis
Frontier models (e.g., OpenAI o-series, DeepSeek-R1, Gemini-2.5-Pro, Claude-Sonnet-4, Qwen3-32B, Llama-4-Maverick) are evaluated on SCIREAS and SCIREAS-PRO under varying reasoning-effort settings. The results demonstrate that increasing the reasoning budget (i.e., allowing more intermediate reasoning steps) yields substantial performance gains, with amplified gaps on SCIREAS-PRO. For example, GPT-5's gap between low and high reasoning-effort settings widens from 3.01 points on SCIREAS to 12.22 points on SCIREAS-PRO. This finding underscores the importance of reasoning capacity and test-time compute in scientific problem-solving.
Benchmark correlation analysis reveals that individual benchmarks are not highly correlated, especially between multiple-choice and free-form QA formats, justifying the need for a holistic evaluation suite. Models tuned for specific tasks may outperform higher-ranked models on those benchmarks, but SCIREAS provides a more comprehensive assessment of scientific reasoning capabilities.
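A correlation analysis of this kind can be reproduced from a table of per-model benchmark scores; the sketch below uses Spearman rank correlation over an invented score table, so the numbers are placeholders rather than the paper's results.

```python
import itertools
from scipy.stats import spearmanr

# Hypothetical per-model accuracies: each list holds one score per model, in the same model order.
scores = {
    "GPQA":          [0.52, 0.61, 0.48, 0.70],
    "MMLU-Pro":      [0.63, 0.66, 0.58, 0.74],
    "OlympiadBench": [0.31, 0.45, 0.28, 0.55],
}

# Pairwise rank correlations between benchmarks over the same set of models.
for a, b in itertools.combinations(scores, 2):
    rho, _ = spearmanr(scores[a], scores[b])
    print(f"{a} vs {b}: Spearman rho = {rho:.2f}")
```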
Disentangling Knowledge and Reasoning: The KRUX Framework
KRUX is introduced as a probing framework to isolate the effects of knowledge and reasoning. The pipeline extracts atomic, answer-agnostic knowledge ingredients (KIs) from reasoning traces of strong models (e.g., DeepSeek-R1) and supplies them in-context to target models. This controlled setting enables the study of three key research questions:
- Knowledge Retrieval Bottleneck: Base instruct models, when provided with high-quality KIs, can outperform reasoning-fine-tuned models by ≥10%, indicating that internalizing and retrieving task-relevant knowledge is a critical bottleneck for scientific reasoning.
- Complementary Gains from External Knowledge: Reasoning-enhanced models also benefit from in-context KIs, achieving additional improvements over their base performance. This suggests that explicit access to external knowledge complements reasoning capabilities.
- Reasoning Fine-Tuning Improves Knowledge Surfacing: KIs extracted from reasoning-fine-tuned models (e.g., math-only fine-tuning) enable greater performance boosts for base models than KIs from base models themselves, even when no new domain knowledge is introduced. This demonstrates that reasoning fine-tuning enhances the model's ability to surface and utilize latent knowledge.
Empirical results show that base models with DeepSeek-R1 KIs outperform both base and reasoning-fine-tuned models without KIs by ≥20% on GPQA and LabBench*, and reasoning models with R1 KIs further improve over base models with R1 KIs. Synthetic probing confirms that the improvements are due to better knowledge surfacing rather than new knowledge acquisition.
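A minimal sketch of the extract-and-inject loop follows, under the assumption of simple text-generation interfaces; `solve_with_trace` and `generate` are placeholder methods, and the prompt wording is illustrative rather than the paper's template.

```python
def extract_knowledge_ingredients(strong_model, question):
    """Distill a strong model's reasoning trace into atomic, answer-agnostic facts.
    Prompt wording and model interface are illustrative, not the paper's exact setup."""
    trace = strong_model.solve_with_trace(question)
    prompt = (
        "List the standalone scientific facts used in the reasoning below, one per line, "
        "without stating or implying the final answer.\n\n" + trace
    )
    return [ki.strip() for ki in strong_model.generate(prompt).splitlines() if ki.strip()]

def answer_with_kis(target_model, question, kis):
    """Prepend the extracted knowledge ingredients to the target model's prompt."""
    context = "Relevant facts:\n" + "\n".join(f"- {ki}" for ki in kis)
    return target_model.generate(context + "\n\nQuestion: " + question)
```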
Training and Data Composition for Scientific Reasoning
The paper compares various post-training recipes for scientific reasoning, including fine-tuning on SYNTHETIC-1-Math, SYNTHETIC-1-STEM, and their combination. Models trained on both math and STEM reasoning traces (Qwen-BOTH, Llama-BOTH) achieve the strongest performance on SCIREAS and SCIREAS-PRO, outperforming concurrent recipes (e.g., SYNTHETIC-1-SFT, Llama-Nemotron, General-Reasoner). The release of SCILIT01, a Qwen3-8B-Base model fine-tuned with the Math+STEM mixture, provides a strong open-source baseline for scientific reasoning.
Analysis of math vs. non-math instances in SCIREAS-PRO reveals that math reasoning fine-tuning primarily improves performance on math-intensive tasks, while STEM fine-tuning benefits non-math scientific domains. This highlights the importance of data composition in post-training for domain-specific reasoning.
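A hypothetical sketch of assembling the Math+STEM mixture is shown below; the dataset names follow the paper, but the loader and the decision to simply concatenate and shuffle the two sources are assumptions.

```python
import random

def build_math_stem_mixture(math_traces, stem_traces, seed=0):
    """Combine math and STEM reasoning traces into one SFT corpus and shuffle.
    Whether the paper balances the two sources or concatenates them is not specified here."""
    mixture = list(math_traces) + list(stem_traces)
    random.Random(seed).shuffle(mixture)
    return mixture

# e.g. sft_data = build_math_stem_mixture(load("SYNTHETIC-1-Math"), load("SYNTHETIC-1-STEM"))
# where `load` is a placeholder for whatever dataset loader is in use.
```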
Implementation Considerations and Limitations
The KRUX framework relies on strong reasoning models for KI extraction, with manual validation to ensure that extracted KIs are relevant and answer-agnostic. Context sensitivity and search-space constraints are addressed through random permutations and controlled experiments. The paper focuses on moderate-sized open-weight models (<10B parameters), limiting generalizability to larger models. The benchmarks emphasize STEM fields, potentially underrepresenting interdisciplinary research. Data contamination and context effects are mitigated by focusing on recent datasets and analytical protocols.
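The permutation check for context sensitivity could look roughly like the following, reusing the hypothetical `answer_with_kis` helper sketched earlier; the number of permutations and the exact protocol are assumptions.

```python
import random

def permutation_robust_accuracy(target_model, items, kis_per_item, n_perms=3, seed=0):
    """Average accuracy over several random orderings of the injected knowledge
    ingredients, so gains are not an artifact of one particular KI order."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_perms):
        correct = 0
        for item in items:
            kis = list(kis_per_item[item.question])  # KIs keyed by question (illustrative)
            rng.shuffle(kis)
            if answer_with_kis(target_model, item.question, kis) == item.answer:
                correct += 1
        accuracies.append(correct / len(items))
    return sum(accuracies) / len(accuracies)
```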
Implications and Future Directions
The findings have several practical and theoretical implications:
- Model Training: Reasoning fine-tuning not only improves deductive capabilities but also enhances knowledge recall and utilization. Data composition (Math+STEM) is critical for robust scientific reasoning.
- Evaluation: Holistic benchmarks like SCIREAS are necessary for fair and comprehensive assessment of scientific reasoning in LLMs. Task-specific evaluation is recommended for optimal cost-performance trade-offs.
- Deployment: External knowledge augmentation (e.g., retrieval-augmented generation, KI injection) can substantially improve performance, especially for base models with limited parametric knowledge.
- Research: The disentanglement of knowledge and reasoning opens avenues for modular architectures, explicit memory modules, and adaptive reasoning strategies in LLMs.
Future work should extend the analysis to larger models, interdisciplinary domains, and real-world scientific workflows. The integration of retrieval systems, external knowledge bases, and dynamic reasoning modules may further enhance scientific problem-solving in LLMs.
Conclusion
This paper provides a systematic framework for evaluating and improving scientific reasoning in LLMs by disentangling the roles of knowledge and reasoning. The introduction of SCIREAS, SCIREAS-PRO, and KRUX enables controlled, reproducible analysis across domains and formats. The empirical findings demonstrate that knowledge retrieval is a key bottleneck, external knowledge consistently benefits reasoning models, and reasoning fine-tuning improves knowledge surfacing. These insights inform the design, training, and deployment of LLMs for scientific applications, with implications for future research in AI-driven scientific discovery.