- The paper introduces the SCIREAS benchmark suite and the KRUX probing framework, which disentangles knowledge retrieval from reasoning in scientific tasks.
- The authors show that increasing the reasoning budget yields substantial performance gains, with the gap between low and high reasoning-effort settings reaching 12.22 points for GPT-5 on SCIREAS-PRO.
- The study highlights that reasoning fine-tuning and external knowledge injection enhance model performance, informing robust training and deployment strategies.
Demystifying Scientific Problem-Solving in LLMs: Probing Knowledge and Reasoning
Introduction
This paper addresses the challenge of scientific problem-solving in LLMs, focusing on the interplay between domain knowledge and complex reasoning. The authors introduce SCIREAS, a unified benchmark suite for scientific reasoning, and SCIREAS-PRO, a reasoning-intensive subset. They further propose KRUX, a probing framework to disentangle the roles of knowledge and reasoning in scientific tasks. Through controlled experiments and analytic frameworks, the paper provides empirical insights into the bottlenecks and synergies of knowledge retrieval and reasoning in LLMs, with implications for model training, evaluation, and deployment in scientific domains.
Benchmark Construction: SCIREAS and SCIREAS-PRO
The fragmentation of existing scientific benchmarks, each narrowly focused on specific domains or formats, motivates the construction of SCIREAS. SCIREAS merges ten prominent benchmarks (GPQA, MMLU-Pro, LabBench, OlympiadBench, SciBench, SciRIFF, UGPhysics, SciEval, SciKnowEval, and SuperGPQA) under a standardized evaluation harness, covering physics, chemistry, biology, medicine, materials science, mathematics, computer science, and engineering. The suite includes multiple-choice, fill-in-the-blank, structured, and procedural questions, with manual curation to ensure each instance requires both deep domain knowledge and multi-step reasoning.
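The paper's harness interface is not reproduced here, but a minimal sketch of the kind of shared record such a harness implies might look like the following; all field and key names are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SciReasInstance:
    """One normalized evaluation item. Field names are illustrative, not the paper's schema."""
    benchmark: str                  # source benchmark, e.g. "GPQA" or "OlympiadBench"
    domain: str                     # e.g. "physics", "chemistry", "medicine"
    fmt: str                        # "multiple_choice", "fill_in_blank", "structured", "procedural"
    question: str
    answer: str
    choices: Optional[List[str]] = None  # populated only for multiple-choice items

def normalize_gpqa(raw: dict) -> SciReasInstance:
    """Map one raw GPQA-style record into the shared schema (key names hypothetical)."""
    return SciReasInstance(
        benchmark="GPQA",
        domain=raw.get("subject", "unknown"),
        fmt="multiple_choice",
        question=raw["question"],
        answer=raw["correct_answer"],
        choices=raw["options"],
    )
```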
SCIREAS-PRO is derived by filtering for instances that are answered correctly only under high inference-time compute (i.e., a larger budget of "thinking tokens"), isolating tasks that demand genuine reasoning beyond knowledge recall. The selection is validated through cross-model agreement and human and LLM-as-judge assessments, confirming that SCIREAS-PRO instances are reasoning-intensive.
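As a rough illustration of this selection rule (not the paper's exact procedure), the filter can be read as: keep an item only if several models solve it with a high thinking-token budget but fail with a low one. The `model.answer` interface, budget values, and agreement threshold below are assumptions.

```python
def is_reasoning_intensive(item, models, low_budget=1024, high_budget=8192, min_agree=2):
    """Keep an item if at least `min_agree` models solve it only under the high
    thinking-token budget. Budgets, threshold, and the model interface are illustrative."""
    votes = 0
    for model in models:
        correct_low = model.answer(item.question, max_thinking_tokens=low_budget) == item.answer
        correct_high = model.answer(item.question, max_thinking_tokens=high_budget) == item.answer
        if correct_high and not correct_low:
            votes += 1
    return votes >= min_agree

# scireas_pro = [item for item in scireas if is_reasoning_intensive(item, frontier_models)]
```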
Model Evaluation and Reasoning Budget Analysis
Frontier models (e.g., OpenAI o-series, DeepSeek-R1, Gemini-2.5-Pro, Claude-Sonnet-4, Qwen3-32B, Llama-4-Maverick) are evaluated on SCIREAS and SCIREAS-PRO under varying reasoning-effort settings. The results demonstrate that increasing the reasoning budget (i.e., allowing more intermediate reasoning steps) yields substantial performance gains, with amplified gaps on SCIREAS-PRO. For example, GPT-5's gap between low and high reasoning-effort settings widens from 3.01 points on SCIREAS to 12.22 points on SCIREAS-PRO. This finding underscores the importance of reasoning capacity and test-time compute in scientific problem-solving.
Benchmark correlation analysis reveals that individual benchmarks are not highly correlated, especially between multiple-choice and free-form QA formats, justifying the need for a holistic evaluation suite. Models tuned for specific tasks may outperform higher-ranked models on those benchmarks, but SCIREAS provides a more comprehensive assessment of scientific reasoning capabilities.
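A correlation analysis of this kind can be reproduced from a table of per-model benchmark scores; the sketch below uses Spearman rank correlation over an invented score table, so the numbers are placeholders rather than the paper's results.

```python
import itertools
from scipy.stats import spearmanr

# Hypothetical per-model accuracies: each list holds one score per model, in the same model order.
scores = {
    "GPQA":          [0.52, 0.61, 0.48, 0.70],
    "MMLU-Pro":      [0.63, 0.66, 0.58, 0.74],
    "OlympiadBench": [0.31, 0.45, 0.28, 0.55],
}

# Pairwise rank correlations between benchmarks over the same set of models.
for a, b in itertools.combinations(scores, 2):
    rho, _ = spearmanr(scores[a], scores[b])
    print(f"{a} vs {b}: Spearman rho = {rho:.2f}")
```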
Disentangling Knowledge and Reasoning: The KRUX Framework
KRUX is introduced as a probing framework to isolate the effects of knowledge and reasoning. The pipeline extracts atomic, answer-agnostic knowledge ingredients (KIs) from reasoning traces of strong models (e.g., DeepSeek-R1) and supplies them in-context to target models. This controlled setting enables the study of three key research questions:
- Knowledge Retrieval Bottleneck: Base instruct models, when provided with high-quality KIs, can outperform reasoning-fine-tuned models by ≥10%, indicating that internalizing and retrieving task-relevant knowledge is a critical bottleneck for scientific reasoning.
- Complementary Gains from External Knowledge: Reasoning-enhanced models also benefit from in-context KIs, achieving additional improvements over their base performance. This suggests that explicit access to external knowledge complements reasoning capabilities.
- Reasoning Fine-Tuning Improves Knowledge Surfacing: KIs extracted from reasoning-fine-tuned models (e.g., math-only fine-tuning) enable greater performance boosts for base models than KIs from base models themselves, even when no new domain knowledge is introduced. This demonstrates that reasoning fine-tuning enhances the model's ability to surface and utilize latent knowledge.
Empirical results show that base models with DeepSeek-R1 KIs outperform both base and reasoning-fine-tuned models without KIs by ≥20% on GPQA and LabBench*, and reasoning models with R1 KIs further improve over base models with R1 KIs. Synthetic probing confirms that the improvements are due to better knowledge surfacing rather than new knowledge acquisition.
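A minimal sketch of the extract-and-inject loop follows, under the assumption of simple text-generation interfaces; `solve_with_trace` and `generate` are placeholder methods, and the prompt wording is illustrative rather than the paper's template.

```python
def extract_knowledge_ingredients(strong_model, question):
    """Distill a strong model's reasoning trace into atomic, answer-agnostic facts.
    Prompt wording and model interface are illustrative, not the paper's exact setup."""
    trace = strong_model.solve_with_trace(question)
    prompt = (
        "List the standalone scientific facts used in the reasoning below, one per line, "
        "without stating or implying the final answer.\n\n" + trace
    )
    return [ki.strip() for ki in strong_model.generate(prompt).splitlines() if ki.strip()]

def answer_with_kis(target_model, question, kis):
    """Prepend the extracted knowledge ingredients to the target model's prompt."""
    context = "Relevant facts:\n" + "\n".join(f"- {ki}" for ki in kis)
    return target_model.generate(context + "\n\nQuestion: " + question)
```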
Training and Data Composition for Scientific Reasoning
The paper compares various post-training recipes for scientific reasoning, including fine-tuning on SYNTHETIC-1-Math, SYNTHETIC-1-STEM, and their combination. Models trained on both math and STEM reasoning traces (Qwen-BOTH, Llama-BOTH) achieve the strongest performance on SCIREAS and SCIREAS-PRO, outperforming concurrent recipes (e.g., SYNTHETIC-1-SFT, Llama-Nemotron, General-Reasoner). The release of SCILIT01, a Qwen3-8B-Base model fine-tuned with the Math+STEM mixture, provides a strong open-source baseline for scientific reasoning.
Analysis of math vs. non-math instances in SCIREAS-PRO reveals that math reasoning fine-tuning primarily improves performance on math-intensive tasks, while STEM fine-tuning benefits non-math scientific domains. This highlights the importance of data composition in post-training for domain-specific reasoning.
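A hypothetical sketch of assembling the Math+STEM mixture is shown below; the dataset names follow the paper, but the loader and the decision to simply concatenate and shuffle the two sources are assumptions.

```python
import random

def build_math_stem_mixture(math_traces, stem_traces, seed=0):
    """Combine math and STEM reasoning traces into one SFT corpus and shuffle.
    Whether the paper balances the two sources or concatenates them is not specified here."""
    mixture = list(math_traces) + list(stem_traces)
    random.Random(seed).shuffle(mixture)
    return mixture

# e.g. sft_data = build_math_stem_mixture(load("SYNTHETIC-1-Math"), load("SYNTHETIC-1-STEM"))
# where `load` is a placeholder for whatever dataset loader is in use.
```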
Implementation Considerations and Limitations
The KRUX framework relies on strong reasoning models for KI extraction, with manual validation to ensure that extracted KIs are relevant and answer-agnostic. Context sensitivity and search-space constraints are addressed through random permutations and controlled experiments. The paper focuses on moderate-sized open-weight models (<10B parameters), limiting generalizability to larger models. The benchmarks emphasize STEM fields, potentially underrepresenting interdisciplinary research. Data contamination and context effects are mitigated by focusing on recent datasets and analytical protocols.
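The permutation check for context sensitivity could look roughly like the following, reusing the hypothetical `answer_with_kis` helper sketched earlier; the number of permutations and the exact protocol are assumptions.

```python
import random

def permutation_robust_accuracy(target_model, items, kis_per_item, n_perms=3, seed=0):
    """Average accuracy over several random orderings of the injected knowledge
    ingredients, so gains are not an artifact of one particular KI order."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_perms):
        correct = 0
        for item in items:
            kis = list(kis_per_item[item.question])  # KIs keyed by question (illustrative)
            rng.shuffle(kis)
            if answer_with_kis(target_model, item.question, kis) == item.answer:
                correct += 1
        accuracies.append(correct / len(items))
    return sum(accuracies) / len(accuracies)
```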
Implications and Future Directions
The findings have several practical and theoretical implications:
- Model Training: Reasoning fine-tuning not only improves deductive capabilities but also enhances knowledge recall and utilization. Data composition (Math+STEM) is critical for robust scientific reasoning.
- Evaluation: Holistic benchmarks like SCIREAS are necessary for fair and comprehensive assessment of scientific reasoning in LLMs. Task-specific evaluation is recommended for optimal cost-performance trade-offs.
- Deployment: External knowledge augmentation (e.g., retrieval-augmented generation, KI injection) can substantially improve performance, especially for base models with limited parametric knowledge.
- Research: The disentanglement of knowledge and reasoning opens avenues for modular architectures, explicit memory modules, and adaptive reasoning strategies in LLMs.
Future work should extend the analysis to larger models, interdisciplinary domains, and real-world scientific workflows. The integration of retrieval systems, external knowledge bases, and dynamic reasoning modules may further enhance scientific problem-solving in LLMs.
Conclusion
This paper provides a systematic framework for evaluating and improving scientific reasoning in LLMs by disentangling the roles of knowledge and reasoning. The introduction of SCIREAS, SCIREAS-PRO, and KRUX enables controlled, reproducible analysis across domains and formats. The empirical findings demonstrate that knowledge retrieval is a key bottleneck, external knowledge consistently benefits reasoning models, and reasoning fine-tuning improves knowledge surfacing. These insights inform the design, training, and deployment of LLMs for scientific applications, with implications for future research in AI-driven scientific discovery.