
Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains (2506.02126v1)

Published 2 Jun 2025 in cs.CL

Abstract: Recent advances in reasoning-enhanced LLMs such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond the final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models; in the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.

Summary

  • The paper presents an evaluation framework that decomposes LLM reasoning trajectories into discrete steps and scores them along two axes: knowledge correctness (Knowledge Index) and reasoning quality (InfoGain).
  • The study finds that supervised fine-tuning boosts final-answer accuracy but often degrades reasoning quality, while reinforcement learning improves medical reasoning by pruning incorrect or irrelevant knowledge.
  • The findings imply that tailoring training methods to domain-specific needs enhances model reliability and interpretability in high-stakes applications like medicine.

Analysis of Reasoning and Knowledge in LLMs Across Domains

The paper "Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains" explores the nuanced interplay between the knowledge and reasoning abilities of LLMs. The authors focus on decomposing the reasoning processes of LLMs into discrete stages, aiming to better understand how these models function when faced with tasks requiring different types of cognitive effort, particularly within the medical and mathematical domains. This paper moves beyond simply assessing final-answer accuracy and endeavors to provide a detailed evaluation framework to scrutinize the step-by-step reasoning employed by these models.

Evaluation Framework

The paper introduces a novel evaluation framework built on two metrics: the Knowledge Index (KI) and Information Gain (InfoGain). KI measures the factual correctness of the knowledge invoked at each reasoning step, verified against factual sources. InfoGain measures how much each step contributes toward reaching the final answer, quantified as the step-to-step reduction in the model's uncertainty about the ground-truth answer.
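
The paper's exact formulations may differ in detail; the following is a minimal Python sketch of one way to operationalize the two metrics, assuming access to per-step answer log-likelihoods from the model and per-statement correctness verdicts from an external judge. The function names and toy numbers are illustrative, not the authors' implementation.

```python
def info_gain(logprob_before: float, logprob_after: float) -> float:
    """Per-step InfoGain: the drop in uncertainty about the ground-truth
    answer, where uncertainty is the negative log-likelihood of the answer
    given the reasoning steps produced so far. Positive values mean the
    step moved the model closer to the answer."""
    return (-logprob_before) - (-logprob_after)


def knowledge_index(verdicts: list[bool]) -> float:
    """KI: the fraction of knowledge statements in a trajectory that an
    external judge marked as factually correct."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Toy trajectory: log P(answer | question, steps so far) after 0..3 steps.
answer_logprobs = [-2.3, -1.6, -1.7, -0.4]
gains = [info_gain(a, b) for a, b in zip(answer_logprobs, answer_logprobs[1:])]
# gains ~ [0.7, -0.1, 1.3]: the second step was uninformative (negative gain).

judge_verdicts = [True, False, True]  # one verdict per knowledge statement
ki = knowledge_index(judge_verdicts)  # ~0.67: two thirds of the knowledge was correct
```

Under this framing, a trajectory can score high on KI (correct facts) yet low on InfoGain (steps that do not advance toward the answer), which is exactly the decoupling of knowledge and reasoning the paper exploits.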

Model Comparisons and Findings

The authors conduct experiments with two models: Qwen-Base (Qwen2.5-7B) and its DeepSeek-R1-distilled counterpart (R1-distilled). Both are evaluated after training with supervised fine-tuning (SFT), reinforcement learning (RL), or a combination of the two. Several critical observations arise from these comparisons:

  1. Domain Adaptability: The results demonstrate that general reasoning abilities in R1-distilled models do not effectively transfer to the medical domain under either SFT or RL. The Qwen-Base model, when fine-tuned in the medical domain, generally outperformed the R1-distilled variant, suggesting that reasoning-oriented distillation can interfere with subsequent domain-specific training.
  2. Impact of SFT and RL: The paper finds that SFT helps boost final-answer accuracy but negatively impacts reasoning quality, with InfoGain dropping by 38.9% relative to untrained models. RL, on the other hand, enhances medical reasoning by improving both reasoning accuracy and knowledge correctness, showcasing its ability to prune inaccurate or irrelevant information from the reasoning process.
  3. Knowledge vs. Reasoning: The analysis indicates that knowledge plays a more pivotal role than reasoning in medical tasks, whereas reasoning matters more in mathematical ones. This is evidenced by KI correlating more strongly with accuracy than InfoGain does in the medical domain (a sketch of this comparison follows the list).
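
The comparison behind point (3) can be made concrete with Pearson correlations over per-question scores. A minimal sketch follows; the arrays are hypothetical placeholders, not the paper's data.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical per-question scores; placeholders, not the paper's results.
ki_scores   = [0.9, 0.4, 0.8, 0.3, 0.7]  # Knowledge Index per question
gain_scores = [0.5, 0.6, 0.4, 0.5, 0.6]  # mean InfoGain per question
accuracy    = [1.0, 0.0, 1.0, 0.0, 1.0]  # final-answer correctness (0/1)

# The medical-domain finding corresponds to corr(KI, accuracy) exceeding
# corr(InfoGain, accuracy): having the right facts predicts success better
# than how informative the individual steps are.
print(correlation(ki_scores, accuracy))    # strongly positive in this toy data
print(correlation(gain_scores, accuracy))  # weakly negative in this toy data
```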

Implications and Future Prospects

The findings from this paper have significant implications for how future AI models might be trained and evaluated. Given the distinct demands of different domains, LLM training strategies could be customized to prioritize either knowledge acquisition or reasoning enhancement as needed.

Furthermore, as the research advocates for the separation of knowledge and reasoning evaluations, it paves the way for more transparent and targeted approaches to improving LLM performance in specific tasks. Such insights could lead to more effective domain-specific LLMs that are more reliable and interpretable, especially in high-stakes applications like medicine.

Conclusion

This paper contributes substantively to the understanding of LLM reasoning processes and the interaction between knowledge and reasoning across domains. By offering a detailed evaluation framework and highlighting the differences in domain requirements, it equips researchers and practitioners with new tools and perspectives for advancing the development and application of LLMs. The proposed framework holds promise for extending these insights into other complex domains where structured reasoning is critical.