KRUX: Probing Knowledge and Reasoning in LLMs
- KRUX framework is a methodology that isolates and quantifies the impact of internal knowledge retrieval versus explicit reasoning in large language models.
- It uses expert-extracted atomic knowledge ingredients to augment prompts, enabling controlled experiments with benchmarks like SciReas and SciReas-Pro.
- Empirical results show that injecting external knowledge can boost LLM performance by 10–20 percentage points, indicating that knowledge retrieval is a critical bottleneck.
KRUX is a probing framework designed to disentangle the contributions of internal knowledge and explicit reasoning when LLMs solve scientific problem-solving tasks. Developed as part of a systematic study of scientific reasoning in LLMs, KRUX provides a methodology for isolating and measuring the performance improvements attributable to the provision of in-context, atomic “knowledge ingredients” (KIs) extracted from expert chain-of-thought traces. The framework is employed in conjunction with the SciReas and SciReas-Pro benchmark suites to characterize the respective bottlenecks and interactions of knowledge recall and reasoning in automated scientific reasoning systems (Li et al., 26 Aug 2025).
1. Motivation and Conceptual Framework
KRUX was introduced to address the absence of holistic evaluation paradigms able to separate the effects of knowledge retrieval and logical reasoning in scientific LLMs. The underlying motivation is to quantify how much of model performance is due to successful retrieval of correct, task-relevant facts versus execution of correct reasoning chains. The approach is anchored on three central objectives:
- Isolation of Knowledge and Reasoning: By fixing the knowledge state accessible to the model (via extracted KIs) and varying only the reasoning, or vice versa, KRUX facilitates controlled experiments on their individual and joint effects.
- Atomic Knowledge Ingredient Extraction: KRUX extracts discrete, answer-agnostic factual elements—definitions, mechanisms, relationships—from expert-generated chains of thought. These KIs serve as contextually relevant priors for the model, disambiguating knowledge-dependent failures from reasoning-dependent failures.
- Injection and Probing Pipeline: Injected KIs act as explicit context in the model prompt, permitting comparison between base model performance, self-extracted knowledge performance, and performance with external high-quality KIs.
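The three prompting conditions can be made concrete with a minimal sketch. The prompt layout and field labels below are illustrative assumptions for exposition, not KRUX's exact templates:

```python
# Sketch of the probing conditions KRUX compares: a base prompt versus
# a prompt with knowledge ingredients (KIs) prepended as explicit context.

def build_prompt(question, knowledge_ingredients=None):
    """Prepend optional atomic knowledge ingredients to a question."""
    if not knowledge_ingredients:
        return f"Question: {question}\nAnswer:"
    ki_block = "\n".join(f"- {ki}" for ki in knowledge_ingredients)
    return (
        "Relevant facts:\n"
        f"{ki_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

question = "Which quantum number determines orbital shape?"
expert_kis = ["The azimuthal quantum number l determines orbital shape."]

base_prompt = build_prompt(question)                   # base model condition
augmented_prompt = build_prompt(question, expert_kis)  # expert-KI condition
```

The same constructor serves the self-extracted-KI condition by passing in ingredients the target model itself produced.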
2. Probing Methodology and Experimental Pipeline
The KRUX pipeline comprises four distinct steps:
- Scientific Query and Reasoning Trace Generation: A question sampled from SciReas or SciReas-Pro is provided to the model, eliciting both an answer and a detailed chain-of-thought reasoning trace.
- Expert Knowledge Ingredient Extraction: An “ingredient extractor”—typically a high-performance reasoning model such as DeepSeek-R1—processes the reasoning trace, identifying and isolating relevant knowledge ingredients.
- KI-Augmented Prompt Construction: The original question is reformulated with extracted KIs prepended, providing explicit knowledge context to the target model.
- Controlled Re-Response and Evaluation: The model is then tasked with answering the KI-augmented query, and statistical analysis is performed comparing the resulting performance to alternative setups: base prompt alone, base prompt with self-extracted KIs, and base prompt with expert-extracted KIs.
This structured probing protocol allows for systematic assessment of performance gains attributable to external knowledge provisioning versus internal retrieval and reasoning capability.
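The four steps above can be sketched as a single probing loop. The `model` and `extractor` callables stand in for LLM calls (the target model and a DeepSeek-R1-style ingredient extractor); their interfaces here are assumptions, not the paper's implementation:

```python
# Illustrative sketch of the four-step KRUX probing pipeline.

def krux_probe(question, model, extractor):
    """model: prompt -> response; extractor: reasoning trace -> list of KIs."""
    # Step 1: elicit an answer plus chain-of-thought from the target model.
    trace = model(f"{question}\nThink step by step, then answer.")
    # Step 2: extract answer-agnostic knowledge ingredients from the trace.
    kis = extractor(trace)
    # Step 3: build the KI-augmented prompt with ingredients prepended.
    ki_block = "\n".join(f"- {ki}" for ki in kis)
    augmented = f"Relevant facts:\n{ki_block}\n\n{question}"
    # Step 4: re-answer under the augmented prompt for later comparison.
    return {"base_response": trace, "kis": kis, "ki_response": model(augmented)}

# Toy stand-ins so the sketch runs end to end.
toy_model = lambda prompt: f"[answer given prompt of {len(prompt)} chars]"
toy_extractor = lambda trace: ["fact A", "fact B"]
result = krux_probe("What is 2+2?", toy_model, toy_extractor)
```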
3. Research Questions and Analytical Framework
KRUX operationalizes three primary research questions (RQs):
- RQ1: To what extent do base LLMs (not fine-tuned for reasoning) benefit from high-quality, externally extracted knowledge ingredients?
- RQ2: Do models enhanced for chain-of-thought (CoT) reasoning gain further from injected expert KIs?
- RQ3: Does reasoning-oriented fine-tuning improve a model’s ability to recall and surface the most relevant internal facts?
Results are computed by comparing the performance gap between standard and KI-augmented settings across different models, with gains as high as 10–20 percentage points observed for base models receiving external KIs. This suggests that failure to retrieve correct task-relevant facts is a critical bottleneck in current scientific LLMs.
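The gap analysis itself is simple arithmetic over per-condition accuracies. The numbers below are hypothetical placeholders, not results from the paper; only the reported 10–20 percentage-point range for expert KIs is from the source:

```python
# Sketch of the percentage-point gain computation KRUX reports:
# each augmented setting versus the base prompt alone.

def pp_gain(base_acc, augmented_acc):
    """Percentage-point gain of an augmented setting over the base."""
    return round((augmented_acc - base_acc) * 100, 1)

accuracies = {                # hypothetical per-condition accuracies
    "base": 0.52,
    "self_extracted_kis": 0.58,
    "expert_kis": 0.67,
}
gains = {name: pp_gain(accuracies["base"], acc)
         for name, acc in accuracies.items() if name != "base"}
# Here the hypothetical expert-KI setting yields a 15.0 pp gain,
# within the 10-20 pp range reported for base models.
```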
4. Key Findings of KRUX Analysis
KRUX analyses yield several principal results:
- Retrieval Bottleneck Dominance: The inability of base models to surface the necessary domain knowledge from their parameters, rather than a lack of reasoning capability, is a leading factor in limited performance. Provision of high-quality KIs remedies this bottleneck.
- Robustness to Knowledge Augmentation: Both base and reasoning-fine-tuned models show significant and systematic improvement when supplied with expert KIs, indicating that external memory modules could complement parametric knowledge stores.
- Reasoning Verbalization as a Knowledge Amplifier: Models trained or supervised with detailed chain-of-thought traces exhibit improved capacity to recall and utilize task-relevant facts. The act of verbalizing the reasoning trajectory enhances the likelihood of extracting pertinent knowledge from parametric weights.
A plausible implication is that complex reasoning training not only equips models for logical multi-step synthesis, but also indirectly improves their factual recall mechanisms in scientific domains.
5. Integration with Benchmark Suites
KRUX is closely coupled to the SciReas and SciReas-Pro evaluation suites:
- SciReas: A diverse aggregation of ten scientific reasoning benchmarks targeting a broad spectrum of task archetypes.
- SciReas-Pro: A highly selective subset wherein all necessary knowledge is made available yet the task retains multi-step complexity, permitting clean separation of reasoning and knowledge requirements.
KRUX’s analysis across these benchmarks under varying conditions (e.g., restricted versus full chain-of-thought token budgets) reveals interaction effects between knowledge provisioning and reasoning trajectory length.
6. Implementation Details and Objective Functions
While KRUX does not introduce novel optimization formulas specific to the framework, it builds upon established supervised fine-tuning (SFT) paradigms with chain-of-thought supervision:
Here, special delimiter tokens demarcate the reasoning segments of each target sequence, aligning with standard SFT objectives on reasoning-augmented datasets. The KI extraction process itself involves parsing reasoning traces through an ingredient extractor to produce atomic statements, which are then programmatically managed in prompt engineering pipelines.
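Under these standard paradigms, the training objective is the usual token-level cross-entropy over reasoning-augmented targets. The formulation below is the generic SFT loss, stated here for completeness rather than quoted from the paper:

```latex
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right)
```

where $x$ is the input query and $y = (y_1, \dots, y_T)$ is the target sequence comprising the delimited reasoning trace followed by the final answer.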
7. Implications and Future Directions
The KRUX framework substantiates that explicit access to relevant domain knowledge is a decisive enabler for LLMs in scientific problem-solving. This suggests a dual trajectory for further research:
- External Memory Augmentation: Integration of retrieval-based or memory-augmented architectures could address retrieval bottlenecks highlighted by KRUX experiments.
- Advanced Reasoning Supervision: Longer and richer chain-of-thought fine-tuning not only improves reasoning but also boosts internal knowledge accessibility.
In sum, KRUX provides a rigorous methodology for partitioning and evaluating the respective contributions of knowledge retrieval and logical reasoning in LLM-based scientific reasoning. Its findings have substantial implications for the design of future automated scientific reasoners, the structure of scientific benchmarks, and the evaluation of LLM training strategies.