The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference (2508.10777v1)

Published 14 Aug 2025 in cs.AI

Abstract: LLMs are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families, Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probe, allowing us to dissociate failures of factual access from failures of inference. We evaluate six contemporary LLMs under both direct and chain of thought prompting. Models achieve near-ceiling GKMRV accuracy (mean accuracy 0.918) yet perform poorly on the main reasoning tasks (mean accuracy 0.25). Despite low accuracy, output inferences are highly consistent across samples (mean 0.87), indicating a systematic application of underlying heuristics and shortcuts. These results reveal fundamental structural and representational limitations: current LLMs often possess the relevant clinical knowledge but lack the structured, composable internal representations needed to deploy it reliably (e.g., integrating constraints, weighing evidence, or simulating counterfactuals). Decoupling knowledge from reasoning with GKMRV makes this dissociation explicit and measurable, providing an effective framework for probing the reliability of LLMs in high-stakes domains.

Summary

  • The paper presents a novel CTNLI benchmark that decouples factual retrieval from inferential reasoning in clinical scenarios.
  • It shows that while LLMs achieve near-ceiling GKMRV accuracy, their performance on reasoning tasks, especially compositional grounding, remains critically low.
  • The study highlights the need for neuro-symbolic integration and disentangled representations to overcome the inherent reasoning limitations of current LLMs.

The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference

Introduction

"The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference" explores the underlying challenges LLMs face in clinical reasoning tasks. By establishing a Clinical Trial Natural Language Inference (CTNLI) benchmark, this paper critically evaluates whether LLMs possess the structured, composable internal representations necessary for effective inference in clinical scenarios. Through a series of targeted reasoning families and Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probes, the paper disentangles factual knowledge retrieval from inferential reasoning, revealing persistent structural flaws in LLM reasoning capabilities.

Benchmark Design and Methodology

The paper introduces a CTNLI benchmark targeting four reasoning competencies: Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Each reasoning task is paired with GKMRV probes to dissociate the possession of factual knowledge from the ability to apply it inferentially. Evaluations were conducted on six contemporary LLMs using both direct and chain-of-thought prompting, effectively isolating the knowledge-reasoning dissociation in high-stakes clinical domains.
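
The summary above does not include the authors' data schema; the following minimal sketch illustrates how a benchmark item might pair the main reasoning task with its GKMRV probe so the two accuracies can be scored separately. Field names and the scoring loop are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class CTNLIItem:
    """Hypothetical structure for one benchmark item (field names are assumptions)."""
    family: str           # e.g. "Causal Attribution", "Compositional Grounding"
    premise: str          # clinical trial scenario text
    hypothesis: str       # statement whose entailment the model must judge
    main_label: str       # gold label for the main reasoning task
    gkmrv_question: str   # paired probe testing only the underlying knowledge
    gkmrv_label: str      # gold label for the knowledge probe

def evaluate(model, items):
    """Score main-task and GKMRV accuracy separately so the two can be dissociated."""
    main_correct = gkmrv_correct = 0
    for item in items:
        # `model` is any callable mapping a prompt string to a predicted label.
        if model(f"{item.premise}\n{item.hypothesis}") == item.main_label:
            main_correct += 1
        if model(item.gkmrv_question) == item.gkmrv_label:
            gkmrv_correct += 1
    n = len(items)
    return {"main_accuracy": main_correct / n, "gkmrv_accuracy": gkmrv_correct / n}
```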

Ground Knowledge and Meta-Level Reasoning Verification

Models achieved near-ceiling GKMRV accuracy (mean accuracy 0.918), indicating proficient factual knowledge retrieval. However, their performance on main reasoning tasks remained poor (mean accuracy 0.25), with extreme failures in Compositional Grounding (0.04 accuracy). This discrepancy underscores a fundamental limitation: while models encode relevant clinical knowledge, they lack structured, composable internal representations for deploying it reliably (Figure 1).

Figure 1: Representative examples of clinical reasoning tasks, and ground knowledge and meta-level reasoning verification (GKMRV). Across tasks, LLMs exhibit fluent but structurally flawed reasoning, despite possessing the necessary ground knowledge.
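
Reading the reported numbers this way, the dissociation can be summarized as the gap between GKMRV accuracy and main-task accuracy, and the reported 0.87 consistency as agreement across repeated samples of the same item. The sketch below encodes those two quantities under that interpretation; the exact metric definitions used in the paper may differ.

```python
from collections import Counter

def knowledge_reasoning_gap(gkmrv_accuracy: float, main_accuracy: float) -> float:
    """Gap between knowledge access and reasoning; larger values mean a stronger dissociation."""
    return gkmrv_accuracy - main_accuracy

def consistency(samples_per_item: list[list[str]]) -> float:
    """Mean fraction of repeated samples that agree with each item's modal answer."""
    scores = []
    for answers in samples_per_item:
        modal_count = Counter(answers).most_common(1)[0][1]
        scores.append(modal_count / len(answers))
    return sum(scores) / len(scores)

# With the headline numbers reported in the paper:
print(knowledge_reasoning_gap(0.918, 0.25))  # ~0.67 average gap across models
```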

Analysis of Reasoning Capabilities

Causal Attribution

In Causal Attribution tasks, models fail to distinguish between observational associations and causal claims, reflecting an absence of coherent internal models for simulating counterfactual outcomes. Despite encoding the necessary causal principles, models often default to heuristics, equating numerical effect size with efficacy and endorsing causal claims when provided with observational data alone.
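
As a purely illustrative toy (not the paper's method), the missing step can be pictured as a check on study design before an effect size is accepted as causal evidence:

```python
def supports_causal_claim(study_design: str, effect_size: float) -> bool:
    """Toy rule: an effect size from observational data alone does not license a causal claim.

    The failure mode described above amounts to skipping this check and
    treating any sizeable effect estimate as evidence of efficacy.
    """
    is_randomized = study_design.lower() in {"rct", "randomized controlled trial"}
    return is_randomized and effect_size > 0.0

print(supports_causal_claim("observational cohort", effect_size=0.45))  # False
print(supports_causal_claim("RCT", effect_size=0.45))                   # True
```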

Compositional Grounding

Models exhibit near-total failure in Compositional Grounding tasks. Despite encoding factual knowledge regarding drug-dose-condition compatibility, models default to decomposing global constraints into isolated associations, failing to capture joint interdependencies. This results in a systematic endorsement of clinically contraindicated configurations.
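
A hypothetical illustration of the difference between pairwise association checks and evaluating the joint drug-dose-condition configuration; the drug names and rules below are invented for the example:

```python
# Invented rules: each one constrains the full (drug, dose_mg, condition) triple.
JOINT_RULES = [
    # A hypothetical drug that is contraindicated at any dose in renal impairment.
    lambda drug, dose_mg, condition: not (
        drug == "drug_x" and condition == "renal impairment"
    ),
    # A hypothetical dose ceiling that applies only to elderly patients.
    lambda drug, dose_mg, condition: not (
        drug == "drug_y" and dose_mg > 50 and condition == "elderly"
    ),
]

def jointly_compatible(drug: str, dose_mg: float, condition: str) -> bool:
    """Evaluate every constraint over the full configuration, not each pair in isolation."""
    return all(rule(drug, dose_mg, condition) for rule in JOINT_RULES)

# Pairwise reasoning ("drug_x is a standard drug", "low doses are usually safe",
# "renal impairment is common") can look acceptable piece by piece,
# while the joint configuration is contraindicated:
print(jointly_compatible("drug_x", 10, "renal impairment"))  # False
```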

Epistemic Verification

Epistemic Verification tasks reveal a breakdown in models' ability to resolve inconsistencies through evidential reasoning. Models often defer epistemically to clinician assertions over evidentiary coherence, suggesting a reliance on heuristics rather than structured reasoning mechanisms capable of prioritizing objective evidence.

Risk State Abstraction

In Risk State Abstraction tasks, models typically falter at simulating potential adverse scenarios and fail to integrate event frequency with severity for risk assessment. Without composable latent structures for evaluating future risks, models struggle to judge clinical urgency or to weigh events of differing severity against one another (Figure 2).

Figure 2: Accuracy gap between main task performance and GKMRV accuracy, across four tasks. Values closer to zero indicate strong alignment between reasoning and underlying knowledge.
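
The frequency-severity integration that the Risk State Abstraction tasks demand can be pictured as a simple frequency-weighted severity calculation; the event table and numbers below are illustrative assumptions, not data from the paper:

```python
# Illustrative adverse events: (annual frequency, severity on an arbitrary 0-10 scale).
ADVERSE_EVENTS = {
    "mild nausea":      (0.30, 1),    # common but low severity
    "severe bleeding":  (0.05, 9),    # uncommon but high severity
    "fatal arrhythmia": (0.001, 10),  # rare, maximal severity
}

def expected_harm(events: dict[str, tuple[float, float]]) -> float:
    """Combine how often each event occurs with how bad it is, rather than ranking by frequency alone."""
    return sum(freq * severity for freq, severity in events.values())

def most_concerning(events: dict[str, tuple[float, float]]) -> str:
    """Rank events by frequency-weighted severity; a less frequent but severe event can outweigh a common mild one."""
    return max(events, key=lambda name: events[name][0] * events[name][1])

print(expected_harm(ADVERSE_EVENTS))    # 0.30*1 + 0.05*9 + 0.001*10 = 0.76
print(most_concerning(ADVERSE_EVENTS))  # "severe bleeding"
```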

Implications and Future Directions

The findings emphasize a pervasive knowledge-reasoning dissociation in LLMs, highlighting a representational deficit that limits their deployment in clinical reasoning tasks. Addressing these limitations requires:

  1. Neuro-symbolic Integration: Integrating symbolic reasoning with neural networks to enhance structured reasoning capabilities.
  2. Representation Disentanglement: Developing methods to promote disentangled, composable latent representations of clinical concepts.
  3. Module Separation: Architecting systems that isolate factual retrieval from reasoning processes to improve transparency and functional reliability.
  4. Explicit Counterfactual Reasoning: Incorporating modules for counterfactual simulation and probabilistic reasoning to support robust clinical inference.
  5. Refined Benchmarks: Expanding benchmarks that target structured reasoning capabilities with controlled, adaptive diagnostic tasks.

Conclusion

This paper highlights a significant gap between factual knowledge encoding and inferential reasoning in LLMs, one that current architectures and scaling strategies cannot bridge. Future research must focus on integrating structured reasoning within these models to achieve reliable, context-sensitive clinical decision-making that moves beyond superficial fluency.
