Robust LLM reasoning under internally discordant evidence

Determine whether contemporary large language model (LLM) systems can reason robustly in scenarios where the available evidence is internally discordant, i.e., where different evidence sources conflict with one another.

Background

The paper emphasizes that most evaluations of LLM systems occur in settings with internally consistent evidence, whereas real-world decision-making, particularly in clinical contexts, often involves conflicting signals that must be reconciled. This raises uncertainty about the reliability of LLM reasoning when evidence is discordant.

To study this question directly, the authors construct MIMIC-DOS, a benchmark dataset that isolates cases where subjective indicators (e.g., pain scores, Richmond Agitation-Sedation Scale [RASS] values) appear reassuring while objective signals (e.g., mean arterial pressure) indicate potential risk. The CARE framework is introduced to address such scenarios while preserving privacy. The open question concerns whether current LLM systems have the fundamental capability to reason robustly in these discordant-evidence settings.
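To make the notion of evidence discordance concrete, the sketch below shows one plausible way such cases could be flagged in ICU records: subjective indicators (pain score, RASS) look reassuring while an objective signal (mean arterial pressure) suggests risk. The field names and thresholds (e.g., MAP below 65 mmHg as hypotensive, pain score of 3 or less as reassuring) are illustrative assumptions for this sketch, not the paper's actual MIMIC-DOS construction rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientSnapshot:
    """A single time-point of ICU evidence (hypothetical schema)."""
    pain_score: Optional[int]   # 0-10 numeric rating scale; low = reassuring
    rass: Optional[int]         # Richmond Agitation-Sedation Scale; 0 = alert and calm
    map_mmhg: Optional[float]   # mean arterial pressure in mmHg


def is_discordant(snap: PatientSnapshot,
                  pain_reassuring_max: int = 3,
                  rass_reassuring: tuple = (-1, 0, 1),
                  map_risk_threshold: float = 65.0) -> bool:
    """Flag cases where subjective signals look reassuring but MAP signals risk.

    Thresholds are illustrative clinical heuristics (assumed, not taken from
    the MIMIC-DOS construction); cases with missing values are not flagged.
    """
    if snap.pain_score is None or snap.rass is None or snap.map_mmhg is None:
        return False
    subjective_reassuring = (snap.pain_score <= pain_reassuring_max
                             and snap.rass in rass_reassuring)
    objective_risky = snap.map_mmhg < map_risk_threshold
    return subjective_reassuring and objective_risky


# Example: a comfortable, calm patient whose MAP nonetheless indicates hypotension.
case = PatientSnapshot(pain_score=1, rass=0, map_mmhg=58.0)
print(is_discordant(case))  # True -> candidate discordant-evidence case
```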

References

"As a result, it remains unclear whether current LLM systems can reason robustly when the available evidence is internally discordant."

CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance (2604.01113 - Liu et al., 1 Apr 2026) in Section 1 (Introduction)