Generalize findings to proprietary reasoning models (Claude, ChatGPT)
Determine whether the observed mismatch between the causal dependencies humans perceive in reasoning texts and the model's actual causal dependencies, identified for DeepSeek-R1 0528 (671B) and Qwen-3 32B on GSM8K, also holds for proprietary systems such as Anthropic's Claude and OpenAI's ChatGPT, given that their internal reasoning processes cannot be examined directly.
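Such a mismatch can in principle be probed with text-only access via counterfactual resampling: truncate the trace just before a step, let a stochastic regeneration continue from there, and measure how often the final answer changes. The sketch below is illustrative only, not the paper's method; `generate`, `extract_answer`, and the prompt format are hypothetical placeholders, and proprietary APIs may not allow continuation from an edited reasoning trace at all.

```python
import random
from typing import Callable, List

def step_importance(generate: Callable[[str], str],
                    extract_answer: Callable[[str], str],
                    question: str,
                    steps: List[str],
                    i: int,
                    n_samples: int = 8) -> float:
    """Counterfactual importance of reasoning step i: regenerate the trace
    from just before step i and count how often the final answer diverges
    from the original. A step that readers judge pivotal, but whose
    resampling rarely changes the answer, is evidence of a gap between
    perceived and actual causal dependencies."""
    original_answer = extract_answer("\n".join(steps))
    prefix = question + "\n" + "\n".join(steps[:i])
    changed = sum(
        extract_answer(generate(prefix)) != original_answer
        for _ in range(n_samples)
    )
    return changed / n_samples

# Toy usage with a stubbed model; a real `generate` would call a model API
# with temperature > 0 so that resampled continuations actually vary.
if __name__ == "__main__":
    steps = ["Step 1: 3 apples + 4 apples = 7 apples.",
             "Step 2: Each apple costs $2.",
             "Step 3: Total cost is 7 * 2 = $14.",
             "Answer: 14"]
    stub = lambda prefix: prefix + "\nAnswer: " + random.choice(["14", "12"])
    answer_of = lambda text: text.split("Answer:")[-1].strip()
    print(step_importance(stub, answer_of, "Apples problem", steps, i=1))
```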
References
While these models provide valuable insights, extending these findings to proprietary systems like Claude or ChatGPT remains an open question, as we lack direct access to examine their reasoning processes.
— Humans Perceive Wrong Narratives from AI Reasoning Texts
(arXiv:2508.16599, Levy et al., 9 Aug 2025) in Limitations — Model Selection