Disentangling context-length effects from theory-of-mind demands in CharToM-QA

Determine whether the primary source of difficulty in the CharToM-QA benchmark arises from processing long input contexts (novel-length passages exceeding 2,000 words) or from the theory-of-mind (ToM) reasoning demands of the questions themselves.

Background

CharToM-QA employs classic novels to study theory-of-mind understanding across characters’ perspectives, but the benchmark’s contexts are lengthy (over 2,000 words). This introduces a potential confound between long-context processing and ToM reasoning.

The paper notes that it is unclear which factor primarily drives the challenge in CharToM-QA, motivating the authors' development of shorter-context datasets that better isolate ToM-related capabilities. Establishing the source of difficulty would help attribute model performance deficits appropriately.
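One way to make the confound concrete is a context-length ablation: evaluate each question under both the full novel-length context and a short, relevance-filtered excerpt, then compare accuracy. The sketch below is illustrative only (the function name, keyword-based excerpting, and window size are assumptions, not part of CharToM-QA); if accuracy rises sharply in the short condition, length is the bottleneck, while flat accuracy points to the ToM reasoning itself.

```python
def make_conditions(context: str, question: str, keyword: str, window: int = 100):
    """Build 'full' and 'short' prompts for the same question.

    The short condition keeps only a window of characters around the
    first occurrence of `keyword`, approximating a relevance-filtered
    excerpt. `keyword` and `window` are hypothetical parameters for
    this sketch, not part of the benchmark.
    """
    idx = context.find(keyword)
    if idx == -1:
        excerpt = context[:window]  # fall back to the opening of the context
    else:
        start = max(0, idx - window // 2)
        excerpt = context[start:start + window]
    return {
        "full": f"{context}\n\nQuestion: {question}",
        "short": f"{excerpt}\n\nQuestion: {question}",
    }

# Toy usage: a repeated sentence stands in for a 2K+ word novel passage.
conds = make_conditions(
    "Anna believed the letter was lost. " * 200,
    "What does Anna believe about the letter?",
    keyword="letter",
)
```

Scoring the same model on both prompt sets and comparing the accuracy gap would attribute the difficulty to one factor or the other.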

References

"However, their questions are based on lengthy novels (over 2K words), making it unclear whether the challenge stems from context length or ToM understanding."

Are LLMs Smarter Than Chimpanzees? An Evaluation on Perspective Taking and Knowledge State Estimation (2601.12410 - Yang et al., 18 Jan 2026) in Section 2, Related Works — Theory of Mind (ToM) in LLMs