Evaluation of Possibility and Improbability in LLMs
The paper "Not quite Sherlock Holmes: LLM predictions do not reliably differentiate impossible from improbable events" critically examines the predictive capabilities of contemporary LLMs (LMs) concerning event possibility. Conducted by Michaelov, Estacio, Zhang, and Bergen, the paper questions LMs' ability to differentiate between impossible and merely improbable events—a crucial understanding for applications in domains requiring precise event interpretation, such as medicine.
Key Findings and Methodology
The authors investigate whether several models, including Llama 3, Gemma 2, and Mistral NeMo, can select possible events over impossible ones, particularly when semantic relatedness and typicality present conflicting cues. Using a methodology built on minimal pairs of sentences, they assess whether models assign higher probabilities to sentences describing possible scenarios than to impossible ones when typicality cues are counterbalanced.
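As a concrete illustration of the minimal-pair setup, the sketch below scores each member of a pair by its summed token log-probability under a causal language model and checks whether the possible sentence wins. The model name, the example sentences, and the scoring choice are assumptions made for illustration, not the authors' exact materials or procedure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative small model; the paper evaluates models such as Llama 3, Gemma 2, and Mistral NeMo.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities the model assigns to the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Predict each token from its left context; the first token has no prediction.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

# Hypothetical minimal pair: possible-but-atypical vs. impossible (invented examples).
possible = "The chef sliced the cheese with a spoon."
impossible = "The chef sliced the cheese with a shadow."

print(sentence_logprob(possible) > sentence_logprob(impossible))
```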
The paper uses a diverse set of stimuli differentiated by event possibility, typicality, and context relatedness, organized into the following conditions (a hypothetical item illustrating each appears after the list):
- Possible-Typical-Related (PTR)
- Possible-Atypical-Related (PAR)
- Possible-Atypical-Unrelated (PAU)
- Impossible-Atypical-Related (IAR)
- Impossible-Atypical-Unrelated (IAU)
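To make the design concrete, here is one hypothetical item sketched across the five conditions; the sentences are invented for illustration and are not the paper's actual stimuli.

```python
# Hypothetical stimulus item (invented for illustration, not the paper's materials).
# Continuations vary in possibility, typicality, and relatedness to the context.
item = {
    "context": "The gardener picked up the watering can and",
    "PTR": "watered the roses.",    # possible, typical, related to gardening
    "PAR": "painted the roses.",    # possible, atypical, still related
    "PAU": "painted the ceiling.",  # possible, atypical, unrelated
    "IAR": "drank the roses.",      # impossible, atypical, related
    "IAU": "drank the ceiling.",    # impossible, atypical, unrelated
}
```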
Each LLM's performance is assessed in both English and Mandarin. Logistic mixed-effects regression is used to explore the influence of semantic similarity and typicality on prediction accuracy.
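A minimal sketch of such an analysis, assuming a trial-level data frame where each row is one minimal-pair comparison; the file and column names are hypothetical, and statsmodels' Bayesian mixed GLM stands in for whatever mixed-effects implementation the authors actually used.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical trial-level data: one row per minimal pair, with `correct` = 1
# if the possible sentence received the higher probability, else 0.
df = pd.read_csv("trials.csv")  # columns: correct, similarity, typicality, item

# Fixed effects for semantic similarity and typicality, plus a random
# intercept (variance component) per stimulus item, fit variationally.
glm = BinomialBayesMixedGLM.from_formula(
    "correct ~ similarity + typicality",
    {"item": "0 + C(item)"},
    data=df,
)
result = glm.fit_vb()
print(result.summary())
```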
Numerical Insights
The findings show that LLMs perform poorly at differentiating impossible from improbable events, at times falling below chance. When tasked with distinguishing possible but atypical events from impossible ones, the models frequently failed, especially when semantic relatedness was a misleading cue. The results underscore that models consistently rely more on contextual relatedness than on actual event possibility, assigning higher probabilities to semantically related yet impossible sentences.
Interestingly, model size and scale, often determinants of performance on other tasks, do not significantly mitigate these limitations. The paper finds that scaling model size and training data does not improve LLMs' ability to discern possible from impossible events when contextual relatedness is not a reliable cue, challenging assumptions about the efficacy of scaling.
Implications for Future Developments
These insights invite reconsideration of the use of current LLMs in sensitive, high-stakes domains. The paper suggests that advances are needed that focus explicitly on world comprehension rather than context-based pattern matching alone. Improvement may require novel architectures or training paradigms that prioritize the integration of world knowledge, potentially moving beyond language-only input toward multimodal learning environments.
Furthermore, the paper highlights that merely increasing model scale or dataset size is insufficient for robust understanding, and it encourages future research to explore more effective learning signals and cross-disciplinary approaches that enhance LLMs' event understanding and decision-making capabilities.
Conclusion
This paper provides important evidence that current LLMs exhibit significant gaps in distinguishing between impossible and improbable events, especially when semantic relatedness and possibility are misaligned. The implications for AI applications in real-world and critical scenarios are profound, suggesting a roadmap for the next generation of LLMs focused on integrating richer cognitive frameworks beyond simple linguistic or statistical patterns. The paper makes a compelling case for advancing LLMs toward a more genuine grasp of the distinction between the impossible and the merely improbable, posing foundational challenges and opportunities for AI research and development.