The OpenToM dataset introduces a new benchmark aimed at evaluating the theory-of-mind (ToM) reasoning capabilities of LLMs through narratives that include character personality traits, motivations, and actions.
Through an innovative creation process involving LLMs and a human-in-the-loop approach, OpenToM ensures realistic and coherently structured narratives for assessing first and second-order ToM reasoning.
Evaluations of state-of-the-art LLMs with OpenToM reveal notable differences in the models' abilities to understand physical versus psychological states, even with advanced prompting techniques like Chain-of-Thought and Simulated-ToM.
The results from OpenToM emphasize the need for further research to improve LLMs' neural theory-of-mind (N-ToM) reasoning, suggesting future directions such as neuro-symbolic approaches and more sophisticated prompting strategies.
The OpenToM dataset is a novel benchmark designed to comprehensively evaluate the Theory-of-Mind (ToM) reasoning capabilities of LLMs. Addressing limitations of existing assessments, OpenToM presents narratives with explicit character personality traits, motivations, and diverse actions to better portray psychological states and intentions. The benchmark is structured to cover a broad spectrum of social interactions, drawing a clear distinction between first-order and second-order ToM reasoning through its narrative and question design.
OpenToM distinguishes itself through its generation process, which combines LLMs with a human-in-the-loop review. This method endows characters with distinct personalities and motivations while ensuring coherent, realistic narratives that reflect genuine human social interactions. Unlike previous benchmarks, which often relied on templated narratives, OpenToM's stories are more natural and complex: actions are motivated by character intentions, enabling a more robust evaluation of LLMs' N-ToM capabilities.
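Concretely, a personality- and intention-driven story draft can be seeded from structured character specifications before human reviewers revise it. The sketch below is only illustrative of that idea; the `Character` class, `narrative_prompt` helper, and prompt wording are hypothetical and not taken from the OpenToM release.

```python
from dataclasses import dataclass

@dataclass
class Character:
    name: str
    personality: str  # e.g. "considerate", "negativistic" (hypothetical labels)
    intention: str    # the motivation behind the character's key action

def narrative_prompt(mover: Character, observer: Character, entity: str) -> str:
    """Assemble an LLM prompt for an OpenToM-style story draft in which
    the mover relocates an entity for a personality-consistent reason."""
    return (
        f"Write a short, natural story. {mover.name} is {mover.personality} "
        f"and moves the {entity} because {mover.intention}. "
        f"{observer.name} may or may not witness the move. "
        "Keep events coherent and consistent with both characters."
    )
```

In a human-in-the-loop setup, drafts generated from such prompts would then be checked and revised by annotators before the narrative and its first- and second-order questions enter the dataset.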
OpenToM was used to evaluate several state-of-the-art LLMs, including different versions of Llama2-Chat, Mixtral-8x7B-Instruct, GPT-3.5-Turbo, and GPT-4-Turbo. The results revealed a significant disparity in the models' abilities to infer physical versus psychological states: most models understood characters' physical states far better than their psychological ones. The effectiveness of advanced prompting techniques such as Chain-of-Thought (CoT) and Simulated-ToM (SimToM) was also assessed. While these techniques improved N-ToM reasoning on some question types, they did not yield consistent gains across all of them, underscoring the challenges LLMs face in fully grasping complex psychological states and social norms.
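The two prompting techniques can be illustrated with simple prompt builders. This is a minimal sketch assuming zero-shot CoT (a "think step by step" cue) and the two-stage perspective-taking structure commonly associated with SimToM; the exact templates used in the OpenToM evaluation may differ, and the function names here are illustrative.

```python
def cot_prompt(narrative: str, question: str) -> str:
    """Zero-shot Chain-of-Thought: append a step-by-step reasoning cue."""
    return (
        f"Story:\n{narrative}\n\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )

def simtom_stage1(narrative: str, character: str) -> str:
    """SimToM stage 1: ask the model to filter the story down to the
    events the target character actually witnessed."""
    return (
        f"Story:\n{narrative}\n\n"
        f"Retell this story from {character}'s perspective, keeping only "
        f"the events that {character} is aware of."
    )

def simtom_stage2(filtered_story: str, question: str, character: str) -> str:
    """SimToM stage 2: answer the question against the filtered story,
    i.e. from within the character's (possibly false) beliefs."""
    return (
        f"Story (as known to {character}):\n{filtered_story}\n\n"
        f"Question: {question}\n"
        f"Answer as {character} would."
    )
```

In use, the model's stage-1 output is fed back in as `filtered_story` for stage 2, so the answering pass never sees events the character missed.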
The findings from OpenToM underscore the existing gaps in the N-ToM reasoning capabilities of LLMs, especially in interpreting the psychological world and leveraging social commonsense. Revealing these shortcomings is vital for guiding future research aimed at overcoming these limitations, thus pushing the boundaries of what LLMs can achieve in understanding and interacting within human-social contexts. OpenToM not only serves as a valuable tool for benchmarking but also as a roadmap for developing more socially aware and emotionally intelligent AI systems.
Looking forward, enhancing LLMs' ToM reasoning capabilities, particularly in discerning nuanced psychological states and integrating social commonsense, is imperative. Research could explore neuro-symbolic approaches or more sophisticated prompting strategies that better capture the complexity of human social interactions. Furthermore, extending the OpenToM dataset to cover even more diverse scenarios and interactions could help comprehensively evaluate and advance the state of N-ToM in LLMs.
OpenToM represents a significant leap forward in the quest to evaluate and improve the theory-of-mind reasoning of LLMs. By addressing previous benchmarks' shortcomings and introducing more complex, realistic scenarios, it sets a new standard for assessing AI's understanding of human psychological states. The insights drawn from the evaluations using OpenToM highlight critical areas for future research, bringing us closer to realizing AI systems capable of genuine social intelligence.