

Neural Theory-of-Mind (N-ToM), a machine's ability to understand and keep track of the mental states of others, is pivotal in developing socially intelligent agents. However, prevalent N-ToM benchmarks have several shortcomings, including ambiguous and artificial narratives, an absence of personality traits and preferences, a lack of questions addressing characters' psychological mental states, and limited diversity in the questions posed. In response to these issues, we construct OpenToM, a new benchmark for assessing N-ToM with (1) longer and clearer narrative stories, (2) characters with explicit personality traits, (3) actions that are triggered by character intentions, and (4) questions designed to challenge LLMs' capabilities of modeling characters' mental states of both the physical and psychological world. Using OpenToM, we reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.


  • The OpenToM dataset introduces a new benchmark aimed at evaluating the theory-of-mind (ToM) reasoning capabilities of LLMs through narratives that include character personality traits, motivations, and actions.

  • Through a creation process that combines LLM generation with human-in-the-loop review, OpenToM aims for realistic and coherently structured narratives for assessing first- and second-order ToM reasoning.

  • Evaluations of state-of-the-art LLMs with OpenToM reveal notable differences in the models' abilities to understand physical versus psychological states, even with advanced prompting techniques like Chain-of-Thought and Simulated-ToM.

  • The results from OpenToM emphasize the need for further research to improve LLMs' N-ToM reasoning, suggesting future directions such as neuro-symbolic approaches and sophisticated prompting strategies.

Overview of the OpenToM Benchmark

The OpenToM dataset is a benchmark for evaluating the Theory-of-Mind (ToM) reasoning capabilities of LLMs. To address limitations in existing assessments, OpenToM presents narratives featuring explicit character personality traits, motivations, and diverse actions, which make characters' psychological states and intentions easier to probe. The benchmark is structured to cover a broad range of social interactions, and its narrative and question design distinguishes clearly between first- and second-order ToM reasoning.

Construction and Innovation

OpenToM distinguishes itself through its generation process, which uses LLMs guided by a human-in-the-loop approach. This method gives characters distinct personalities and motivations and helps keep the narratives coherent, realistic, and reflective of genuine human social interactions. Unlike previous benchmarks, which often relied on templated narratives, OpenToM's stories are more natural and complex: actions are motivated by character intentions, which supports a more robust evaluation of N-ToM capabilities in LLMs.

Evaluation and Insights

OpenToM was used to evaluate several state-of-the-art LLMs, including different versions of Llama2-Chat, Mixtral-8x7B-Instruct, GPT-3.5-Turbo, and GPT-4-Turbo. The results highlighted a significant disparity in the models' abilities to infer physical versus psychological states: most models handled questions about the physical world better than questions about characters' psychological states. The effectiveness of advanced prompting techniques such as Chain-of-Thought (CoT) and Simulated-ToM (SimToM) was also assessed. While these techniques improved N-ToM reasoning in specific areas, they did not significantly enhance performance across all question types, underscoring the challenges LLMs face in fully grasping complex psychological states and social norms.
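A comparison of this kind can be pictured as scoring answers separately for each prompting strategy and question category. The sketch below is only an illustration: the record layout, the category labels, the use of plain accuracy as the metric, and every row of data are assumptions made for the example, not OpenToM's actual format, metric, or results.

```python
from collections import defaultdict

# Hypothetical records: (prompting strategy, question category, model answer, gold answer).
# Every row is a made-up placeholder, not a result from OpenToM.
records = [
    ("vanilla", "physical", "basket", "basket"),
    ("vanilla", "physical", "drawer", "basket"),
    ("vanilla", "psychological", "positive", "negative"),
    ("cot", "physical", "basket", "basket"),
    ("cot", "psychological", "negative", "negative"),
    ("cot", "psychological", "positive", "negative"),
]


def accuracy_by_group(rows):
    """Return accuracy for each (strategy, category) pair found in rows."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for strategy, category, predicted, gold in rows:
        key = (strategy, category)
        totals[key] += 1
        # Case-insensitive exact match; an assumed scoring rule for this sketch.
        correct[key] += int(predicted.strip().lower() == gold.strip().lower())
    return {key: correct[key] / totals[key] for key in totals}


scores = accuracy_by_group(records)
for (strategy, category), acc in sorted(scores.items()):
    print(f"{strategy:8s} {category:14s} {acc:.2f}")
```

Grouping by both strategy and category makes it visible when a prompting technique helps on one question type but not on another, which is the pattern the evaluation describes.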

Practical and Theoretical Implications

The findings from OpenToM underscore the existing gaps in the N-ToM reasoning capabilities of LLMs, especially in interpreting the psychological world and applying social commonsense. Revealing these shortcomings helps guide future research aimed at overcoming them and at extending what LLMs can do when understanding and interacting in human social contexts. OpenToM serves not only as a benchmarking tool but also as a roadmap for developing more socially aware and emotionally intelligent AI systems.

Future Directions

Looking forward, enhancing LLMs' ToM reasoning capabilities, particularly in discerning nuanced psychological states and integrating social commonsense, is imperative. Research could explore neuro-symbolic approaches or more sophisticated prompting strategies that better capture the complexity of human social interactions. Furthermore, extending the OpenToM dataset to include more diverse scenarios and interactions could help in comprehensively evaluating and advancing the state of N-ToM in LLMs.

Summary and Conclusion

OpenToM advances the evaluation of theory-of-mind reasoning in LLMs. By addressing the shortcomings of previous benchmarks and introducing more complex, realistic scenarios, it offers a more demanding test of how well AI systems understand human psychological states. The insights drawn from the evaluations using OpenToM highlight critical areas for future research on AI systems capable of genuine social intelligence.
