Understanding Social Reasoning in LLMs with LLMs
The paper "Understanding Social Reasoning in LLMs with LLMs" presents an innovative framework aimed at assessing the Theory-of-Mind (ToM) capabilities of LLMs. As these models play an increasingly significant role in everyday interactions, understanding their ability to comprehend human mental states becomes crucial.
Key Contributions and Methodology
The authors recognize two primary challenges in evaluating ToM in LLMs: inconsistencies in previous results and concerns about the validity of existing evaluation methods. To address these, they propose a novel approach for generating evaluation scenarios procedurally, using causal templates. This process allows for the systematic generation of a large, diverse benchmark—BigToM—comprising 25 control conditions and 5,000 model-generated evaluations. The key steps in their approach involve:
- Building Causal Templates: Abstracting the domain of interest into a causal graph; here, the graph encodes a ToM scenario, with variables such as desires, percepts, beliefs, and actions.
- Populating Templates: Prompting an LLM such as GPT-4 to fill in these causal templates, crafting scenarios that test the inferences pertinent to ToM; each populated template yields a structured set of evaluations.
- Composing and Testing: Constructing test items by selecting and combining template variables into coherent scenarios, with multiple conditions per scenario (for example, true-belief and false-belief variants) to probe different inferential capabilities. A minimal sketch of this pipeline follows the list.
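To make the template-populate-compose loop concrete, here is a minimal Python sketch of the kind of pipeline the paper describes. The `CausalTemplate` fields, the prompt wording, and the `query_llm` callable are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the causal-template pipeline described above.
# The variable names, prompts, and query_llm() helper are illustrative
# assumptions, not the authors' released implementation.
from dataclasses import dataclass

@dataclass
class CausalTemplate:
    """Abstract causal graph for a ToM scenario; each field is a node."""
    agent: str = ""
    desire: str = ""               # what the agent wants
    percept: str = ""              # what the agent initially perceives
    belief: str = ""               # belief formed from that percept
    causal_event: str = ""         # event that changes the world state
    action_true_belief: str = ""   # action if the agent saw the event
    action_false_belief: str = ""  # action if the agent missed it

def populate_template(seed_context: str, query_llm) -> CausalTemplate:
    """Fill each causal variable in order with an LLM (e.g. GPT-4),
    conditioning every prompt on the variables generated so far."""
    tmpl = CausalTemplate()
    for name in ("agent", "desire", "percept", "belief", "causal_event",
                 "action_true_belief", "action_false_belief"):
        prompt = (f"Scenario so far:\n{seed_context}\n{tmpl}\n\n"
                  f"Write the '{name}' variable for this story.")
        setattr(tmpl, name, query_llm(prompt))
    return tmpl

def compose_condition(tmpl: CausalTemplate, aware: bool) -> dict:
    """Combine variables into one test item; `aware` toggles whether the
    agent perceives the causal event (true- vs. false-belief condition)."""
    story = [tmpl.desire, tmpl.percept, tmpl.causal_event]
    if aware:
        story.append(f"{tmpl.agent} notices this change.")
    return {
        "story": " ".join(story),
        "question": f"What will {tmpl.agent} do next?",
        "options": [tmpl.action_true_belief, tmpl.action_false_belief],
        "answer": tmpl.action_true_belief if aware else tmpl.action_false_belief,
        "condition": "true_belief" if aware else "false_belief",
    }
```

Varying which variables are stated or withheld when composing a scenario is what produces the different control conditions; generating many populated templates this way is how the benchmark scales to thousands of evaluation items.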
Using BigToM, the paper tests these capabilities across several LLMs, including GPT-4, GPT-3.5-turbo, Claude-v1.3, Claude-2, and LLaMA-65B. The results indicate that GPT-4 shows inference patterns akin to human reasoning but remains less reliable, particularly on the more intricate backward belief inferences.
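Given items of the form sketched above, the evaluation itself reduces to querying each model and scoring its answers per condition. Below is a minimal sketch, assuming each model is wrapped as a prompt-to-text callable and that answers are scored by matching the gold option as a substring; both are simplifying assumptions rather than the paper's exact protocol.

```python
# Hypothetical evaluation loop over BigToM-style items; the query interface
# and substring-based scoring are assumptions made for illustration.
from collections import defaultdict

def evaluate(models: dict, items: list[dict]) -> dict:
    """models: name -> callable mapping a prompt string to a completion.
    items: dicts with 'story', 'question', 'options', 'answer', 'condition'."""
    results = defaultdict(lambda: defaultdict(list))
    for name, query_model in models.items():
        for item in items:
            prompt = (f"{item['story']}\n\nQuestion: {item['question']}\n"
                      f"Choose one: {' or '.join(item['options'])}\nAnswer:")
            prediction = query_model(prompt).strip().lower()
            results[name][item["condition"]].append(
                item["answer"].lower() in prediction)
    # Mean accuracy per model and condition (e.g. forward vs. backward belief)
    return {model: {cond: sum(hits) / len(hits) for cond, hits in conds.items()}
            for model, conds in results.items()}
```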
Numerical Results and Claims
The paper emphasizes GPT-4's superior performance over the other models tested, noting near-human ToM patterns in specific conditions such as forward belief and forward action inference tasks. Notably, while GPT-4 approaches human-level accuracy on forward belief inference, it, along with the other LLMs, struggles significantly with backward belief inference, highlighting an area with substantial room for improvement.
Implications and Future Directions
The framework offers a robust method for systematically probing LLMs' reasoning capabilities, providing a scalable and replicable approach to evaluation that does not depend on extensive human-crafted benchmarks. This has significant implications both for the theoretical understanding of LLMs' cognitive abilities and for practical applications, helping to build more socially aware AI systems.
Future developments might extend this framework to more complex social reasoning tasks or to other domains where causal reasoning is pivotal, including real-time interactions and more nuanced scenarios involving second-order beliefs and dynamic perceptions. Additionally, refining the ability to generalize inferences could improve how LLMs are deployed in more integrated social tasks.
Overall, this research lays foundational groundwork for improving AI's alignment with human social cognition, fostering interactions that are not only effective but also intuitive and aligned with human expectations.