
Boosting Theory-of-Mind Performance in Large Language Models via Prompting (2304.11490v3)

Published 22 Apr 2023 in cs.AI and cs.CL

Abstract: LLMs excel in many tasks in 2023, but they still face challenges in complex reasoning. Theory-of-mind (ToM) tasks, which require understanding agents' beliefs, goals, and mental states, are essential for common-sense reasoning involving humans, making it crucial to enhance LLM performance in this area. This study measures the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo), and investigates the effectiveness of in-context learning in improving their ToM comprehension. We evaluated prompts featuring two-shot chain of thought reasoning and step-by-step thinking instructions. We found that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) (all models excluding Davinci-2) improved their ToM accuracy via in-context learning. GPT-4 performed best in zero-shot settings, reaching nearly 80% ToM accuracy, but still fell short of the 87% human accuracy on the test set. However, when supplied with prompts for in-context learning, all RLHF-trained LLMs exceeded 80% ToM accuracy, with GPT-4 reaching 100%. These results demonstrate that appropriate prompting enhances LLM ToM reasoning, and they underscore the context-dependent nature of LLM cognitive capacities.

Enhancing Theory of Mind in LLMs through Prompting

The paper "Boosting Theory-of-Mind Performance in LLMs" presents an in-depth analysis of enhancing Theory of Mind (ToM) capabilities in LLMs, focusing on models from the GPT family: GPT-4 and the GPT-3.5 variants. ToM tasks evaluate a model's ability to comprehend the mental states, beliefs, and goals of agents, a form of complex inference critical for natural language understanding.

Methodology and Experimental Setup

The authors investigated four models: GPT-4 and the GPT-3.5 variants Davinci-2, Davinci-3, and GPT-3.5-Turbo. These models underwent different training procedures; all but Davinci-2 were fine-tuned via Reinforcement Learning from Human Feedback (RLHF), which the results suggest is integral to improved performance on these reasoning tasks. The paper employed standardized ToM and control scenarios, originally used in human studies, to assess and compare model capabilities.
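The paper does not include code, but the general shape of such an evaluation can be sketched. The snippet below is a minimal, hypothetical illustration assuming the OpenAI Python client; the scenario text, the keyword-based scoring in `tom_accuracy`, and the repetition count are invented stand-ins, not the paper's actual materials or protocol.

```python
# Illustrative sketch only: the scenario text, keyword-based scoring, and
# repetition count are hypothetical stand-ins, not the paper's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A false-belief style ToM item in the spirit of those used in human studies.
SCENARIOS = [
    {
        "story": (
            "Sam puts his chocolate in the drawer and leaves the room. "
            "While he is away, Anna moves the chocolate to the cupboard."
        ),
        "question": "Where will Sam look for his chocolate when he returns?",
        "answer": "drawer",
    },
]


def ask(model: str, prompt: str) -> str:
    """Query a chat model once and return its text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def tom_accuracy(model: str, n_repeats: int = 5) -> float:
    """Crude accuracy estimate: does the expected answer appear in the reply?"""
    correct, total = 0, 0
    for item in SCENARIOS:
        prompt = f"{item['story']}\n\nQuestion: {item['question']}"
        for _ in range(n_repeats):
            reply = ask(model, prompt)
            correct += int(item["answer"].lower() in reply.lower())
            total += 1
    return correct / total


if __name__ == "__main__":
    print(f"Zero-shot ToM accuracy (illustrative): {tom_accuracy('gpt-4'):.2f}")
```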

Key prompting approaches explored include zero-shot prompting, step-by-step thinking instructions, and two-shot chain-of-thought (CoT) reasoning, used alone and in combination. The evaluation relied on carefully crafted prompt examples to measure performance shifts, focusing on task accuracy across scenarios, with repeated runs to ensure result reliability.
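As a concrete, hypothetical illustration of how these prompt conditions might be assembled, the templates below show one possible construction; the story, questions, and exemplar wording are invented for illustration and are not the paper's actual prompts.

```python
# Hypothetical prompt templates illustrating the prompting conditions compared
# in the paper; the exemplar wording and instructions are invented here.

STORY = (
    "Sam puts his chocolate in the drawer and leaves the room. "
    "While he is away, Anna moves the chocolate to the cupboard."
)
QUESTION = "Where will Sam look for his chocolate when he returns?"

# 1. Zero-shot: the scenario and question alone.
zero_shot = f"{STORY}\n\nQuestion: {QUESTION}"

# 2. Step-by-step thinking instruction appended to the question.
step_by_step = f"{zero_shot}\nLet's think step by step."

# 3. Two-shot chain-of-thought: worked examples with explicit reasoning,
#    followed by the target question.
cot_exemplars = (
    "Story: Mia leaves her keys on the table. Leo hides them in a box.\n"
    "Question: Where does Mia think her keys are?\n"
    "Reasoning: Mia did not see Leo move the keys, so her belief is unchanged.\n"
    "Answer: on the table.\n\n"
    "Story: Tom is told the library closes at 5, but it actually closes at 4.\n"
    "Question: When does Tom believe the library closes?\n"
    "Reasoning: Tom only has the information he was told, so he believes 5.\n"
    "Answer: at 5.\n\n"
)
two_shot_cot = f"{cot_exemplars}Story: {STORY}\nQuestion: {QUESTION}\nReasoning:"

# 4. Combined: CoT exemplars plus the step-by-step instruction.
cot_plus_step = (
    f"{cot_exemplars}Story: {STORY}\nQuestion: {QUESTION}\n"
    "Let's think step by step.\nReasoning:"
)
```

Any of these prompt strings could be substituted for the zero-shot prompt in the evaluation loop sketched above to compare conditions.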

Findings and Numerical Results

Initial findings under zero-shot conditions revealed that newer models typically performed better on control tasks but showed mixed results on ToM tasks. GPT-4 notably achieved about 80% ToM accuracy, surpassing its predecessors. With CoT prompting, however, most RLHF-trained models improved markedly: GPT-4 attained perfect accuracy when prompted appropriately, and both GPT-3.5-Turbo and Davinci-3 exceeded human-level ToM performance when prompted with CoT and step-by-step reasoning.

The prompting techniques effectively improved LLMs' inferential reasoning abilities, indicating that these enhancements stem from invoking a mode of systematic reasoning rather than mere imitation of reasoning steps.

Implications and Future Directions

These results demonstrate the non-trivial role of prompting in unlocking LLM capabilities for complex tasks such as ToM reasoning. This can inform future work in AI, suggesting that the context of question framing and reasoning process instructions can significantly influence model performance. Furthermore, the paper emphasizes the context-sensitive nature of large models, reminding the research community of the latent potential that suitable prompting techniques may unveil.

This work aligns with ongoing discussions about AI reasoning capabilities, prompting further exploration into general inferential tasks beyond ToM. The results encourage interdisciplinary approaches to model training and refinement, leveraging human feedback and structured reasoning frameworks to enhance the reliability and performance of AI systems in socially themed or context-dependent tasks.

Future research should explore more diverse task frameworks and different categories of inferential reasoning. Expanding the variety of prompts and testing conditions would provide deeper insights into the robustness and scalability of the presented prompting strategies in facilitating reasoning in LLMs.
