
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents (2310.11667v2)

Published 18 Oct 2023 in cs.AI, cs.CL, and cs.LG

Abstract: Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.

Authors (11)
  1. Xuhui Zhou (33 papers)
  2. Hao Zhu (212 papers)
  3. Leena Mathur (13 papers)
  4. Ruohong Zhang (11 papers)
  5. Haofei Yu (17 papers)
  6. Zhengyang Qi (6 papers)
  7. Louis-Philippe Morency (123 papers)
  8. Yonatan Bisk (91 papers)
  9. Daniel Fried (69 papers)
  10. Graham Neubig (342 papers)
  11. Maarten Sap (86 papers)
Citations (73)

