
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios (2410.19346v2)

Published 25 Oct 2024 in cs.CL and cs.CY

Abstract: LLMs are increasingly leveraged to empower autonomous agents to simulate human beings in various fields of behavioral research. However, evaluating their capacity to navigate complex social interactions remains a challenge. Previous studies face limitations due to insufficient scenario diversity, complexity, and a single-perspective focus. To this end, we introduce AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios. Drawing on Dramaturgical Theory, AgentSense employs a bottom-up approach to create 1,225 diverse social scenarios constructed from extensive scripts. We evaluate LLM-driven agents through multi-turn interactions, emphasizing both goal completion and implicit reasoning. We analyze goals using ERG theory and conduct comprehensive experiments. Our findings highlight that LLMs struggle with goals in complex social scenarios, especially high-level growth needs, and even GPT-4o requires improvement in private information reasoning. Code and data are available at https://github.com/ljcleo/agent_sense.
