AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents (2401.06509v3)
Abstract: LLMs have demonstrated their ability to replicate human behaviors across a wide range of scenarios. However, their capability in handling complex, multi-character social interactions has yet to be fully explored, primarily due to the absence of robust, quantitative evaluation methods. This gap has slowed the development of agents proficient in interactions more nuanced than simple exchanges such as small talk. To address this challenge, we introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods. The interaction framework aims to foster a complex interaction environment that bolsters information exchange and intention expression within social interactions. Furthermore, we introduce evaluation methods, including two metrics: Information Exchanging Precision (IEP) and Interaction Expressiveness Gap (IEG), designed for the quantitative and objective assessment of agents' interaction competencies. Our findings highlight the utility of these evaluation methods and show significant potential for improving LLMs' ability to construct agents that interact more naturally, with human-like intricacy.
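To make the flavor of a precision-style metric over exchanged information concrete, here is a minimal illustrative sketch. It is an assumption, not the paper's actual IEP definition: it models each agent's conveyed information as a set of key-value facts and scores the fraction that match a ground-truth reference.

```python
# Hypothetical sketch of an "Information Exchanging Precision"-style metric.
# All names and the fact representation here are illustrative assumptions,
# not AntEval's actual formulation: conveyed information is modeled as
# key-value facts, scored against a ground-truth fact set.

def information_exchange_precision(conveyed: dict, ground_truth: dict) -> float:
    """Fraction of conveyed facts that agree with the reference facts."""
    if not conveyed:
        return 0.0
    correct = sum(1 for k, v in conveyed.items() if ground_truth.get(k) == v)
    return correct / len(conveyed)

# Example: an agent conveys three facts, two of which match the reference.
truth = {"name": "Ada", "job": "engineer", "city": "London"}
heard = {"name": "Ada", "job": "doctor", "city": "London"}
print(information_exchange_precision(heard, truth))  # 2 of 3 facts correct
```

A real metric would additionally need to extract structured facts from free-form dialogue (e.g. via an LLM judge or slot extraction), which this sketch deliberately leaves out.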
Authors: Yuanzhi Liang, Linchao Zhu, Yi Yang