CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments (2411.02305v2)
Abstract: Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and in less than 55% even with function-calling abilities. Our findings highlight the need for stronger function-calling and rule-following capabilities before such agents can be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete its tasks demonstrate direct business value in a popular work environment.
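To make the ReAct-style setup concrete, the sketch below shows a minimal thought–action–observation loop over interconnected CRM objects (a case linked to an order). This is an illustrative assumption, not CRMArena's actual interface: the tool names (`get_case`, `get_order`), object fields, and the scripted policy standing in for an LLM are all hypothetical.

```python
# Minimal sketch of a ReAct-style agent loop over CRM "tools".
# Hypothetical: tool names, object schema, and the scripted policy are
# illustrative stand-ins; a real agent would query an LLM at each step.

CRM = {
    "cases": {"C-1": {"subject": "Refund request", "order_id": "O-9"}},
    "orders": {"O-9": {"status": "Delivered", "amount": 42.0}},
}

def get_case(case_id):    # tool: look up a support case
    return CRM["cases"].get(case_id, "not found")

def get_order(order_id):  # tool: look up the order a case links to
    return CRM["orders"].get(order_id, "not found")

TOOLS = {"get_case": get_case, "get_order": get_order}

def scripted_policy(history):
    """Stand-in for an LLM: returns (thought, action, args) per step."""
    step = len(history)
    if step == 0:
        return ("Need the case details first", "get_case", ["C-1"])
    if step == 1:
        order_id = history[-1][1]["order_id"]
        return ("Case links to an order; fetch it", "get_order", [order_id])
    status = history[-1][1]["status"]
    return ("Order status known; answer", "FINISH",
            [f"Order O-9 status: {status}"])

def react_loop(policy, max_steps=5):
    history = []
    for _ in range(max_steps):
        thought, action, args = policy(history)
        if action == "FINISH":
            return args[0]
        observation = TOOLS[action](*args)  # act, then observe
        history.append((action, observation))
    return "gave up"

print(react_loop(scripted_policy))  # → Order O-9 status: Delivered
```

The loop structure (reason, call a tool, read the observation, repeat) is the part that matters: the benchmark's tasks require chaining several such lookups across highly interconnected objects, which is where current agents fall short.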