CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments (2411.02305v2)

Published 4 Nov 2024 in cs.CL and cs.AI

Abstract: Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.


Summary

  • The paper introduces CRMArena, a benchmark designed to evaluate LLM agents on realistic CRM tasks in a simulated Salesforce environment.
  • The benchmark simulates a comprehensive CRM environment with 16 interconnected objects; experiments show that LLM agents complete fewer than 55% of tasks even with specialized function-calling tools.
  • The findings highlight significant limitations in current LLM agents and motivate further research to enhance their performance in professional CRM applications.

Evaluating LLM Agents in CRM Tasks with CRMArena

The paper introduces CRMArena, a benchmark specifically designed to evaluate the ability of LLM agents to perform customer relationship management (CRM) tasks in realistic professional settings. LLM-based agents present opportunities for automating and enhancing CRM processes, but they currently lack rigorous evaluation tools aligned with real-world workplace demands. CRMArena addresses this gap by simulating a professional CRM environment, hosted within a Salesforce organization, with tasks typically handled by CRM personnel such as service managers, agents, and analysts.

The benchmark synthesizes a comprehensive CRM environment featuring a complex database of 16 interconnected industrial objects, including accounts, orders, and cases, each with inherent dependencies and latent causal relationships (e.g., complaint habits, policy violations). Because real enterprise data cannot be used for privacy reasons, CRMArena relies on an LLM-facilitated data generation pipeline to produce realistic, well-structured data distributions.
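To make this concrete, the sketch below shows one way a latent variable such as a complaint habit could shape generated objects. The object fields, the `sample_account` and `generate_cases` helpers, and the specific rates are illustrative assumptions, not the paper's actual pipeline.

```python
import random

# Hypothetical latent variable controlling the synthetic data distribution;
# the real CRMArena pipeline, schema, and field names may differ.
COMPLAINT_HABITS = {"never": 0.0, "occasionally": 0.2, "frequently": 0.6}

def sample_account(account_id: str) -> dict:
    """Sample an account with a latent complaint habit (hidden from agents)."""
    habit = random.choice(list(COMPLAINT_HABITS))
    return {"AccountId": account_id, "ComplaintHabit": habit}

def generate_cases(account: dict, n_orders: int) -> list:
    """Generate support cases whose frequency depends on the latent habit,
    so downstream objects (orders, cases) inherit a realistic dependency."""
    rate = COMPLAINT_HABITS[account["ComplaintHabit"]]
    cases = []
    for order_id in range(n_orders):
        if random.random() < rate:
            # In a fuller pipeline an LLM could be prompted here to draft a
            # realistic case subject and description for this order.
            cases.append({
                "AccountId": account["AccountId"],
                "OrderId": f"O-{order_id:03d}",
                "Subject": f"Issue with order O-{order_id:03d}",
            })
    return cases

if __name__ == "__main__":
    account = sample_account("A-001")
    print(account)
    print(generate_cases(account, n_orders=10))
```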

CRMArena evaluates agents through nine tasks across three personas: Service Manager, Service Agent, and Service Analyst. The tasks, ranging from New Case Routing to Top Issue Identification, are grounded in the routine challenges these roles face in enterprises. By offering both UI and API access, CRMArena provides a sandbox in which LLM systems can be evaluated for capability and reliability in CRM contexts.
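As a rough illustration of the API-access setting, the following minimal sketch defines a read-only query tool over an in-memory stand-in for the sandbox; the tool name, parameters, and data are assumptions for illustration, not CRMArena's actual interface.

```python
from typing import Optional

# Toy in-memory "sandbox" of CRM cases; the object schema and the tool's
# name and signature are illustrative assumptions, not CRMArena's actual API.
SANDBOX_CASES = [
    {"CaseId": "C-001", "Status": "Open",   "Priority": "High", "OwnerId": "A-7"},
    {"CaseId": "C-002", "Status": "Closed", "Priority": "Low",  "OwnerId": "A-3"},
    {"CaseId": "C-003", "Status": "Open",   "Priority": "Low",  "OwnerId": "A-7"},
]

def search_cases(status: Optional[str] = None, priority: Optional[str] = None) -> list:
    """Read-only query tool that an agent could be given via function calling."""
    results = SANDBOX_CASES
    if status is not None:
        results = [c for c in results if c["Status"] == status]
    if priority is not None:
        results = [c for c in results if c["Priority"] == priority]
    return results

# A function-calling agent would receive this tool's JSON schema and emit a
# structured call such as search_cases(status="Open", priority="High"),
# which the harness executes against the sandbox and returns as an observation.
print(search_cases(status="Open"))
```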

The benchmark exposes current LLM limitations, revealing that existing state-of-the-art agents solve fewer than 55% of tasks even when equipped with manually crafted tools for function calling. These results underscore the complexity and authenticity of the CRMArena challenges and establish its potential as a community-driven evaluation standard for LLM-based agents in business environments.

Numerical Results and Implications

CRMArena delivers a notable insight into LLM agent performance on realistic applications: with standard ReAct prompting, agents complete fewer than 40% of tasks, and even with manually crafted function-calling tools, success stays below 55%. These outcomes point to a significant need for further innovation in LLMs' function-calling and rule-following capabilities within structured data systems.
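For readers less familiar with the two settings being compared, the sketch below shows a toy ReAct-style loop in which the harness parses free-text actions; the `llm` and `execute_tool` callables are placeholders rather than any specific library's API, and the actual CRMArena harness may differ.

```python
import re

def react_loop(llm, execute_tool, prompt: str, max_turns: int = 5) -> str:
    """Toy ReAct harness: the model emits free-text Thought/Action lines,
    the harness parses the Action, runs it, and appends an Observation."""
    transcript = prompt
    for _ in range(max_turns):
        completion = llm(transcript)  # placeholder: any text-completion callable
        transcript += completion
        if "Final Answer:" in completion:
            return completion.split("Final Answer:")[-1].strip()
        action = re.search(r"Action:\s*(\w+)\((.*)\)", completion)
        if action:
            tool_name, raw_args = action.groups()
            observation = execute_tool(tool_name, raw_args)  # placeholder tool runner
            transcript += f"\nObservation: {observation}\n"
    return "no answer produced"

# In the function-calling setting, the fragile regex parsing above is replaced
# by the model emitting a structured tool call directly, which helps explain
# why agents with function-calling tools score higher on the benchmark.
```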

The implications of this research are twofold. Practically, enhanced LLM agents capable of reliably completing tasks in CRMArena could revolutionize CRM processes by reducing manual workloads and improving efficiency. Theoretically, CRMArena provides a structured evaluation pathway for future LLM developments, focusing on better integration and functionality adherence in enterprise settings.

The work invites AI researchers to address identified limitations through agent advancement and improved benchmarking strategies, potentially leading to LLMs with more sophisticated function-calling and rule-following capabilities adaptable to diverse task environments beyond CRM.

Future Directions

This foundational benchmark will likely foster further research into autonomous LLM agents suited to complex professional environments. Researchers might explore hybrid systems that combine rule-based and learning-based approaches for CRM data manipulation, extraction, and interaction. As LLMs continue to evolve, CRMArena could track these advances by adding CRM roles or adapting to industries with different CRM requirements. CRMArena is thus not just a benchmark but a challenge to drive innovation in CRM task automation with LLM agents.
