- The paper introduces CRMArena-Pro, a comprehensive benchmark designed for holistically evaluating LLM agents across diverse business tasks and CRM functions, addressing realism and coverage gaps in existing benchmarks.
- Experimental results show current LLM agents achieve moderate success rates, around 58% for single-turn and 35% for multi-turn tasks, highlighting challenges in complex interactions and multi-turn reasoning.
- A significant finding reveals near-zero inherent confidentiality awareness in agents, which can be improved with targeted prompting but requires further research to enhance safety without compromising efficacy.
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
The paper introduces CRMArena-Pro, a comprehensive benchmark designed for the holistic evaluation of LLM agents within varied professional settings. Addressing numerous shortcomings of existing benchmarks, CRMArena-Pro offers a more realistic and rigorous framework for assessing LLM agents across a spectrum of tasks relevant to business scenarios, including Customer Relationship Management (CRM) functions. The benchmark expands upon the original CRMArena with a broader array of expert-validated tasks spanning sales, service, and both Business-to-Business (B2B) and Business-to-Consumer (B2C) interactions.
Key Features and Experimental Insights
CRMArena-Pro delineates nineteen tasks categorized under four core business skills crucial to CRM systems: Workflow Execution, Policy Compliance, Information Retrieval (Textual Reasoning), and Database Querying (Numerical Computation). The expansion aims to cover crucial processes in sales and Configure, Price, Quote (CPQ) that were previously underrepresented in benchmarks restricted to B2C and customer-service applications. This diversification enables a more comprehensive assessment of LLM capabilities in handling complex, multi-faceted business operations.
The empirical evaluation detailed in the paper reveals that current LLM agents achieve only moderate performance: leading models obtain an approximately 58% success rate on single-turn tasks, dropping sharply to 35% in multi-turn settings. Workflow Execution emerges as the most tractable skill, with success rates surpassing 83% for top-performing models in single-turn scenarios. However, multi-turn reasoning and confidentiality awareness remain significant challenges, reflecting the agents' limitations in sustaining high performance across the complex, dynamic interactions representative of real-world enterprise needs.
Confidentiality Awareness and Implications
A crucial aspect addressed by CRMArena-Pro is confidentiality awareness—a domain often overlooked in prior benchmarks despite its importance in CRM applications. The evaluation framework assesses agents' ability to identify and adhere to data handling protocols when faced with sensitive information queries. The results indicate that LLM agents possess near-zero inherent confidentiality awareness. However, this can be improved with targeted prompting, albeit at the cost of overall task performance—a trade-off that necessitates further research into enhancing models' safety without compromising efficacy.
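The paper does not reproduce its prompt text in this summary, but the "targeted prompting" mitigation can be illustrated with a minimal sketch. The directive wording and the helper `build_system_prompt` below are hypothetical, not the paper's actual prompts; the sketch only shows the general pattern of appending an explicit confidentiality instruction to an agent's system prompt.

```python
# Hypothetical illustration of targeted prompting for confidentiality
# awareness. The directive text and function names are assumptions for
# illustration, not the prompts used in CRMArena-Pro.

CONFIDENTIALITY_DIRECTIVE = (
    "You must not disclose sensitive customer data such as contact "
    "details, account records, or internal pricing. If a request asks "
    "for such data, refuse and explain that it is confidential."
)

def build_system_prompt(base_prompt: str, enforce_confidentiality: bool = True) -> str:
    """Optionally append an explicit confidentiality directive to the
    agent's base instructions."""
    if enforce_confidentiality:
        return f"{base_prompt}\n\n{CONFIDENTIALITY_DIRECTIVE}"
    return base_prompt

prompt = build_system_prompt("You are a CRM service agent.")
```

The paper's finding is that without such an explicit directive (the `enforce_confidentiality=False` case), agents exhibit near-zero refusal behavior on sensitive queries, while adding it improves refusals at some cost to task performance.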
Future Directions and Implications
CRMArena-Pro serves as a challenging testbed for evaluating and advancing LLM agents' competence in nuanced business scenarios. The performance gaps identified highlight areas for future improvement, such as boosting multi-turn reasoning capabilities and refining confidentiality-sensitive interactions. The findings underscore the need for LLMs to evolve not just in language generation and interaction but also in contextual awareness and nuanced policy comprehension.
The benchmark's implications extend beyond CRM systems into the broader landscape of enterprise AI deployment, advocating for development that balances sophisticated task handling with ethical standards in confidentiality. CRMArena-Pro thus serves as a practical tool for guiding LLM design and deployment strategy, helping ensure models are both effective and safe for real-world applications.