
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model (2502.08820v3)

Published 12 Feb 2025 in cs.AI and cs.CL

Abstract: LLMs with API-calling capabilities enabled building effective Language Agents (LA), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA), and our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CoALM (Conversational Agentic LLM), a unified approach that integrates both conversational and agentic capabilities. We created CoALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, CoALM 8B, CoALM 70B, and CoALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single-model approach for both TOD and LA, setting a new standard for conversational agents.

CoALM: A Unified Conversational Agentic LLM

The paper presents the Conversational Agentic LLM (CoALM), a comprehensive solution aimed at bridging a significant gap in the landscape of LLMs, particularly between Task-Oriented Dialogue (TOD) systems and Language Agents (LA). Traditional TOD systems, typically trained on a narrow set of APIs, excel at maintaining user intent across multiple dialogue turns but falter when they must engage with a wide variety of APIs. Conversely, current language agents demonstrate proficiency in function calling but lack the capacity to manage context over multiple turns. This dichotomy motivates the introduction of CoALM, which unifies both functionalities in a single robust system.

The paper begins with an evaluation on three well-established benchmarks to demonstrate the need for a unified approach. These benchmarks (MultiWOZ 2.4 for TOD; BFCL V3 and API-Bank for LA) reveal the specialization and limitations of existing systems. CoALM is trained on CoALM-IT, a multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage, designed to strengthen both conversational and agentic skills.
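To make the interleaving concrete, below is a minimal sketch of what a single CoALM-IT-style training instance could look like. The structure, field names, and the `find_restaurant` API are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical sketch of one interleaved training example: a multi-turn
# dialogue whose assistant turn contains explicit ReAct-style reasoning
# ("Thought") followed by a structured API call ("Action") and its result
# ("Observation"). The find_restaurant API and all field names are invented
# for illustration.
example = {
    "dialogue": [
        {"role": "user",
         "content": "Find me a cheap Italian place in the city centre."},
        {"role": "assistant",
         "content": ("Thought: The user wants a restaurant; cuisine, price "
                     "range, and area are all specified, so I can call the "
                     "search API.\n"
                     "Action: find_restaurant(cuisine='italian', "
                     "price='cheap', area='centre')")},
        {"role": "tool",
         "content": "Observation: [{'name': 'Pizzeria Roma'}]"},
        {"role": "assistant",
         "content": ("Pizzeria Roma is a cheap Italian restaurant in the "
                     "centre. Shall I book a table?")},
    ]
}
# Later turns would reuse earlier slots (cuisine, area) without the user
# restating them -- the multi-turn state tracking that pure LA training
# tends to miss.
```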

CoALM achieves strong numerical results, significantly outperforming leading domain-specific models such as GPT-4o across all three benchmarks. Notably, all three model sizes (CoALM 8B, CoALM 70B, and CoALM 405B) outperform these specialized baselines, a result obtained by interleaving training on a specialized dataset with aligned optimization objectives. Moreover, the results hint at a closing gap between open-source and proprietary systems on high-demand language processing tasks.

The authors identify key domain-specific strengths of CoALM: its sustained performance on MultiWOZ 2.4 indicates effective management of user intent across multi-turn conversations, while its performance on the LA benchmarks, API-Bank and BFCL V3, underscores its ability to execute complex function-calling scenarios, often involving multiple parallel or sequential tool calls.
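As a rough illustration of those two tool-use patterns (not the benchmarks' actual formats), a single turn may require several independent calls issued at once, or calls whose arguments depend on an earlier call's result; the functions below are invented for the example.

```python
# Hypothetical illustration of parallel vs. sequential tool use.
# get_weather, search_airport, and get_flight_price are invented examples.

# Parallel: independent calls that can all be emitted in one turn.
parallel_calls = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "get_weather", "args": {"city": "Rome"}},
]

# Sequential: step 2's argument depends on step 1's result, so the model
# must plan across steps instead of emitting both calls at once.
sequential_plan = [
    {"name": "search_airport", "args": {"city": "Paris"}},
    {"name": "get_flight_price",
     "args": {"from_airport": "<filled from step 1's result>"}},
]
```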

The paper makes substantial contributions to both practical applications and the theoretical understanding of conversational AI. Practically, CoALM offers a single model that interweaves the ability to conduct complex, multi-turn dialogues with the flexibility and adaptability of a robust LA. Its design principles and results point toward conversational agents that need less intensive fine-tuning and fewer piecemeal training datasets, ultimately supporting more fluid, human-like dialogue systems in real-world environments.

Theoretically, the work motivates future research at the intersection of conversation management and tool use, hinting at models capable of evolving with minimal human intervention. The unified model also opens a discussion of further integration with frameworks like reinforcement learning, which might enable adaptive improvement in session handling without constant manual retraining.

Looking forward, developments built on the CoALM framework may propel conversational systems that more closely mimic human interaction, catering to the nuanced needs of users while dynamically leveraging diverse sets of APIs. The paper lays a foundation not only for enhancing user-agent interaction but also for building more agile infrastructure for conversational and agentic AI.

Authors (9)
  1. Emre Can Acikgoz (11 papers)
  2. Jeremiah Greer (2 papers)
  3. Akul Datta (4 papers)
  4. Ze Yang (51 papers)
  5. William Zeng (14 papers)
  6. Oussama Elachqar (5 papers)
  7. Emmanouil Koukoumidis (3 papers)
  8. Dilek Hakkani-Tür (164 papers)
  9. Gokhan Tur (47 papers)