
MARCO: Multi-Agent Real-time Chat Orchestration (2410.21784v1)

Published 29 Oct 2024 in cs.AI, cs.CL, cs.LG, and cs.MA

Abstract: LLM advancements have enabled the development of multi-agent frameworks to tackle complex, real-world problems such as automating tasks that require interactions with diverse tools, reasoning, and human collaboration. We present MARCO, a Multi-Agent Real-time Chat Orchestration framework for automating tasks using LLMs. MARCO addresses key challenges in utilizing LLMs for complex, multi-step task execution. It incorporates robust guardrails to steer LLM behavior, validate outputs, and recover from errors that stem from inconsistent output formatting, function and parameter hallucination, and lack of domain knowledge. Through extensive experiments we demonstrate MARCO's superior performance, with 94.48% and 92.74% accuracy on task execution for the Digital Restaurant Service Platform conversations and Retail conversations datasets respectively, along with 44.91% improved latency and 33.71% cost reduction. We also report the effects of guardrails on performance gain, along with comparisons of various LLM models, both open-source and proprietary. The modular and generic design of MARCO allows it to be adapted for automating tasks across domains and to execute complex use cases through multi-turn interactions.


Summary

  • The paper introduces MARCO, a multi-agent framework for orchestrating complex task automation using LLMs via a modular architecture and structured execution procedures.
  • MARCO enhances robustness and accuracy through guardrails such as output reflection, and by leveraging deterministic task steps, significantly reducing errors and improving reliability.
  • Empirical evaluation shows MARCO improves task accuracy by approximately 30% using guardrails, while reducing latency by 44.91% and costs by 33.71% compared to single-agent systems.

MARCO: Multi-Agent Real-time Chat Orchestration

The paper introduces MARCO, a multi-agent framework for real-time task automation using LLMs. The work addresses the complexity of orchestrating conversations that involve diverse tools, reasoning steps, and multi-operator interactions while maintaining high task-execution accuracy. MARCO demonstrates how a modular architecture, augmented with robust guardrails, can manage the unpredictability and intrinsic non-determinism of LLM output generation.

Conceptual Framework and Key Features

MARCO operates on a multi-agent architecture in which each task is broken down into sub-tasks, each managed by a dedicated agent known as a Task Agent. These agents follow a predefined Task Execution Procedure (TEP), enabling systematic execution through a mix of deterministic and reasoning steps. Central to MARCO is the Multi-Agent Reasoner and Orchestrator (MARS), which interprets queries, plans actions, and coordinates task execution using procedural steps, tools, and deterministic task functions.
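The Task Agent hierarchy and TEP described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the `TEPStep` and `TaskAgent` classes, the shared `state` dictionary, and the refund example are all hypothetical stand-ins for the structure the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch: each Task Agent owns a Task Execution Procedure
# (TEP), a list of steps that read and update a shared state dict.
@dataclass
class TEPStep:
    name: str
    deterministic: bool            # True: plain function call, no LLM
    run: Callable[[Dict], Dict]    # takes shared state, returns updated state

@dataclass
class TaskAgent:
    name: str
    steps: List[TEPStep] = field(default_factory=list)
    children: List["TaskAgent"] = field(default_factory=list)

    def execute(self, state: Dict) -> Dict:
        for step in self.steps:
            state = step.run(state)
        for child in self.children:  # parent agents invoke child agents
            state = child.execute(state)
        return state

# Illustrative hierarchy: a root agent delegates to a lookup agent,
# then to a refund agent, passing state between them.
lookup = TaskAgent("lookup_order", steps=[
    TEPStep("fetch", True, lambda s: {**s, "order": f"order-{s['order_id']}"}),
])
refund = TaskAgent("refund", steps=[
    TEPStep("issue_refund", True, lambda s: {**s, "refunded": True}),
])
root = TaskAgent("root", children=[lookup, refund])
result = root.execute({"order_id": 42})
```

In this sketch the orchestration is purely sequential; in MARCO, MARS would decide at runtime which agent and step to invoke based on the conversation.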

The framework exploits the determinism embedded in task-execution steps by encapsulating those routines as callable functions. Such steps require no reasoning intervention, which improves both response accuracy and latency. Task Agents are structured hierarchically: parent agents invoke child agents as task progression requires, mimicking human-like reasoning and decision-making in task management.
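The routing of deterministic versus reasoning steps can be illustrated as follows. This is a hedged sketch under assumed names: `llm_call` is a stub for a real model call, and the `TOOLS` registry and step schema are hypothetical, not MARCO's actual interface.

```python
# Hypothetical sketch: deterministic TEP steps are dispatched directly to
# registered functions (no LLM call), while reasoning steps would be sent
# to a model. llm_call is a stub standing in for a real LLM invocation.
def llm_call(prompt: str) -> str:
    return f"<llm response to: {prompt}>"

TOOLS = {
    "compute_total": lambda items: sum(price for _, price in items),
}

def run_step(step: dict, state: dict) -> dict:
    if step["deterministic"]:
        tool = TOOLS[step["tool"]]
        state[step["output"]] = tool(state[step["input"]])
    else:
        state[step["output"]] = llm_call(step["prompt"].format(**state))
    return state

state = {"items": [("pizza", 12.0), ("soda", 3.0)]}
state = run_step(
    {"deterministic": True, "tool": "compute_total",
     "input": "items", "output": "total"},
    state,
)
# The order total is computed by a plain function, with no LLM in the loop.
```

The latency and cost benefits the paper reports follow from this split: every step handled deterministically is one fewer model invocation.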

Guardrails for Robustness and Error Management

A key feature of MARCO is its implementation of guardrails to mitigate the LLM's tendency towards errors like misformatted outputs, function and parameter hallucinations, and domain-specific knowledge gaps. These guardrails include techniques for output reflection, where the system prompts the LLM to reconsider and correct its output if it deviates from expected formats or logic. The use of contextual embeddings and shared dynamic memory ensures relevant state information is consistently available to agents, reducing error rates and enhancing performance reliability.
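The output-reflection guardrail can be sketched as a validate-and-retry loop. Everything here is an illustrative assumption: the expected-keys schema, the `call_with_reflection` helper, and the simulated model that corrects itself after one retry are not drawn from the paper's code.

```python
import json

# Hypothetical sketch of output reflection: if the model's response is not
# valid JSON with the expected keys, the validation error is fed back into
# the prompt and the model is asked to correct itself, up to a retry limit.
EXPECTED_KEYS = {"function", "arguments"}

def validate(raw: str):
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc}"
    missing = EXPECTED_KEYS - parsed.keys()
    if missing:
        return None, f"missing keys: {sorted(missing)}"
    return parsed, None

def call_with_reflection(llm, prompt: str, max_retries: int = 2):
    for _ in range(max_retries + 1):
        raw = llm(prompt)
        parsed, error = validate(raw)
        if parsed is not None:
            return parsed
        # Reflect the validation error back to the model.
        prompt = (f"{prompt}\nYour last output was rejected ({error}). "
                  "Reply with valid JSON containing 'function' and 'arguments'.")
    raise ValueError("guardrail: no valid output after retries")

# Simulated model: malformed on the first call, corrected after reflection.
responses = iter(['{"function": "refund"',
                  '{"function": "refund", "arguments": {"order_id": 42}}'])
result = call_with_reflection(lambda p: next(responses), "Issue a refund")
```

Guardrails against function and parameter hallucination work analogously: the validator would additionally check the proposed function name and arguments against the registered tool schemas before execution.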

Empirical Evaluation

The framework's efficacy is validated through experiments on curated datasets, namely the Digital Restaurant Service Platform (DRSP) and Retail conversations datasets, which cover both simple and complex task automations. Across these datasets, the Claude models, particularly Claude 3 Sonnet, excel with accuracy rates of 94.48% and 92.74% on DRSP and Retail conversations, respectively. Notably, the guardrails account for an approximately 30% improvement in task accuracy. MARCO also reduces latency by 44.91% and operational costs by 33.71% compared to single-agent systems.

Practical and Theoretical Implications

MARCO's modular design has significant implications for the development of task automation systems across diverse domains. Its ability to incorporate and manage complex, multi-turn interactions makes it highly adaptable and scalable. The framework also highlights critical aspects of integrating LLMs into real-time systems, such as the need for guardrails and modular task management to maximize performance and reliability. Furthermore, the paper points towards future directions in optimizing LLMs for specific domains, potentially involving further fine-tuning and adjustment of domain knowledge parameters within the model.

Conclusion and Future Prospects

Overall, MARCO represents a substantial advancement in automating complex tasks using LLMs, particularly in environments requiring intricate orchestration of conversation, reasoning, and tool interactions. Future developments could explore even more efficient guardrail mechanisms and the integration of finer-grained control over agent behavior to align more closely with human-like decision-making processes. As LLM technology evolves, frameworks like MARCO will undoubtedly play a pivotal role in realizing sophisticated AI systems that enhance productivity and operational efficiency in real-world applications.
