- The paper introduces TalkHier, a framework for LLM multi-agent systems that uses structured communication and hierarchical refinement to improve collaboration and performance on complex tasks.
- Evaluated on benchmarks including MMLU and WikiQA, TalkHier achieved higher accuracy and stronger text-quality scores (Rouge-1, BERTScore) than a range of single-agent, multi-agent, and proprietary baselines.
- Ablation studies demonstrated that both the structured communication protocol and the hierarchical refinement component are essential for TalkHier's superior performance, despite the framework's relatively high API cost.
The paper introduces Talk Structurally, Act Hierarchically (TalkHier), a novel collaborative LLM-MA (LLM-based Multi-Agent) framework designed to enhance communication and refinement among agents working on complex tasks. TalkHier integrates a structured communication protocol with a hierarchical refinement system, addressing the limitations of disorganized communication and ineffective refinement schemes in existing LLM-MA systems.
The key contributions of TalkHier are:
- a structured communication protocol in which agents exchange not just messages but also intermediate outputs and background information (see the sketch after this list);
- a hierarchical refinement system in which an evaluation team, coordinated by its own supervisor, aggregates feedback from independent evaluators before revision;
- empirical results showing that TalkHier outperforms single-agent, multi-agent, and proprietary baselines across several benchmarks.
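In code, a unit of communication under this protocol can be pictured as follows. This is a minimal sketch assuming the three components the paper ablates (the message itself, intermediate outputs, and background information); the class and field names are illustrative, not the authors' exact schema.

```python
from dataclasses import dataclass

@dataclass
class StructuredMessage:
    """One unit of agent-to-agent communication under the protocol.

    The last three fields mirror the components the paper ablates:
    the message itself, intermediate outputs, and background information.
    Names are illustrative, not the authors' exact schema.
    """
    sender: str                    # name of the sending agent
    recipient: str                 # name of the receiving agent
    message: str                   # the instruction or answer being communicated
    intermediate_output: str = ""  # partial results produced so far
    background: str = ""           # task context the recipient needs
```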
The methodology of TalkHier involves representing the LLM-MA system as a graph $G = (V, E)$, where $V$ denotes the set of agents (nodes) and $E$ represents the set of communication pathways (edges). Each agent $v_i \in V$ is defined by its role $\mathrm{Role}_i$, plugins $\mathrm{Plugins}_i$, memory $\mathrm{Memory}_i$, and type $\mathrm{Type}_i$, which specifies whether the agent is a Supervisor ($S$) or a Member ($M$).
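A minimal sketch of this graph representation is shown below; the class names and field types are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

class AgentType(Enum):
    SUPERVISOR = "S"  # assigns tasks and coordinates a team
    MEMBER = "M"      # executes tasks assigned by a supervisor

@dataclass
class Agent:
    """A node v_i in the graph G = (V, E)."""
    name: str
    role: str                                    # Role_i: the agent's responsibility
    plugins: list = field(default_factory=list)  # Plugins_i: tools the agent may call
    memory: dict = field(default_factory=dict)   # Memory_i: agent-specific state
    type: AgentType = AgentType.MEMBER           # Type_i: Supervisor (S) or Member (M)

# The system itself is the graph G = (V, E): V collects the Agent nodes,
# E the directed communication pathways between them.
V: list[Agent] = []
E: list[tuple[str, str]] = []  # (sender name, recipient name) pairs
```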
The hierarchical structure of agents is defined as:

$$V_{\mathrm{main}} = \{\, v_{\mathrm{main}}^{S},\ v_{\mathrm{main}}^{\mathrm{Gen}},\ v_{\mathrm{eval}}^{S},\ v_{\mathrm{main}}^{\mathrm{Rev}} \,\},$$

$$V_{\mathrm{eval}} = \{\, v_{\mathrm{eval}}^{S},\ v_{\mathrm{eval}}^{E_1},\ v_{\mathrm{eval}}^{E_2},\ \ldots,\ v_{\mathrm{eval}}^{E_k} \,\},$$

where the Main Supervisor ($v_{\mathrm{main}}^{S}$) and the Evaluation Supervisor ($v_{\mathrm{eval}}^{S}$) oversee their respective teams' operations and assign tasks to each member. The Generator ($v_{\mathrm{main}}^{\mathrm{Gen}}$) produces solutions for a given problem, and the Revisor ($v_{\mathrm{main}}^{\mathrm{Rev}}$) refines outputs based on the feedback it receives. The evaluation team comprises $k$ independent evaluators $v_{\mathrm{eval}}^{E_1}, \ldots, v_{\mathrm{eval}}^{E_k}$, each of which outputs evaluation results for a given problem based on its own specified metric.
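Under that notation, the two teams could be instantiated as in the following self-contained sketch; the agent representation, the naming scheme, and the choice $k = 3$ are all illustrative assumptions.

```python
from collections import namedtuple

# Agents reduced to (name, type) pairs; "S" = Supervisor, "M" = Member.
Agent = namedtuple("Agent", ["name", "type"])

k = 3  # number of independent evaluators (illustrative choice)

main_supervisor = Agent("v_main_S", "S")    # coordinates the main team
generator       = Agent("v_main_Gen", "M")  # produces candidate solutions
revisor         = Agent("v_main_Rev", "M")  # refines solutions from feedback
eval_supervisor = Agent("v_eval_S", "S")    # coordinates the evaluation team
evaluators = [Agent(f"v_eval_E{i}", "M") for i in range(1, k + 1)]

# The evaluation supervisor belongs to both teams, matching the paper's
# definitions of V_main and V_eval.
V_main = [main_supervisor, generator, eval_supervisor, revisor]
V_eval = [eval_supervisor, *evaluators]
```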
The hierarchical refinement process, detailed in Algorithm 1 of the paper, involves task assignment, distribution, evaluation, feedback aggregation, and revision. The process iterates until a quality threshold $M_{\mathrm{threshold}}$ is met or a maximum number of iterations $T_{\max}$ is reached.
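A rough sketch of this loop is given below; `generate`, `evaluate_all`, `aggregate`, and `revise` stand in for LLM-backed agent calls and, like the higher-is-better scoring convention and the default values, are assumptions rather than the paper's exact interface.

```python
def hierarchical_refinement(task, generate, evaluate_all, aggregate, revise,
                            m_threshold=0.9, t_max=5):
    """Sketch of the iterative refinement loop (Algorithm 1 in the paper).

    The callables stand in for LLM-backed agents: the Generator, the k
    independent evaluators, the Evaluation Supervisor's feedback
    aggregation, and the Revisor. Defaults and the higher-is-better
    scoring convention are illustrative assumptions.
    """
    solution = generate(task)                     # Generator proposes a solution
    for _ in range(t_max):                        # at most T_max iterations
        evaluations = evaluate_all(solution)      # each evaluator scores its metric
        score, feedback = aggregate(evaluations)  # supervisor aggregates feedback
        if score >= m_threshold:                  # quality threshold M_threshold met
            break
        solution = revise(solution, feedback)     # Revisor refines using feedback
    return solution
```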
The paper explores several research questions:
- RQ1: Does TalkHier outperform existing multi-agent, single-agent, and proprietary approaches on general benchmarks?
- RQ2: How does TalkHier perform on open-domain question-answering tasks?
- RQ3: What is the contribution of each component of TalkHier to its overall performance?
- RQ4: How well does TalkHier generalize to a more practical yet complex generation task?
TalkHier was evaluated on a diverse set of benchmarks: the Massive Multitask Language Understanding (MMLU) benchmark, WikiQA, and a camera dataset for advertisement text generation. The baselines included GPT-4o, OpenAI-o1-preview, ReAct, AutoGPT, AgentVerse, GPTSwarm, and AgentPrune.
The results on the MMLU dataset showed that TalkHier achieves the highest average accuracy (88.38%) across five domains, outperforming open-source multi-agent models such as AgentVerse (83.66%) and majority-voting strategies applied to LLM and single-agent baselines. On the WikiQA dataset, TalkHier outperformed the baselines in both Rouge-1 (0.3461) and BERTScore (0.6079), demonstrating its ability to generate accurate and semantically relevant answers.
Ablation studies demonstrated the contribution of individual components. Removing the Evaluation Supervisor caused a significant drop in accuracy, underscoring the necessity of the hierarchical refinement approach, and stripping intermediate outputs or background information from the communication protocol likewise degraded performance. On the camera dataset, TalkHier outperformed the baselines in Faithfulness, Fluency, and Attractiveness, with a mean gain of approximately 17.63% over the best-performing baseline, OKG, across all metrics.
The paper identifies a limitation of TalkHier as the relatively high API (Application Programming Interface) cost associated with the experiments. The structured interaction among multiple agents increases computational expenses, raising concerns about the accessibility of LLM research for researchers with limited resources.