- The paper introduces TalkHier, a framework for LLM multi-agent systems that uses structured communication and hierarchical refinement to improve collaboration and performance on complex tasks.
- Evaluated on benchmarks including MMLU and WikiQA, TalkHier achieved higher accuracy and stronger text-quality scores (Rouge-1, BERTScore) than a range of single-agent, multi-agent, and proprietary baselines.
- Ablation studies demonstrated that both the structured communication protocol and the hierarchical refinement component are essential for TalkHier's superior performance, despite the framework's relatively high API cost.
The paper introduces Talk Structurally, Act Hierarchically (TalkHier), a novel collaborative LLM-MA (LLM-based Multi-Agent) framework designed to enhance communication and refinement among agents working on complex tasks. TalkHier integrates a structured communication protocol with a hierarchical refinement system, addressing the limitations of disorganized communication and ineffective refinement schemes in existing LLM-MA systems.
The key contributions of TalkHier are:
- a structured communication protocol in which agents exchange not just messages but also intermediate outputs and background information (see the sketch after this list);
- a hierarchical refinement system in which an evaluation team, coordinated by its own supervisor, aggregates feedback from independent evaluators before revision;
- empirical results showing that TalkHier outperforms single-agent, multi-agent, and proprietary baselines across several benchmarks.
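In code, a unit of communication under this protocol can be pictured as follows. This is a minimal sketch assuming the three components the paper ablates (the message itself, intermediate outputs, and background information); the class and field names are illustrative, not the authors' exact schema.

```python
from dataclasses import dataclass

@dataclass
class StructuredMessage:
    """One unit of agent-to-agent communication under the protocol.

    The last three fields mirror the components the paper ablates:
    the message itself, intermediate outputs, and background information.
    Names are illustrative, not the authors' exact schema.
    """
    sender: str                    # name of the sending agent
    recipient: str                 # name of the receiving agent
    message: str                   # the instruction or answer being communicated
    intermediate_output: str = ""  # partial results produced so far
    background: str = ""           # task context the recipient needs
```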
The methodology of TalkHier involves representing the LLM-MA system as a graph $G = (V, E)$, where $V$ denotes the set of agents (nodes) and $E$ represents the set of communication pathways (edges). Each agent $v_i \in V$ is defined by its role $\mathrm{Role}_i$, plugins $\mathrm{Plugins}_i$, memory $\mathrm{Memory}_i$, and type $\mathrm{Type}_i$, which specifies whether the agent is a Supervisor ($S$) or a Member ($M$).
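A minimal sketch of this graph representation is shown below; the class names and field types are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

class AgentType(Enum):
    SUPERVISOR = "S"  # assigns tasks and coordinates a team
    MEMBER = "M"      # executes tasks assigned by a supervisor

@dataclass
class Agent:
    """A node v_i in the graph G = (V, E)."""
    name: str
    role: str                                    # Role_i: the agent's responsibility
    plugins: list = field(default_factory=list)  # Plugins_i: tools the agent may call
    memory: dict = field(default_factory=dict)   # Memory_i: agent-specific state
    type: AgentType = AgentType.MEMBER           # Type_i: Supervisor (S) or Member (M)

# The system itself is the graph G = (V, E): V collects the Agent nodes,
# E the directed communication pathways between them.
V: list[Agent] = []
E: list[tuple[str, str]] = []  # (sender name, recipient name) pairs
```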
The hierarchical structure of agents is defined as:

$$V_{\mathrm{main}} = \{\, v_{\mathrm{main}}^{S},\ v_{\mathrm{main}}^{\mathrm{Gen}},\ v_{\mathrm{eval}}^{S},\ v_{\mathrm{main}}^{\mathrm{Rev}} \,\},$$

$$V_{\mathrm{eval}} = \{\, v_{\mathrm{eval}}^{S},\ v_{\mathrm{eval}}^{E_1},\ v_{\mathrm{eval}}^{E_2},\ \ldots,\ v_{\mathrm{eval}}^{E_k} \,\},$$

where the Main Supervisor ($v_{\mathrm{main}}^{S}$) and the Evaluation Supervisor ($v_{\mathrm{eval}}^{S}$) oversee their respective teams' operations and assign tasks to each member. The Generator ($v_{\mathrm{main}}^{\mathrm{Gen}}$) produces solutions for a given problem, and the Revisor ($v_{\mathrm{main}}^{\mathrm{Rev}}$) refines outputs based on the feedback it receives. The evaluation team comprises $k$ independent evaluators $v_{\mathrm{eval}}^{E_1}, \ldots, v_{\mathrm{eval}}^{E_k}$, each of which outputs evaluation results for a given problem based on its own specified metric.
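Under that notation, the two teams could be instantiated as in the following self-contained sketch; the agent representation, the naming scheme, and the choice $k = 3$ are all illustrative assumptions.

```python
from collections import namedtuple

# Agents reduced to (name, type) pairs; "S" = Supervisor, "M" = Member.
Agent = namedtuple("Agent", ["name", "type"])

k = 3  # number of independent evaluators (illustrative choice)

main_supervisor = Agent("v_main_S", "S")    # coordinates the main team
generator       = Agent("v_main_Gen", "M")  # produces candidate solutions
revisor         = Agent("v_main_Rev", "M")  # refines solutions from feedback
eval_supervisor = Agent("v_eval_S", "S")    # coordinates the evaluation team
evaluators = [Agent(f"v_eval_E{i}", "M") for i in range(1, k + 1)]

# The evaluation supervisor belongs to both teams, matching the paper's
# definitions of V_main and V_eval.
V_main = [main_supervisor, generator, eval_supervisor, revisor]
V_eval = [eval_supervisor, *evaluators]
```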
The hierarchical refinement process, detailed in Algorithm 1 of the paper, involves task assignment, distribution, evaluation, feedback aggregation, and revision. The process iterates until a quality threshold $M_{\mathrm{threshold}}$ is met or a maximum number of iterations $T_{\max}$ is reached.
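A rough sketch of this loop is given below; `generate`, `evaluate_all`, `aggregate`, and `revise` stand in for LLM-backed agent calls and, like the higher-is-better scoring convention and the default values, are assumptions rather than the paper's exact interface.

```python
def hierarchical_refinement(task, generate, evaluate_all, aggregate, revise,
                            m_threshold=0.9, t_max=5):
    """Sketch of the iterative refinement loop (Algorithm 1 in the paper).

    The callables stand in for LLM-backed agents: the Generator, the k
    independent evaluators, the Evaluation Supervisor's feedback
    aggregation, and the Revisor. Defaults and the higher-is-better
    scoring convention are illustrative assumptions.
    """
    solution = generate(task)                     # Generator proposes a solution
    for _ in range(t_max):                        # at most T_max iterations
        evaluations = evaluate_all(solution)      # each evaluator scores its metric
        score, feedback = aggregate(evaluations)  # supervisor aggregates feedback
        if score >= m_threshold:                  # quality threshold M_threshold met
            break
        solution = revise(solution, feedback)     # Revisor refines using feedback
    return solution
```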
The paper explores several research questions:
- RQ1: Does TalkHier outperform existing multi-agent, single-agent, and proprietary approaches on general benchmarks?
- RQ2: How does TalkHier perform on open-domain question-answering tasks?
- RQ3: What is the contribution of each component of TalkHier to its overall performance?
- RQ4: How well does TalkHier generalize to a more practical yet complex generation task?
TalkHier was evaluated on a diverse set of benchmarks: the Massive Multitask Language Understanding (MMLU) benchmark, WikiQA, and a camera dataset for advertisement text generation. The baselines included GPT-4o, OpenAI-o1-preview, ReAct, AutoGPT, AgentVerse, GPTSwarm, and AgentPrune.
The results on the MMLU dataset showed that TalkHier achieves the highest average accuracy (88.38%) across five domains, outperforming open-source multi-agent models such as AgentVerse (83.66%) and majority-voting strategies applied to LLM and single-agent baselines. On the WikiQA dataset, TalkHier outperformed the baselines in both Rouge-1 (0.3461) and BERTScore (0.6079), demonstrating its ability to generate accurate and semantically relevant answers.
Ablation studies demonstrated the contribution of individual components. Removing the Evaluation Supervisor caused a significant drop in accuracy, underscoring the necessity of the hierarchical refinement approach, and stripping intermediate outputs or background information from the communication protocol likewise degraded performance. On the camera dataset, TalkHier outperformed the baselines in Faithfulness, Fluency, and Attractiveness, with a mean gain of approximately 17.63% over the best-performing baseline, OKG, across all metrics.
The paper identifies a limitation of TalkHier as the relatively high API (Application Programming Interface) cost associated with the experiments. The structured interaction among multiple agents increases computational expenses, raising concerns about the accessibility of LLM research for researchers with limited resources.