LLMs face significant challenges in effectively processing long contexts, which are prevalent in real-world tasks like question answering, summarization, and code completion involving documents, books, or codebases far exceeding typical LLM context windows. Existing approaches, primarily input reduction (e.g., RAG) and window extension, have limitations. Input reduction methods like RAG struggle with ensuring relevant information is always included, especially for tasks requiring reasoning across dispersed information. Window extension methods, while increasing capacity, often suffer from the "lost in the middle" problem, where models struggle to focus on pertinent details within very long inputs.
The paper "Chain of Agents: LLMs Collaborating on Long-Context Tasks" (Zhang et al., 4 Jun 2024 ) proposes Chain-of-Agents (CoA), a novel framework inspired by human-like processing, that leverages multi-agent collaboration through natural language to address these long-context challenges. CoA processes the entire input by interleaving reading and reasoning, mitigating the long context focusing issue by assigning each agent a manageable, short context.
CoA consists of two main stages:
- Worker Agent: Segment Comprehension and Chain-Communication: The long input text is split into smaller chunks $c_1, \dots, c_l$, where each chunk fits within the context window limit of an LLM. A sequence of worker agents $W_1, \dots, W_l$ processes these chunks sequentially. Each worker $W_i$ receives the current chunk $c_i$, the original query $q$ (if applicable), a task-specific instruction $I_W$, and a "communication unit" (CU) passed from the previous worker. The worker processes this input with its LLM backbone and generates an updated communication unit to pass to the next worker: $CU_i = \mathrm{LLM}_{W_i}(I_W, CU_{i-1}, c_i, q)$. The CU accumulates relevant information and intermediate reasoning across the chunks; the paper provides examples of CU content for different tasks in its appendix. This sequential communication allows the last worker to have processed information spanning the entire original input, achieving a full receptive field.
- Manager Agent: Information Integration and Response Generation: After the chain of worker agents has processed all chunks, the final communication unit $CU_l$ is passed to a manager agent $M$. The manager, using its own LLM backbone, synthesizes the information in $CU_l$ (along with the original query $q$ and a manager instruction $I_M$) to generate the final response: $R = \mathrm{LLM}_M(I_M, CU_l, q)$. This separation of concerns lets workers focus on chunk-level processing and the manager on global synthesis; a minimal code sketch of the two-stage loop follows.
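A minimal Python sketch of this two-stage loop, assuming a generic `llm(prompt) -> str` callable as a stand-in for the paper's PaLM 2 / Gemini / Claude backbones; the prompt wording and the word-level splitting below are illustrative, not the paper's actual prompts or tokenizer-based splitting:

```python
from typing import Callable, List

def split_into_chunks(words: List[str], chunk_size: int) -> List[str]:
    """Word-level split as a rough stand-in for the paper's token-budget splitting."""
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def chain_of_agents(source_text: str, query: str,
                    llm: Callable[[str], str], chunk_size: int = 6000) -> str:
    """Sketch of CoA: sequential workers pass a communication unit (CU); a manager answers."""
    chunks = split_into_chunks(source_text.split(), chunk_size)

    # Worker stage: each worker reads one chunk plus the CU from its predecessor
    # and emits an updated CU (evidence / intermediate reasoning gathered so far).
    cu = ""
    for chunk in chunks:
        cu = llm(
            f"Question: {query}\n"
            f"Evidence collected so far: {cu}\n"
            f"Text segment: {chunk}\n"
            "Update the evidence with anything in this segment relevant to the question."
        )

    # Manager stage: only the final CU (not the raw text) reaches the manager,
    # which synthesizes the final answer.
    return llm(
        f"Question: {query}\n"
        f"Accumulated evidence: {cu}\n"
        "Answer the question using only the evidence above."
    )
```

Because each worker prompt contains only one chunk plus the running CU, no single call ever approaches the model's window limit, which is the property the analyses below rely on.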
Key Features and Advantages:
- Training Free: CoA is a framework built on top of existing LLMs, requiring no specific training or fine-tuning for the LLMs themselves.
- Task/Length Agnostic: The framework can be applied to various tasks (QA, summarization, code completion demonstrated) and accommodates inputs of arbitrary length by adjusting the number of worker agents.
- Highly Interpretable: The sequential CUs generated by workers provide a step-by-step trace of how information is processed and aggregated.
- Interleaved Read-Process: Unlike RAG's "read-then-process" where information is reduced before LLM processing, CoA's workers process chunks while reading the entire input sequentially.
- Mitigates Focus Issues: By limiting the context of each individual worker to a single short chunk, CoA avoids forcing an LLM to locate information inside an extremely long context window.
- Cost-Effective: The paper shows theoretically that CoA's encoding time complexity is $O(nk)$, where $n$ is the total input length and $k$ is the agent window size. This is more efficient than the $O(n^2)$ cost of a single LLM encoding the full context (when that is feasible at all). Decoding time is similar for both, $O(nr)$, where $r$ is the response length. The proof is given in the paper's appendix.
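To make the cost claim above concrete (the numbers are illustrative, not taken from the paper): with $n = 400{,}000$ input tokens and $k = 8{,}000$-token agent windows, CoA runs $n/k = 50$ workers, each paying roughly $O(k^2)$ for self-attention over its own prompt:

$$
\underbrace{\tfrac{n}{k}\cdot k^{2}}_{\text{CoA encoding}} = nk = 3.2\times 10^{9}
\qquad\text{vs.}\qquad
\underbrace{n^{2}}_{\text{full-context encoding}} = 1.6\times 10^{11},
$$

i.e., a factor-of-$n/k$ (here, 50x) reduction in attention operations on the encoding side.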
Implementation Details:
The paper utilizes existing commercial LLMs such as PaLM 2, Gemini, and Claude 3 via APIs (specifically, Google Cloud's Vertex AI Model Garden). The prompts for workers and managers are crucial for guiding their behavior and are provided in the paper's appendix, with separate templates for query-based and non-query tasks. The input splitting is based on token counts, ensuring each chunk fits within the agent's context window limit $k$.
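A hedged sketch of token-budget splitting, using `tiktoken` purely as a stand-in tokenizer (the paper counts tokens with the respective vendor APIs, and the `reserved` margin below is an illustrative guess for the space taken by the instruction, query, CU, and generated output):

```python
import tiktoken  # stand-in tokenizer; the paper relies on the vendors' own token counting

def split_by_token_budget(text: str, window_limit: int, reserved: int = 2000) -> list[str]:
    """Split `text` into chunks of at most (window_limit - reserved) tokens each."""
    enc = tiktoken.get_encoding("cl100k_base")
    budget = window_limit - reserved  # leave room for instruction, query, CU, and output
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + budget]) for i in range(0, len(ids), budget)]

# e.g., for an 8k-window worker:
# chunks = split_by_token_budget(long_document, window_limit=8000)
```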
Experimental Evaluation:
CoA was evaluated on nine long-context datasets spanning question answering (HotpotQA, MuSiQue, NarrativeQA, Qasper, QuALITY), summarization (QMSum, GovReport, BookSum), and code completion (RepoBench-P). Experiments were conducted using text-bison and text-unicorn (PaLM 2), gemini-ultra, and Claude 3 models (Haiku, Sonnet, Opus) with various context window limits (8k, 32k, and 200k tokens).
Baselines included:
- Vanilla (Full-Context): Directly feeding the input to the LLM up to its context window limit (truncated if necessary).
- RAG: Using a state-of-the-art retriever (BGE embeddings) to retrieve and re-rank relevant chunks, then feeding the top-ranked chunks to the LLM (see the sketch after this list).
- Other Multi-Agent Frameworks: Merge (parallel workers, majority voting) and Hierarchical (tree structure, no sibling communication).
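For comparison, a minimal sketch of the RAG baseline as described above, loading a BGE model through `sentence-transformers` (the specific checkpoint, `top_k`, and prompt are assumptions; the paper feeds as many top-ranked chunks as fit in the target window):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def rag_baseline(chunks: list[str], query: str, llm, top_k: int = 8) -> str:
    """Rank chunks by embedding similarity to the query and answer from the top ones."""
    embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # assumed BGE checkpoint
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec                 # cosine similarity (normalized vectors)
    top = np.argsort(-scores)[:top_k]               # highest-scoring chunks
    context = "\n\n".join(chunks[i] for i in sorted(top))  # keep original document order
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```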
Results:
- CoA (8k) consistently and significantly outperformed Vanilla (8k) and RAG (8k) across all nine datasets and all tested LLMs (PaLM 2, Gemini). Improvements were substantial, e.g., up to 13.30% on NarrativeQA for text-bison.
- When comparing against Long Context Models (Claude 3 with 200k window), CoA (8k) still achieved significantly higher performance on datasets like NarrativeQA and BookSum. This suggests that effective processing is more critical than just having a larger window. The performance gain over Vanilla (200k) and RAG increased with stronger Claude 3 models (Haiku to Opus).
- CoA also outperformed the other multi-agent baselines (Merge and Hierarchical), demonstrating the importance of sequential communication and information aggregation among worker agents.
Analyses:
- RAG vs. CoA: Analysis on NarrativeQA showed that CoA's performance is less dependent on the retriever's accuracy. CoA showed more significant improvements over RAG when the gold answer was located in chunks that RAG failed to retrieve effectively.
- Longer Inputs: On BookSum with Claude 3, CoA's performance not only improved with longer inputs, but its improvement margin over the Vanilla (200k) baseline also grew, becoming especially pronounced for inputs exceeding 400k tokens.
- Lost-in-the-Middle: Experiments on a Natural Questions subset confirmed that CoA effectively mitigates the "lost-in-the-middle" phenomenon observed in the Vanilla model, likely because each worker only processes a short chunk.
- Multi-agent Collaboration: A case study on HotpotQA illustrated how CoA's sequential chain allows workers to build on information from earlier workers, enabling successful multi-hop reasoning where RAG struggled.
- Ablation Studies: Removing the manager agent ("w/o Manager") significantly hurt performance, highlighting its crucial role in synthesizing the final output. Testing different reading orders (Right-to-Left, Permutation) showed that the natural left-to-right order generally yielded the best performance, although other orders could sometimes be better.
- Multi-path CoA: Combining multiple CoA paths (e.g., left-to-right and right-to-left, multiple random permutations) and using methods like majority voting or an LLM judge to select or combine results further enhanced performance, suggesting avenues for future work.
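A minimal sketch of the multi-path idea, assuming a generic `llm` callable and pre-split chunks; the two reading orders and exact-match majority voting shown here are the simplest variant (the paper also explores random permutations and an LLM judge):

```python
from collections import Counter
from typing import Callable, List

def multi_path_coa(chunks: List[str], query: str, llm: Callable[[str], str]) -> str:
    """Run the worker chain over several reading orders and majority-vote the answers."""
    orders = [list(chunks), list(reversed(chunks))]  # left-to-right and right-to-left paths

    answers = []
    for ordered in orders:
        cu = ""  # communication unit accumulated along this path
        for chunk in ordered:
            cu = llm(f"Question: {query}\nEvidence so far: {cu}\nSegment: {chunk}\n"
                     "Update the evidence with anything relevant.")
        answers.append(llm(f"Question: {query}\nEvidence: {cu}\nGive a concise answer."))

    # Exact-string majority voting across paths; an LLM-judge prompt could replace this step.
    return Counter(answers).most_common(1)[0][0]
```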
Limitations:
The paper acknowledges limitations, including the potential for improving communication among agents via fine-tuning or more sophisticated dialogue strategies beyond simple sequential passing of a CU. Other forms of communication like debating were not explored. The cost and latency from multiple API calls could also be areas for further optimization, perhaps through model routing or using more efficient smaller models for some agents.