LLMs face significant challenges in effectively processing long contexts, which are prevalent in real-world tasks like question answering, summarization, and code completion involving documents, books, or codebases far exceeding typical LLM context windows. Existing approaches, primarily input reduction (e.g., RAG) and window extension, have limitations. Input reduction methods like RAG struggle with ensuring relevant information is always included, especially for tasks requiring reasoning across dispersed information. Window extension methods, while increasing capacity, often suffer from the "lost in the middle" problem, where models struggle to focus on pertinent details within very long inputs.
The paper "Chain of Agents: LLMs Collaborating on Long-Context Tasks" (Zhang et al., 4 Jun 2024 ) proposes Chain-of-Agents (CoA), a novel framework inspired by human-like processing, that leverages multi-agent collaboration through natural language to address these long-context challenges. CoA processes the entire input by interleaving reading and reasoning, mitigating the long context focusing issue by assigning each agent a manageable, short context.
CoA consists of two main stages:
- Worker Agent: Segment Comprehension and Chain-Communication: The long input text is split into smaller chunks $c_1, \dots, c_l$, where each chunk fits within the context window limit of an LLM. A sequence of worker agents $W_1, \dots, W_l$ processes these chunks sequentially. Each worker $W_i$ receives the current chunk $c_i$, the original query $q$ (if applicable), a task-specific instruction $I_W$, and a "communication unit" (CU) passed from the previous worker. The worker processes this input with its LLM backbone and generates an updated communication unit to pass to the next worker: $CU_i = \mathrm{LLM}_{W_i}(I_W, CU_{i-1}, c_i, q)$. The CU accumulates relevant information and intermediate reasoning across the chunks; the paper provides examples of CU content for different tasks in its appendix. This sequential communication allows the last worker to have processed information spanning the entire original input, achieving a full receptive field.
- Manager Agent: Information Integration and Response Generation: After the chain of worker agents has processed all chunks, the final communication unit $CU_l$ is passed to a manager agent $M$. The manager, using its own LLM backbone, synthesizes the information in $CU_l$ (along with the original query $q$ and a manager instruction $I_M$) to generate the final response: $R = \mathrm{LLM}_M(I_M, CU_l, q)$. This separation of concerns lets workers focus on chunk-level processing and the manager on global synthesis; a minimal code sketch of the two-stage loop follows.
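A minimal Python sketch of this two-stage loop, assuming a generic `llm(prompt) -> str` callable as a stand-in for the paper's PaLM 2 / Gemini / Claude backbones; the prompt wording and the word-level splitting below are illustrative, not the paper's actual prompts or tokenizer-based splitting:

```python
from typing import Callable, List

def split_into_chunks(words: List[str], chunk_size: int) -> List[str]:
    """Word-level split as a rough stand-in for the paper's token-budget splitting."""
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def chain_of_agents(source_text: str, query: str,
                    llm: Callable[[str], str], chunk_size: int = 6000) -> str:
    """Sketch of CoA: sequential workers pass a communication unit (CU); a manager answers."""
    chunks = split_into_chunks(source_text.split(), chunk_size)

    # Worker stage: each worker reads one chunk plus the CU from its predecessor
    # and emits an updated CU (evidence / intermediate reasoning gathered so far).
    cu = ""
    for chunk in chunks:
        cu = llm(
            f"Question: {query}\n"
            f"Evidence collected so far: {cu}\n"
            f"Text segment: {chunk}\n"
            "Update the evidence with anything in this segment relevant to the question."
        )

    # Manager stage: only the final CU (not the raw text) reaches the manager,
    # which synthesizes the final answer.
    return llm(
        f"Question: {query}\n"
        f"Accumulated evidence: {cu}\n"
        "Answer the question using only the evidence above."
    )
```

Because each worker prompt contains only one chunk plus the running CU, no single call ever approaches the model's window limit, which is the property the analyses below rely on.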
Key Features and Advantages:
- Training Free: CoA is a framework built on top of existing LLMs, requiring no specific training or fine-tuning for the LLMs themselves.
- Task/Length Agnostic: The framework can be applied to various tasks (QA, summarization, code completion demonstrated) and accommodates inputs of arbitrary length by adjusting the number of worker agents.
- Highly Interpretable: The sequential CUs generated by workers provide a step-by-step trace of how information is processed and aggregated.
- Interleaved Read-Process: Unlike RAG's "read-then-process" where information is reduced before LLM processing, CoA's workers process chunks while reading the entire input sequentially.
- Mitigates Focus Issues: By limiting the context of each individual worker to a single short chunk, CoA avoids forcing an LLM to locate information inside an extremely long context window.
- Cost-Effective: The paper shows theoretically that CoA's encoding time complexity is $O(nk)$, where $n$ is the total input length and $k$ is the agent window size. This is more efficient than the $O(n^2)$ cost of a single LLM encoding the full context (when that is feasible at all). Decoding time is similar for both, $O(nr)$, where $r$ is the response length. The proof is given in the paper's appendix.
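To make the cost claim above concrete (the numbers are illustrative, not taken from the paper): with $n = 400{,}000$ input tokens and $k = 8{,}000$-token agent windows, CoA runs $n/k = 50$ workers, each paying roughly $O(k^2)$ for self-attention over its own prompt:

$$
\underbrace{\tfrac{n}{k}\cdot k^{2}}_{\text{CoA encoding}} = nk = 3.2\times 10^{9}
\qquad\text{vs.}\qquad
\underbrace{n^{2}}_{\text{full-context encoding}} = 1.6\times 10^{11},
$$

i.e., a factor-of-$n/k$ (here, 50x) reduction in attention operations on the encoding side.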
Implementation Details:
The paper utilizes existing commercial LLMs such as PaLM 2, Gemini, and Claude 3 via APIs (specifically, Google Cloud's Vertex AI Model Garden). The prompts for workers and managers are crucial for guiding their behavior and are provided in the paper's appendix, with separate templates for query-based and non-query tasks. The input splitting is based on token counts, ensuring each chunk fits within the agent's context window limit $k$.
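A hedged sketch of token-budget splitting, using `tiktoken` purely as a stand-in tokenizer (the paper counts tokens with the respective vendor APIs, and the `reserved` margin below is an illustrative guess for the space taken by the instruction, query, CU, and generated output):

```python
import tiktoken  # stand-in tokenizer; the paper relies on the vendors' own token counting

def split_by_token_budget(text: str, window_limit: int, reserved: int = 2000) -> list[str]:
    """Split `text` into chunks of at most (window_limit - reserved) tokens each."""
    enc = tiktoken.get_encoding("cl100k_base")
    budget = window_limit - reserved  # leave room for instruction, query, CU, and output
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + budget]) for i in range(0, len(ids), budget)]

# e.g., for an 8k-window worker:
# chunks = split_by_token_budget(long_document, window_limit=8000)
```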
Experimental Evaluation:
CoA was evaluated on nine long-context datasets spanning question answering (HotpotQA, MuSiQue, NarrativeQA, Qasper, QuALITY), summarization (QMSum, GovReport, BookSum), and code completion (RepoBench-P). Experiments were conducted using text-bison and text-unicorn (PaLM 2), gemini-ultra, and Claude 3 models (Haiku, Sonnet, Opus) with various context window limits (8k, 32k, and 200k tokens).
Baselines included:
- Vanilla (Full-Context): Directly feeding the input to the LLM up to its context window limit (truncated if necessary).
- RAG: Using a state-of-the-art retriever (BGE embeddings) to retrieve and re-rank relevant chunks, then feeding the top-ranked chunks to the LLM (see the sketch after this list).
- Other Multi-Agent Frameworks: Merge (parallel workers, majority voting) and Hierarchical (tree structure, no sibling communication).
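For comparison, a minimal sketch of the RAG baseline as described above, loading a BGE model through `sentence-transformers` (the specific checkpoint, `top_k`, and prompt are assumptions; the paper feeds as many top-ranked chunks as fit in the target window):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def rag_baseline(chunks: list[str], query: str, llm, top_k: int = 8) -> str:
    """Rank chunks by embedding similarity to the query and answer from the top ones."""
    embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # assumed BGE checkpoint
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec                 # cosine similarity (normalized vectors)
    top = np.argsort(-scores)[:top_k]               # highest-scoring chunks
    context = "\n\n".join(chunks[i] for i in sorted(top))  # keep original document order
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```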
Results:
- CoA (8k) consistently and significantly outperformed Vanilla (8k) and RAG (8k) across all nine datasets and all tested LLMs (PaLM 2, Gemini). Improvements were substantial, e.g., up to 13.30% on NarrativeQA for text-bison.
- When comparing against Long Context Models (Claude 3 with 200k window), CoA (8k) still achieved significantly higher performance on datasets like NarrativeQA and BookSum. This suggests that effective processing is more critical than just having a larger window. The performance gain over Vanilla (200k) and RAG increased with stronger Claude 3 models (Haiku to Opus).
- CoA also outperformed the other multi-agent baselines (Merge and Hierarchical), demonstrating the importance of sequential communication and information aggregation among worker agents.
Analyses:
- RAG vs. CoA: Analysis on NarrativeQA showed that CoA's performance is less dependent on the retriever's accuracy. CoA showed more significant improvements over RAG when the gold answer was located in chunks that RAG failed to retrieve effectively.
- Longer Inputs: On BookSum with Claude 3, CoA's performance not only improved with longer inputs, but its improvement margin over the Vanilla (200k) baseline also grew, becoming especially pronounced for inputs exceeding 400k tokens.
- Lost-in-the-Middle: Experiments on a Natural Questions subset confirmed that CoA effectively mitigates the "lost-in-the-middle" phenomenon observed in the Vanilla model, likely because each worker only processes a short chunk.
- Multi-agent Collaboration: A case study on HotpotQA illustrated how CoA's sequential chain allows workers to build on information from earlier workers, enabling successful multi-hop reasoning where RAG struggled.
- Ablation Studies: Removing the manager agent ("w/o Manager") significantly hurt performance, highlighting its crucial role in synthesizing the final output. Testing different reading orders (Right-to-Left, Permutation) showed that the natural left-to-right order generally yielded the best performance, although other orders could sometimes be better.
- Multi-path CoA: Combining multiple CoA paths (e.g., left-to-right and right-to-left, multiple random permutations) and using methods like majority voting or an LLM judge to select or combine results further enhanced performance, suggesting avenues for future work.
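A minimal sketch of the multi-path idea, assuming a generic `llm` callable and pre-split chunks; the two reading orders and exact-match majority voting shown here are the simplest variant (the paper also explores random permutations and an LLM judge):

```python
from collections import Counter
from typing import Callable, List

def multi_path_coa(chunks: List[str], query: str, llm: Callable[[str], str]) -> str:
    """Run the worker chain over several reading orders and majority-vote the answers."""
    orders = [list(chunks), list(reversed(chunks))]  # left-to-right and right-to-left paths

    answers = []
    for ordered in orders:
        cu = ""  # communication unit accumulated along this path
        for chunk in ordered:
            cu = llm(f"Question: {query}\nEvidence so far: {cu}\nSegment: {chunk}\n"
                     "Update the evidence with anything relevant.")
        answers.append(llm(f"Question: {query}\nEvidence: {cu}\nGive a concise answer."))

    # Exact-string majority voting across paths; an LLM-judge prompt could replace this step.
    return Counter(answers).most_common(1)[0][0]
```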
Limitations:
The paper acknowledges limitations, including the potential for improving communication among agents via fine-tuning or more sophisticated dialogue strategies beyond simple sequential passing of a CU. Other forms of communication like debating were not explored. The cost and latency from multiple API calls could also be areas for further optimization, perhaps through model routing or using more efficient smaller models for some agents.