
MoA is All You Need: Building LLM Research Team using Mixture of Agents (2409.07487v2)

Published 4 Sep 2024 in q-fin.CP

Abstract: LLM research in the financial domain is particularly complex due to the sheer number of approaches proposed in literature. Retrieval-Augmented Generation (RAG) has emerged as one of the leading methods in the sector due to its inherent groundedness and data source variability. In this work, we introduce a RAG framework called Mixture of Agents (MoA) and demonstrate its viability as a practical, customizable, and highly effective approach for scaling RAG applications. MoA is essentially a layered network of individually customized small LLMs (Hoffmann et al., 2022) collaborating to answer questions and extract information. While there are many theoretical propositions for such an architecture and even a few libraries for generally applying the structure in practice, there are limited documented studies evaluating the potential of this framework considering real business constraints such as cost and speed. We find that the MoA framework, consisting of small LLMs (Hoffmann et al., 2022), produces higher quality and more grounded responses across various financial domains that are core to Vanguard's business while simultaneously maintaining low costs.

Summary

  • The paper demonstrates that the MoA framework leverages a network of customized LLM agents to enhance response quality and mitigate error propagation.
  • The paper shows that MoA scales efficiently by processing tens of thousands of documents with specialized agents, optimizing cost and context windows.
  • The paper highlights that ensemble LLM systems in the MoA approach outperform single-model setups by delivering more reliable and grounded outputs.

The paper introduces a Mixture of Agents (MoA) framework for Retrieval-Augmented Generation (RAG) applications, emphasizing its practical viability, customizability, and effectiveness in scaling RAG. The MoA system uses a layered network of customized small LLMs collaborating to answer questions and extract information. The authors posit that while theoretical frameworks and libraries exist for such architectures, documented studies evaluating them under real-world business constraints, such as cost and speed, are limited. The authors find that the MoA framework produces higher quality and more grounded responses across various financial domains while maintaining low costs.

The authors note that single-model approaches are often less effective than multi-model (ensemble) approaches because ensemble models benefit from the consensus of multiple models, each receiving slightly different inputs, which enhances confidence in predictive outcomes. Ensemble models also generalize better to new information. Recent research has shifted towards sparse ensembles of LLMs due to their lower hallucination rates, improved output quality, and enhanced information surfacing capabilities [cheng_2023_unlocking, shen_2024_learning, gordon_2023_multiai]. Arranging multiple LLMs in sequence or parallel creates intricate networks that resemble organizational structures [chuang_2023_simulating]. LLMs that can perform actions based on information from databases and APIs are termed "agents," and systems of multiple agents are referred to as "Socratic AI" or "Agentic AI" [zeng_2022_socratic]. The authors define a MoA system as an ensemble of agents, each with unique characteristics such as customized linking, prompting, and knowledge. Existing literature primarily explores ensemble LLMs theoretically, focusing on whether errors are corrected or compounded in these systems [li_2024_more, guo_2024_large]. The main drawbacks of ensemble LLMs are cost and speed.

The paper draws inspiration from Mistral AI’s Mixture of Experts (MoE) model, Mixtral 8x7B [jiang_2024_mixtral]. While MoE applies ensemble learning within a single model, MoA applies it across multiple models. The GPT-4 model is rumored to be an impactful implementation of MoE, with "GPTs" representing OpenAI's exploration of agents. Libraries like AIFlows, LangChain, and Microsoft AutoGen enable programmatic composition of agents and LLMs [introduction, josifoski_2024_flows]. The Vanguard Investment Management Fintech Strategies (IMFS) team suggests that MoA meets the constraints of cost and user experience.

The MoA framework supports specialized small LLM agents working together to answer questions. These agents operate in ways that mimic organizational hierarchies, producing higher quality outputs with transparency. The agents are information gatherers, each possessing its own internal knowledge, external knowledge bases [ding_2024_entgpt], prompts, groundings, abilities, and connections with other agents, enabling diverse views that converge to form a final response. A robust MoA system consisting of small LLMs is cost-effective and, when combined with good data engineering practices, can achieve speed and scale.
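
To make that structure concrete, here is a minimal sketch of what such an agent might look like. The `Agent` class, its fields, and the generic `llm` callable are illustrative assumptions for this summary, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable

# Minimal sketch of an MoA-style agent: each agent owns its prompt, its
# slice of the knowledge base, and links to the upstream agents whose
# outputs it consumes, mirroring the "connections" described above.

@dataclass
class Agent:
    name: str
    system_prompt: str                      # agent-specific instructions
    knowledge: list[str]                    # this agent's document slice
    upstream: list["Agent"] = field(default_factory=list)

    def answer(self, question: str, llm: Callable[[str], str]) -> str:
        # Gather views from upstream agents first (layered collaboration),
        # then fold them into this agent's own grounded context.
        peer_views = [a.answer(question, llm) for a in self.upstream]
        context = "\n\n".join(self.knowledge + peer_views)
        return llm(f"{self.system_prompt}\n\nContext:\n{context}\n\nQ: {question}")
```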

In the MoA framework, an agent plays a role akin to a junior researcher, but with tremendous potential. By customizing the knowledge accessible to each agent, the system can develop diversified yet intelligent agents with domain understanding and specialization. The split-agent approach offers higher response quality than a single-model approach because each individual agent can be customized. Pipelines of agents can be constructed to complete high-level tasks efficiently, reminiscent of a research team: agents with different customizations collaborate on a common problem, a planner selects the questions, and an aggregator combines the agents’ responses. The flexibility of MoA lies in the fact that agents can be replaced by heuristics, API calls, or any other subprocess that feeds additional information into the aggregator or other agents, so the system can grow arbitrarily complex. The MoA system at Vanguard’s IMFS team has scaled to analyze tens of thousands of documents simultaneously. The authors argue that "compounding error" arises only in a single serial stream of models, not in MoA's layered, parallel arrangement.
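
A hedged sketch of that planner → agents → aggregator flow, assuming a generic `llm(prompt) -> str` endpoint; the prompts, function names, and two-layer layout are illustrative, not Vanguard IMFS's actual configuration:

```python
from typing import Callable

LLM = Callable[[str], str]  # placeholder for any chat-completion endpoint

def plan(llm: LLM, user_question: str) -> list[str]:
    # Planner: split the user's question into sub-questions, one per agent.
    raw = llm(f"Split into one research sub-question per line:\n{user_question}")
    return [line.strip() for line in raw.splitlines() if line.strip()]

def run_pipeline(llm: LLM, agents: list[Callable[[str], str]], question: str) -> str:
    sub_questions = plan(llm, question)
    # Layer 1: each "agent" may be an LLM call, a heuristic, or a plain API
    # call, which is the replaceability point made above.
    drafts = [agent(q) for agent, q in zip(agents, sub_questions)]
    # Layer 2: the aggregator combines the drafts into one grounded answer.
    return llm("Combine these expert notes into one answer:\n" + "\n---\n".join(drafts))
```

Because an agent here is just a callable, swapping one out for a heuristic or an API lookup is a one-line change, which is what makes the topology arbitrarily extensible.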

The authors find that an interwoven network of models outperforms any single workstream, and that as the system scales and layers of abstraction are added, latency increases but so does the system's potential. MoA enhances the information surfacing capabilities of any RAG implementation, thereby increasing the quality of the output.

The authors address the concern regarding context windows in RAG systems [liu_2023_lost]. MoA augments the effective context window of the system by splitting the context among multiple expert agents, allowing for a higher degree of precision and reducing the probability of "lost in the middle" issues. Customizing prompts for agents based on their data source can improve output quality and insight. Vanguard employs MoA to extract and surface insights from tens of thousands of documents.
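
One simple way to realize that splitting, sketched here under assumed parameters (an even round-robin split and an 8k-token window; the paper describes the idea, not this exact scheme):

```python
def partition(docs: list[str], n_agents: int) -> list[list[str]]:
    """Round-robin a corpus so each agent sees roughly 1/N of the documents.

    In practice, shards could instead follow document type or domain,
    which pairs naturally with the per-source prompt customization above.
    """
    shards: list[list[str]] = [[] for _ in range(n_agents)]
    for i, doc in enumerate(docs):
        shards[i % n_agents].append(doc)
    return shards

# With N agents each holding an (assumed) 8k-token window, the system can
# jointly attend to ~N * 8k tokens of retrieved material per question --
# the "augmented effective context window" described above.
```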

The paper compares MoA with single-model providers, such as Anthropic’s Claude 3 Opus and OpenAI’s GPT-4, using Apple’s Q1 2023 earnings transcript and 10-Q filings. The models were asked questions and graded on how much vital information their responses captured. The analysis demonstrates that a MoA system consisting of two Mistral-based agents (each with 7B parameters) competes with larger systems.
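
The summary describes the grading only as "amount of vital information captured"; a crude string-matching proxy for that kind of recall-based score might look like this (the function and the sample facts are illustrative, not the paper's rubric):

```python
def vital_info_score(response: str, vital_points: list[str]) -> float:
    """Fraction of pre-listed vital facts that appear in a response."""
    hits = sum(1 for point in vital_points if point.lower() in response.lower())
    return hits / len(vital_points)

# Example with two facts of the kind an earnings-call grader might list:
score = vital_info_score(
    "Revenue declined 5% year over year; iPhone revenue was $65.8B.",
    ["revenue declined 5%", "iphone revenue was $65.8b"],
)
print(score)  # 1.0
```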

The MoA system is cost-effective and simple. In its simplest form, MoA can be run with a single model and endpoint, invoked as many times as necessary to perform inference through the various layers. The drawback of MoA is its higher demand for concurrent inference: single-model systems can support more users because each user accesses only one endpoint, whereas MoA requires at least two endpoints per user. Vanguard IMFS’s MoA system has a lower cost than most third-party RAG providers, with a total run cost of under $8,000 per month, and can search and surface information from over 30,000 documents in under 60 seconds using two layers of agents. The latency penalty for implementing MoA is approximately 4.07x, or 2.24x when running inference in parallel. The speed and context-window improvements of MoA scale linearly with the number of models in the system.
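
The gap between the serial (4.07x) and parallel (2.24x) latency penalties follows from the fact that first-layer agents are independent. A minimal sketch of the parallel fan-out, with placeholder agent names and a simulated round-trip standing in for real endpoint calls:

```python
import asyncio

async def ask_agent(name: str, question: str) -> str:
    # Placeholder for a real endpoint call (e.g. an HTTP POST per agent).
    await asyncio.sleep(1.0)  # simulate one model round-trip
    return f"{name}: draft answer to {question!r}"

async def run_moa_parallel(question: str) -> str:
    # Layer 1: independent agents fan out concurrently, so the layer costs
    # one round-trip of wall-clock time rather than one per agent.
    drafts = await asyncio.gather(
        ask_agent("filings_agent", question),
        ask_agent("transcript_agent", question),
    )
    # Layer 2: the aggregator is the only remaining serial step.
    return await ask_agent("aggregator", "\n".join(drafts))

# asyncio.run(run_moa_parallel("What drove margin changes?"))  # ~2s, not ~3s
```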

The authors conclude that MoA using small LLMs should be the standard for enterprise-grade RAG pipelines. Performance may be improved by employing lower cost-per-token providers such as Fireworks AI or Groq.
