
Distributed Mixture-of-Agents for Edge Inference with Large Language Models (2412.21200v1)

Published 30 Dec 2024 in cs.IT, cs.CL, cs.DC, cs.LG, cs.NI, and math.IT

Abstract: Mixture-of-Agents (MoA) has recently been proposed as a method to enhance performance of LLMs, enabling multiple individual LLMs to work together for collaborative inference. This collaborative approach results in improved responses to user prompts compared to relying on a single LLM. In this paper, we consider such an MoA architecture in a distributed setting, where LLMs operate on individual edge devices, each uniquely associated with a user and equipped with its own distributed computing power. These devices exchange information using decentralized gossip algorithms, allowing different device nodes to talk without the supervision of a centralized server. In the considered setup, different users have their own LLM models to address user prompts. Additionally, the devices gossip either their own user-specific prompts or augmented prompts to generate more refined answers to certain queries. User prompts are temporarily stored in the device queues when their corresponding LLMs are busy. Given the memory limitations of edge devices, it is crucial to ensure that the average queue sizes in the system remain bounded. In this paper, we address this by theoretically calculating the queuing stability conditions for the device queues under reasonable assumptions, which we validate experimentally as well. Further, we demonstrate through experiments, leveraging open-source LLMs for the implementation of distributed MoA, that certain MoA configurations produce higher-quality responses compared to others, as evaluated on AlpacaEval 2.0 benchmark. The implementation is available at: https://github.com/purbeshmitra/distributed_moa.

Distributed Mixture-of-Agents for Edge Inference with LLMs

The paper under review introduces an architectural framework termed "Distributed Mixture-of-Agents" (MoA) for edge inference with LLMs. It studies a distributed environment in which LLMs run on individual edge devices, each associated with a specific user. These devices communicate via decentralized gossip algorithms, eliminating the need for a centralized server and improving system robustness.

Overview and Methodology

The research builds on the Mixture-of-Agents concept, in which multiple LLMs collaborate to improve the quality of responses to user prompts. In the proposed distributed setup, each edge device manages its own user-specific LLM and processes incoming prompts in concert with other devices through a gossip-based communication protocol. This improves resource utilization and response quality by letting the LLMs exchange prompts and intermediate responses.
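As a rough illustration of this setup, the sketch below models devices that hold user prompts in a queue, gossip with randomly chosen peers, and aggregate peer drafts with their own LLM output. The class, method names, and placeholder generation are illustrative assumptions, not the interface of the paper's released implementation.

```python
# Minimal sketch of one gossip round in a distributed MoA setup (illustrative only).
import random
from dataclasses import dataclass, field

@dataclass
class DeviceNode:
    node_id: int
    prompt_queue: list = field(default_factory=list)    # user prompts waiting for the local LLM
    peer_responses: dict = field(default_factory=dict)   # drafts gossiped in from other devices

    def local_llm(self, prompt: str) -> str:
        # Placeholder for the device's own model (e.g., an open-source LLM).
        return f"[node {self.node_id} answer to: {prompt}]"

    def gossip_step(self, peers: list["DeviceNode"]) -> None:
        """Pick a random peer (serverless gossip) and request a draft for the head-of-line prompt."""
        if not self.prompt_queue:
            return
        prompt = self.prompt_queue[0]
        peer = random.choice(peers)  # peers excludes this node in this sketch
        self.peer_responses.setdefault(prompt, []).append(peer.local_llm(prompt))

    def aggregate(self, prompt: str) -> str:
        """Combine peer drafts with the local answer (the MoA aggregation step)."""
        drafts = "\n".join(self.peer_responses.get(prompt, []))
        return self.local_llm(f"{prompt}\nPeer drafts:\n{drafts}")
```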

The LLMs on edge devices exchange user prompts or augmented prompts with their peers to produce more refined responses. The paper provides an in-depth theoretical analysis of queuing stability, ensuring that the system maintains bounded average queue sizes despite the limited memory resources of edge devices. This theoretical work is supported by experimental validation, where various configurations of the distributed MoA system were tested using open-source LLMs on the AlpacaEval 2.0 benchmark. The results reveal that certain MoA configurations yield superior response quality.
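To see why bounded queues matter, the toy simulation below contrasts a stable and an unstable discrete-time single-server queue: when prompts arrive faster than the local LLM can serve them, the average queue length grows without bound. The rates and simplified dynamics are assumptions for illustration, not the stability conditions derived in the paper.

```python
# Toy queue simulation: average queue size stays bounded only when the
# arrival rate is below the effective service rate (illustrative assumption).
import random

def simulate_queue(arrival_rate: float, service_rate: float, steps: int = 100_000) -> float:
    """Return the time-averaged queue length of a discrete-time single-server queue."""
    queue_len, total = 0, 0
    for _ in range(steps):
        if random.random() < arrival_rate:                      # a new prompt arrives
            queue_len += 1
        if queue_len > 0 and random.random() < service_rate:    # the LLM finishes a prompt
            queue_len -= 1
        total += queue_len
    return total / steps

print(simulate_queue(0.3, 0.5))  # stable: bounded average queue
print(simulate_queue(0.6, 0.5))  # unstable: average grows with the simulation horizon
```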

Key Numerical Findings and Implications

The numerical results presented in the paper indicate a significant improvement in response accuracy when LLMs are employed in a distributed MoA configuration. The experiments detail the trade-offs between accuracy, latency, and average queue size across different MoA setups, demonstrating measurable gains in response quality as more agents are involved in the inference process. Notably, increasing the number of layers and proposers results in enhanced accuracy but also higher system latency and larger average queue sizes.
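The sketch below shows, with assumed placeholder model calls, why adding layers and proposers increases the work per query: each layer invokes every proposer before a final aggregator combines the drafts, which is the mechanism behind the observed accuracy versus latency and queue-size trade-off. It mirrors the general MoA recipe rather than the exact configurations benchmarked in the paper.

```python
# Illustrative layered MoA inference; `call_llm` is a stand-in for any open-source model call.
def call_llm(model: str, prompt: str) -> str:
    return f"[{model}: {prompt[:40]}...]"  # placeholder generation

def moa_inference(prompt: str, proposers: list[str], aggregator: str, num_layers: int) -> str:
    """Run `num_layers` rounds of proposer drafts, then a final aggregation."""
    context = prompt
    for _ in range(num_layers):
        drafts = [call_llm(m, context) for m in proposers]     # more proposers -> more work per layer
        context = prompt + "\nDrafts:\n" + "\n".join(drafts)   # each layer refines the previous drafts
    return call_llm(aggregator, context)

print(moa_inference("Explain gossip algorithms.", ["llm_a", "llm_b"], "llm_agg", num_layers=2))
```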

The theoretical implication is that accuracy can be maximized while latency remains bounded, provided the queuing stability conditions are satisfied. By explicitly characterizing these conditions, the research enables the deployment of edge-based collaborative LLMs without overwhelming available resources, paving the way for scalable applications in environments that demand responsive, autonomous systems.

Future Directions in AI

The distributed MoA approach introduces a compelling paradigm for future AI systems, particularly in edge computing and IoT networks where decentralization and efficient resource utilization are critical. Future research directions could explore the integration of more sophisticated gossip strategies for reducing communication overhead and latency, or the implementation of privacy-preserving mechanisms to safeguard user data exchanges among edge devices. Moreover, consideration of heterogeneous LLM capabilities and their optimization for varying computational resources could further enhance the efficacy of such distributed systems.

In conclusion, the paper offers a substantial contribution to the field of distributed AI by extending the collaborative potential of LLMs in edge environments, with significant implications for enhancing the scalability and robustness of real-world AI applications.

Authors (3)
  1. Purbesh Mitra (9 papers)
  2. Priyanka Kaswan (18 papers)
  3. Sennur Ulukus (258 papers)