This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs (2503.05856v1)

Published 7 Mar 2025 in cs.CL and cs.AI

Abstract: Mixture of LLM Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a $\textit{single}$ carefully-instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.

Overview of "This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs"

The paper "This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs" presents a rigorous investigation into the robustness and vulnerabilities of Mixture of Agents (MoA) architectures, specifically within the context of LLMs. This research is significant as it provides the first comprehensive paper on how these architectures handle intentional deception by deceptive agents.

MoA architectures leverage the collaboration of multiple LLM agents to achieve state-of-the-art performance, as evidenced by high rankings on benchmarks like AlpacaEval 2.0. Yet the authors raise a pressing concern: deceptive agents can compromise the integrity and reliability of the system, particularly when they deliberately provide misleading information.
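To make the setup concrete, the sketch below outlines a generic layered MoA inference loop of the kind described above: proposer agents answer in parallel, their responses are folded into the prompt for the next layer, and a final aggregator synthesizes the answer. The `query_llm` callable, the prompt template, and the layer layout are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a layered Mixture-of-Agents (MoA) pipeline.
# `query_llm` is a hypothetical stand-in for whatever inference API is used.

from typing import Callable, List

QueryFn = Callable[[str, str], str]  # (model_name, prompt) -> response


def aggregate_prompt(user_prompt: str, references: List[str]) -> str:
    """Fold previous-layer responses into the prompt for the next layer."""
    refs = "\n\n".join(f"[Reference {i + 1}]\n{r}" for i, r in enumerate(references))
    return (
        f"{user_prompt}\n\n"
        "You are given candidate responses from other assistants. "
        "Synthesize them into a single, higher-quality answer.\n\n"
        f"{refs}"
    )


def mixture_of_agents(
    user_prompt: str,
    query_llm: QueryFn,
    proposer_layers: List[List[str]],  # e.g. two layers of three proposer agents each
    aggregator_model: str,             # final model that produces the answer
) -> str:
    references: List[str] = []
    for layer in proposer_layers:
        # First layer sees only the user prompt; later layers also see prior responses.
        prompt = user_prompt if not references else aggregate_prompt(user_prompt, references)
        references = [query_llm(model, prompt) for model in layer]
    return query_llm(aggregator_model, aggregate_prompt(user_prompt, references))
```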

Key Findings

  1. Vulnerability to Deceptive Agents: A critical finding is that MoA architectures are highly susceptible to even a single deceptive agent, which can drastically reduce system performance. The 3-layer MoA architecture, which initially achieves an impressive Length-Controlled Win Rate (LC WR) of 49.2% on AlpacaEval 2.0, sees this rate plummet to 37.9% upon the introduction of a single deceptive agent. On the QuALITY multiple-choice comprehension benchmark, the impact is even more severe, with accuracy falling by 48.5%.
  2. Impact of Agent Diversity and Size: The research highlights that while agent diversity and size play significant roles in enhancing performance through varied perspectives, they also create avenues for increased vulnerability. Differences in model size within an MoA system can exacerbate the damage when deceptive agents are introduced.
  3. Decentralized Deployment and Partial Information: The decentralized nature of MoA, while beneficial for computational efficiency and diversity, is shown to introduce critical robustness issues, especially when agents have access to only partial information. This fragmentation can be exploited by malicious agents to nullify collective gains.
  4. Defense Mechanisms: Inspired by historical mechanisms like Venice's Doge election process, the authors propose several unsupervised defense strategies that mitigate the impact of deceptive agents. These strategies rely on redundancy and transparency to counterbalance undue influence and recover most of the lost performance; an illustrative sketch of this filtering idea follows the list.
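As a rough illustration of the unsupervised-filtering idea (not the authors' actual mechanism), the sketch below drops the reference response that agrees least with its peers before aggregation, limiting the leverage a single deceptive agent has over the final answer. The character-level similarity measure and the single-outlier rule are assumptions chosen for simplicity.

```python
# Illustrative only: remove the reference response that disagrees most with its
# peers before the aggregator sees it, so one deceptive agent cannot dominate.

from difflib import SequenceMatcher
from typing import List


def pairwise_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a semantic metric could be swapped in."""
    return SequenceMatcher(None, a, b).ratio()


def filter_outlier_reference(references: List[str]) -> List[str]:
    """Drop the response with the lowest average similarity to the others."""
    if len(references) < 3:
        return references  # need at least three responses for a meaningful comparison
    avg_sim = [
        sum(pairwise_similarity(r, other) for j, other in enumerate(references) if j != i)
        / (len(references) - 1)
        for i, r in enumerate(references)
    ]
    outlier = min(range(len(references)), key=lambda i: avg_sim[i])
    return [r for i, r in enumerate(references) if i != outlier]
```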

Implications and Future Directions

This paper presents both theoretical and practical implications for future applications of LLMs in collaborative environments. On the practical side, understanding and mitigating the effects of deceptive agents is crucial for the deployment of LLMs in sensitive areas such as healthcare, legal systems, and education. Theoretically, the insights into agent interactions under deception can inform the development of more resilient LLM systems.

Moving forward, further exploration into adversarial resilience in LLM architectures is warranted. Developing standardized safety evaluations and robustifying MoA systems could be pivotal in ensuring the safe and reliable deployment of AI systems in diverse domains. Additionally, the design of defense mechanisms tailored to real-world deployment conditions remains an important area for future research.

The findings of this paper underscore the necessity of balancing diversification with systemic integrity in multi-agent AI systems, ensuring that the benefits of collaboration do not come at the cost of reliability and trustworthiness.

Authors (3)
  1. Lorenz Wolf (5 papers)
  2. Sangwoong Yoon (11 papers)
  3. Ilija Bogunovic (44 papers)