The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration

Published 18 Apr 2026 in cs.CL, cs.AI, and cs.MA | (2604.17139v1)

Abstract: Multi-agent LLM architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.


Authors (5)

Summary

The paper introduces token-level round-robin collaboration to mitigate the consensus trap in LLM ensembles by enabling mid-sequence correction of adversarial errors.
It demonstrates, both theoretically and empirically, that token-level methods can tolerate over 50% adversarial corruption while maintaining cost-neutrality compared to traditional methods.
Empirical benchmarks reveal dramatic accuracy gains over majority voting, highlighting the method’s robustness against prompt injection and context corruption.

Rescuing Multi-Agent LLMs from the Consensus Trap via Token-Level Collaboration

Introduction

Recent advancements in multi-agent LLM architectures have leveraged ensemble-based reasoning, often aggregating decisions using response-level mechanisms like Majority Voting (MAJ). While such ensembles theoretically dampen individual agent hallucinations or errors, this paradigm breaks under real-world conditions where adversarial prompt injection or context corruption can induce correlated errors among agents. The paper "The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration" (2604.17139) provides a detailed theoretical and empirical analysis of this phenomenon. It introduces Token-Level Round-Robin (RR) collaboration—a method where agents iteratively construct a shared auto-regressive context by taking turns generating fixed-size token chunks. This paradigm shift robustly counters adversarial majorities and demonstrates super-majority resilience, supported by both formal dynamical system analysis and extensive benchmark evaluation.

Figure 1: Token-level round-robin collaboration enables mid-sequence correction of adversarial errors, providing robustness against consensus collapse.

The Consensus Trap in Multi-Agent LLM Aggregation

MAJ and similar response-level aggregation methods assume independence in agent errors, an assumption rarely satisfied in open environments. When correlated corruptions arise—such as shared prompt injections, latent advertising, or synchronized manipulation across API providers—even a small majority of compromised agents leads to a deterministic collapse: the ensemble’s output converges on the adversarial response. The paper formalizes this failure as the "Consensus Trap", showing via a theoretical impossibility result (Proposition: Impossibility Trinity) that no anonymous, symmetric outcome-level aggregation is robust to both minority and slight majority corruption.

Furthermore, response-level mechanisms are blind to intermediate reasoning. Once corrupt agents control more than half of the outputs, the system is guaranteed to select the compromised answer—even if honest agents’ internal logic is structurally sound.
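The collapse described above can be illustrated with a minimal sketch. It assumes a toy setting that is not taken from the paper: every honest agent returns the same correct answer, every corrupted agent returns the same injected answer, and the helper names (`majority_vote`, `ensemble_answers`) and the answer strings are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the agents' responses."""
    return Counter(answers).most_common(1)[0][0]


def ensemble_answers(n_agents, n_corrupted, honest="42", adversarial="17"):
    """Toy ensemble (assumption): corrupted agents all emit one injected answer."""
    return [adversarial] * n_corrupted + [honest] * (n_agents - n_corrupted)


# Two of five agents corrupted: the honest answer still wins the vote.
minority_case = majority_vote(ensemble_answers(5, 2))

# Three of five agents corrupted: the vote flips to the injected answer,
# regardless of how sound the honest agents' reasoning was.
majority_case = majority_vote(ensemble_answers(5, 3))

print(minority_case, majority_case)
```

Because the vote only sees final answers, nothing in the honest agents' intermediate reasoning can influence the result once the count tips.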

Token-Level Round-Robin Collaboration: Method and Theoretical Guarantees

To bypass this structural collapse, Token-Level RR collaboration reframes the aggregation problem. Instead of independently producing full outputs, agents interleave generation within a shared context, each writing a fixed chunk of $K$ tokens in turn. At each step, the agent observes the current context $h_t$ (tokens so far), then appends its contribution, handing off to the next agent. This process iterates until a complete output is produced.
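The turn-taking loop described above might look roughly like the following sketch. It is an illustration under assumptions, not the authors' implementation: an agent is modeled as a hypothetical callable that receives the shared context and a chunk budget and returns new tokens, and `round_robin_generate`, `make_toy_agent`, and the `<eos>` marker are names invented here.

```python
from typing import Callable, List, Sequence

# Hypothetical agent interface: (shared context, chunk size) -> new tokens.
Agent = Callable[[Sequence[str], int], List[str]]


def round_robin_generate(agents: Sequence[Agent], prompt: Sequence[str],
                         k: int, max_tokens: int, eos: str = "<eos>") -> List[str]:
    """Interleave agents over one shared context, k tokens per turn."""
    context = list(prompt)
    generated = 0
    turn = 0
    while generated < max_tokens:
        agent = agents[turn % len(agents)]       # fixed cyclic order
        budget = min(k, max_tokens - generated)  # last chunk may be shorter
        chunk = agent(context, budget)[:budget]  # never exceed the budget
        if not chunk:
            break                                # agent has nothing to add
        for token in chunk:
            context.append(token)
            generated += 1
            if token == eos:
                return context                   # output is complete
        turn += 1                                # hand off to the next agent
    return context


def make_toy_agent(symbol: str) -> Agent:
    """Stand-in for an LLM: always emits `symbol`, ignoring the context."""
    def agent(context: Sequence[str], k: int) -> List[str]:
        return [symbol] * k
    return agent


demo = round_robin_generate(
    [make_toy_agent("a"), make_toy_agent("b")], ["q"], k=2, max_tokens=5
)
print(demo)
```

In a real system each callable would wrap a model that conditions on the full shared context, which is what lets a later honest chunk respond to an earlier corrupted one.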

From a dynamical systems perspective, the paper models LLM generation as a trajectory through a high-dimensional latent semantic space. Honest agents act as contraction operators pulling the context toward a "truth direction," parameterized by the model’s spectral gap $\gamma_H$. Adversarial agents can only push the sequence off-course by a sycophancy-bounded perturbation, whose magnitude is tightly constrained by the context’s current state: corrupted generations that inherit enough truthful prefix are forced, by perplexity pressure, to continue the correct chain of logic.

The paper proves a Lyapunov stability theorem: for sufficiently small $K$ (i.e., frequent interventions), the restorative pull of honest agents dominates adversarial drift, even when corrupted agents are a numerical majority. The critical corruption threshold $\rho_{\text{max}}$ depends on the ratio of contraction and perturbation rates; for realistic parameterizations, systems can formally tolerate $\rho > 0.5$. Numerical examples (e.g., with $\gamma_H=0.03$, $\delta=0.004V$, $K=100$) demonstrate resilience up to 66% corrupted agents.
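To see why a corrupted majority need not win under interleaving, consider a deliberately simplified multiplicative model. This is an assumption made for illustration and not the paper's theorem. Let $V(h_t)$ be a Lyapunov-style distance of the context from the truthful trajectory. Suppose each honest chunk shrinks $V$ by at least a factor $c_H < 1$, and each corrupted chunk inflates it by at most a factor $c_A \ge 1$. The symbols $c_H$, $c_A$, $n$ (agents per cycle), and $m$ (corrupted agents per cycle) are hypothetical. Over one full cycle:

$$
V(h_{t+n}) \;\le\; c_H^{\,n-m}\, c_A^{\,m}\, V(h_t), \qquad \rho = \frac{m}{n},
$$

$$
c_H^{\,n-m}\, c_A^{\,m} < 1
\;\iff\; m \ln c_A < (n-m)\ln\frac{1}{c_H}
\;\iff\; \rho \;<\; \rho_{\max} = \frac{\ln(1/c_H)}{\ln(1/c_H) + \ln c_A}.
$$

In this toy model $\rho_{\max} > 1/2$ exactly when $c_H\, c_A < 1$, that is, when one honest chunk contracts more than one corrupted chunk expands. The paper's actual threshold, including its dependence on $K$, $\gamma_H$, and $\delta$, comes from its full analysis and is not reproduced by this sketch.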

Empirical Validation: Multi-Agent Robustness under Adversarial Corruption

Extensive experiments on a variety of LLMs (Llama-3.3-70B, Llama-4-Scout-17B, Mistral-3.1-24B, Qwen2.5-32B, Qwen3-30B), spanning mathematical, logical, and causal reasoning benchmarks, empirically validate the theoretical guarantees. The evaluation injects both persuasive and imperative (hard) adversarial payloads to a controllable subset of agents.

Key empirical findings include:

Collapse of MAJ: Once the majority is corrupted (e.g., 3c2t ensembles), MAJ accuracy on benchmarks like Track7 and GSM8K drops precipitously (for example, to $0.8\%$ for Llama-3.3-70B on Track7, with a comparable collapse for Llama-70B on GSM8K under imperative injection).
Resilience of RR: RR collaboration orchestrates substantial recoveries, with the largest absolute accuracy gains in the most challenging settings.
Cost-Neutrality: RR maintains computational cost equivalence to MAJ, as total decoding steps are unchanged—the only overhead is lightweight, parallelizable pre-fill.
Scalability: Increasing the number of RR trajectories enhances robustness, because per-trajectory accuracy crosses the probability threshold above which aggregation helps; increasing the number of MAJ samples exacerbates the failure, due to the Condorcet trap.
Ablations: Recovery is robust across chunk sizes, but chunks that are too small fragment the logic, while chunks that are too large give adversaries semantic runway.
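The scalability finding echoes the Condorcet jury theorem, which the following sketch computes directly. It assumes independent votes with one shared per-vote accuracy `p`, a simplification that the paper's correlated-corruption setting does not satisfy exactly; `majority_accuracy` and the probabilities 0.6 and 0.4 are illustrative choices, not values from the paper.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """P(strict majority of n independent votes is correct); n should be odd."""
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


sizes = (1, 5, 21)

# Per-vote accuracy above 1/2: adding votes helps.
above = [majority_accuracy(0.6, n) for n in sizes]

# Per-vote accuracy below 1/2: adding votes makes the aggregate worse.
below = [majority_accuracy(0.4, n) for n in sizes]

for n, a, b in zip(sizes, above, below):
    print(f"n={n:2d}  p=0.6 -> {a:.4f}   p=0.4 -> {b:.4f}")
```

Under these assumptions, more samples amplify whichever side of one half the per-sample accuracy falls on, which is consistent with extra MAJ samples deepening the collapse once corruption pushes per-vote accuracy below one half.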

Mechanistic Insights and Asymmetric Rescue

Analysis of log traces and heterogeneous ensembles reveals further properties:

Final Speaker Irrelevance: Accuracy does not depend on whether the final token chunk is contributed by an honest or corrupted agent; the shared context constrains the final answer.
Asymmetric Rescue: Even weak honest agents can "rescue" the sequence when collaborating with strong corrupted agents, while the reverse effect is negligible. This underlines the algorithmic dominance of logic consistency over adversarial persistence, provided the honest agent can inject verifiable constraints.

Practical Implications and Future Directions

The formal and empirical results provided in this paper illuminate a fundamental limitation of current multi-agent LLM deployment in open environments: response-level consensus mechanisms break under correlated adversarial pressure. Token-level interleaving converts aggregation from a fragile arithmetic process into a dynamically interwoven logical system, leveraging the depth of honest models to neutralize shallow manipulations. This insight has significant implications for robust LLM deployment in high-stakes applications exposed to strategic adversarial activity (e.g., financial planning, IT security, or regulated content moderation).

Future research directions highlighted include:

Adaptive Adversaries: Studying adversaries with the ability to dynamically react to shared context changes.
Dynamic Chunking: Optimizing the $K$ parameter for application-specific cost-robustness trade-offs.
Integration with Modular Scaffolding: Combining RR with expert retrieval and domain specialization, as in recent collaborative and modular LLM decoding literature.

Conclusion

Token-Level Round-Robin collaboration provides a mathematically and empirically validated solution to the consensus trap in multi-agent LLM ensembles. By enabling mid-sequence correction and tightly constraining adversarial drift through shared logical context, RR achieves super-majority resilience unattainable via response-level aggregation. These findings advocate for a paradigm shift in LLM ensemble design and open pathways for robust, scalable deployment of AI systems in adversarial, real-world environments.

References

For complete references to all cited works and supporting literature, see (2604.17139).
