Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Published 2 Apr 2026 in cs.CL and cs.MA | (2604.02460v1)

Abstract: Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent's effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper demonstrates that under equal token budgets, Single-Agent LLMs outperform Multi-Agent systems by leveraging full global context.
It employs an information-theoretic framework and rigorous empirical evaluation across models and benchmarks like FRAMES and MuSiQue.
Under severe context degradation, Multi-Agent systems can reclaim competitiveness, emphasizing the trade-off between architectural design and context efficiency.

Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Problem Formulation and Theoretical Framework

The paper investigates the comparative efficacy of single-agent systems (SAS) versus multi-agent systems (MAS) for multi-hop reasoning with LLMs, under the constraint of equal "thinking token" budgets. The analysis is motivated by the observation that many empirical MAS improvements are potentially confounded by increased compute at test time, making it unclear whether observed performance gains are truly architectural or simply the byproduct of greater resource allocation.

Using an information-theoretic argument anchored in the Data Processing Inequality (DPI), the authors show that, under a fixed reasoning-token budget, a single-agent system with full access to the global context is theoretically optimal in terms of information utilization. Specifically, for a given reasoning process, the message-passing bottlenecks in MAS architectures necessarily introduce information loss, with the result that $I(Y;C) \ge I(Y;M)$ . Fano’s inequality makes this argument tangible for prediction error: the minimal achievable error based on MAS intermediates is lower bounded by that using a full SAS trajectory.

However, when the effective context for a single agent is degraded (via deletion, masking, substitution, or distractors), MAS can partially recover task-relevant information by structuring computation and filtering out noise, potentially allowing MAS to match or exceed SAS in severely corrupted scenarios.

Experimental Methodology

The empirical section encompasses models from three major families: Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5. The main evaluation benchmarks are FRAMES and MuSiQue (restricted to 4-hop questions), focusing on multi-hop world knowledge questions with concise reference answers.

A meticulous protocol ensures matched reasoning-token budgets for all systems. SAS is implemented as a unified forward trajectory; a variant (SAS-L) encourages more "visible" internal reasoning without increasing token budget. Five MAS variants are examined: Sequential, Subtask-parallel, Parallel-roles, Debate, and Ensemble, controlling for planning and aggregation overhead. The primary MAS baseline is Sequential, which structurally mimics SAS but with explicit decomposition and stepwise message-passing.

The evaluation metric uses an LLM-as-judge rubric: outputs are scored for semantic equivalence to reference answers by a dedicated model, controlling for formatting and paraphrasing.

Main Results: SAS vs. MAS Under Matched Budgets

Across all families, SAS matches or outperforms all MAS variants when computation is tightly controlled. In no regime (except severe context degradation) does MAS realize a generalized architectural advantage over the SAS baseline under equal token budgets. On Gemini models, the SAS-L variant can further close gaps previously attributed to MAS, particularly by unlocking underutilized internal reasoning channels.

Figure 2: MuSiQue 4-hop answer accuracy for multiple Gemini model versions (unlimited thinking tokens): SAS is always at least competitive with Sequential MAS, with both improving with scaling.

Detailed ablations with Gemini model versions (Figure 1) show the SAS is consistently competitive or superior to Sequential MAS in every API release, with accuracy scaling monotonically with model quality but not with MAS adoption.

In detection of API artifacts, the paper reveals that visible reasoning with SAS is often truncated by the underlying serving infrastructure, and that thought token accounting using Gemini APIs can be highly unreliable. This has critical implications for reproducibility and fidelity in compute-bound comparisons.

Context Degradation and Robustness

Under controlled context degradation, the theoretical predictions are experimentally borne out: as the effective context available to the SAS is intentionally corrupted, MAS approaches (particularly those with greater diversity or verification, such as Debate) become competitive and, under heavy masking or substitution, clearly surpass SAS.

Figure 1: Context degradation (masking, substitution, deletion, distractors) on MuSiQue 4-hop with Qwen3-30B-A3B at 1000-token budget—SAS dominates at low degradation but MAS overtakes at highest corruption levels.

The transition point—the α or k value for which MAS outperforms—depends strongly on the corruption method. For deletions and distractors, SAS maintains its edge longer; for masking/substitution (where misleading content is injected), MAS gains a clear advantage at high degradation, validating the DPI-based theoretical argument.

Additional analyses show MAS architectures do not yield inherent gains even with extended compute or compositional diversity; any improvements track directly to expanded resource usage or more elaborate context filtering—not to any fundamental MAS design property.

Implications and Future Directions

The core result is a direct challenge to the prevailing assumption that multi-agent orchestration is inherently beneficial for complex LLM reasoning under compute-equated conditions. Across settings, the majority of MAS performance improvements in the literature are accounted for by unbalanced compute and context utilization, not architectural invention.

From a theoretical perspective, this shifts the focus to optimizing context and compute allocation within SAS, as orchestration and explicit communication in MAS architectures can only compensate when single-agent access to task-relevant context is severely bottlenecked by context degradation. MAS should be reserved for scenarios where attention and memory are provably the limiting factors, or where robust distributed filtering is required.

Practically, these findings advise greater care in interpreting MAS benchmarks for real-world deployment. For compute-constrained or latency-critical settings, optimal allocation is to maximize intra-agent reasoning depth rather than architecturally fragmenting the problem. For hybrid or degraded contexts, future work should investigate adaptive/hierarchical approaches that blend SAS and MAS strengths.

Experimentally, the findings highlight the necessity of precise token accounting and robust context auditing, particularly for API-mediated models where visible computation and internal inference may diverge.

Conclusion

The paper provides a comprehensive, rigorously controlled comparison of SAS and MAS settings for multi-hop LLM reasoning under normalized thinking token budgets. Theoretical and empirical evidence robustly supports the claim that SAS is always at least as competitive as any MAS analog in the computation-matched setting, except in the regime of extreme context degradation. These results call for a re-examination of the comparative regimes in which multi-agent orchestration is justified, and underscore the priority of resource and context efficiency over architectural proliferation in next-generation LLM system design.

Markdown Report Issue