- The paper introduces AgentDropout, which dynamically optimizes multi-agent LLM collaboration by eliminating redundant nodes and edges.
- It employs a two-stage strategy using trainable adjacency matrices and policy gradients to balance task performance with token efficiency.
- Experiments on reasoning, math, and code generation tasks show notable gains, including an average 21.6% reduction in prompt token usage alongside improved performance.
LLM-based Multi-Agent Systems (MAS) frequently encounter challenges related to communication overhead and suboptimal task performance. Redundancy in communication, both in terms of unnecessary information exchange (edges) and the participation of non-critical agents (nodes) at certain stages, contributes to these issues. The paper introduces AgentDropout, a methodology designed to dynamically optimize the communication topology of MAS by identifying and eliminating redundant agents and communication links across different rounds (2503.18891). This approach draws inspiration from management theory, where team roles are often adjusted dynamically for efficiency.
Methodology: AgentDropout
AgentDropout employs a two-stage optimization process to learn a sparse, effective communication graph represented by weighted adjacency matrices. The goal is to maximize task performance while minimizing token consumption through the elimination of less relevant agents and connections.
Node Dropout
The initial phase focuses on identifying and removing agents (nodes) whose contributions are minimal within specific communication rounds.
- Graph Representation: The communication structure is modeled as a weighted graph with trainable intra-round ($\tilde{\mathcal{A}}_{\text{intra}}$) and inter-round ($\tilde{\mathcal{A}}_{\text{inter}}$) adjacency matrices. Initial weights are typically set uniformly (e.g., 0.5).
- Optimization for Performance: The intra-round matrices $\tilde{\mathcal{A}}_{\text{intra}}$ are trained to maximize the expected task performance $\mu(G)$, where $G$ is the communication graph sampled based on the matrices. Since performance metrics (e.g., accuracy on benchmarks) are often non-differentiable with respect to the graph structure, an unbiased policy gradient estimator is utilized for optimization:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{G \sim P_{\theta}(G)}\left[\nabla_{\theta} \log P_{\theta}(G)\,(\mu(G) - b)\right]$$

Here, $\theta$ represents the parameters of the adjacency matrices, $P_{\theta}(G)$ is the probability of sampling graph $G$, $\mu(G)$ is the performance on graph $G$, and $b$ is a baseline to reduce variance.
- Node Identification: After training $\tilde{\mathcal{A}}_{\text{intra}}$, the weighted in-degree $d_{\text{in}}(v,t)$ and out-degree $d_{\text{out}}(v,t)$ are calculated for each node $v$ in each round $t$. The total degree $d(v,t) = d_{\text{in}}(v,t) + d_{\text{out}}(v,t)$ serves as an indicator of the node's importance in that round.
- Node Elimination: Nodes are ranked based on their total degree $d(v,t)$ within each round. A fixed proportion $\alpha$ of nodes with the lowest degrees are designated as dropout nodes for their respective rounds. These nodes, along with their incident edges, are removed, resulting in updated adjacency matrices. The selection criterion is:

$$\text{DropNode}(v,t) \iff \text{rank}(d(v,t)) \leq \alpha \times N$$

where $N$ is the total number of agents.
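The Node Dropout stage above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evaluate` is a hypothetical placeholder for the non-differentiable task metric $\mu(G)$, and the sample count, learning rate, and clipping bounds are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 3          # agents, communication rounds
alpha = 0.2          # fraction of nodes dropped per round

# Trainable intra-round adjacency weights, initialized uniformly at 0.5.
A_intra = np.full((T, N, N), 0.5)

def evaluate(G):
    # Hypothetical stand-in for the task performance mu(G).
    return G.mean()

def reinforce_step(A, lr=0.1, n_samples=8):
    """One policy-gradient update: E[grad log P(G) * (mu(G) - b)]."""
    grads, scores = [], []
    for _ in range(n_samples):
        G = (rng.random(A.shape) < A).astype(float)    # Bernoulli edge sampling
        # d/dA log Bernoulli(G; A) = G/A - (1-G)/(1-A)
        grads.append(G / A - (1 - G) / (1 - A))
        scores.append(evaluate(G))
    b = np.mean(scores)                                # variance-reducing baseline
    g = np.mean([gr * (s - b) for gr, s in zip(grads, scores)], axis=0)
    return np.clip(A + lr * g, 0.05, 0.95)             # keep probabilities valid

for _ in range(50):
    A_intra = reinforce_step(A_intra)

# Node identification: total weighted degree d(v,t) = d_in + d_out.
degree = A_intra.sum(axis=2) + A_intra.sum(axis=1)     # shape (T, N)
k = int(alpha * N)
drop = np.argsort(degree, axis=1)[:, :k]               # lowest-degree nodes per round

# Node elimination: remove dropout nodes and their incident edges per round.
for t in range(T):
    A_intra[t, drop[t], :] = 0.0                       # outgoing edges
    A_intra[t, :, drop[t]] = 0.0                       # incoming edges
```

Note that a node dropped in round $t$ remains active in other rounds, reflecting the round-wise (rather than global) nature of the elimination.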
Edge Dropout
Following Node Dropout, the second phase targets the removal of redundant communication links (edges).
- Re-initialization and Training: The adjacency matrices ($\tilde{\mathcal{A}}_{\text{intra}}$ and $\tilde{\mathcal{A}}_{\text{inter}}$), with the nodes removed by Node Dropout excluded, are re-initialized and trained again from scratch.
- Optimization for Performance and Sparsity: The optimization objective now incorporates both task performance $\mu(G)$ and communication efficiency. Efficiency is promoted by adding a low-rank sparsity regularization term to the objective function. The objective becomes:

$$\max_{\theta} \; \mathbb{E}_{G \sim P_{\theta}(G)}[\mu(G)] - \lambda \sum_{A \in \{\tilde{\mathcal{A}}_{\text{intra}}, \tilde{\mathcal{A}}_{\text{inter}}\}} \text{rank}(A)$$

where $\lambda$ is a hyperparameter balancing performance and sparsity. The rank function, being NP-hard to optimize directly, is approximated using the nuclear norm $\|A\|_*$, which serves as a convex relaxation:

$$\max_{\theta} \; \mathbb{E}_{G \sim P_{\theta}(G)}[\mu(G)] - \lambda \sum_{A \in \{\tilde{\mathcal{A}}_{\text{intra}}, \tilde{\mathcal{A}}_{\text{inter}}\}} \|A\|_*$$

The performance term $\mathbb{E}_{G \sim P_{\theta}(G)}[\mu(G)]$ is optimized using policy gradients as before.
- Edge Identification and Elimination: After training, edges corresponding to the lowest weights in the optimized matrices $\tilde{\mathcal{A}}_{\text{intra}}$ and $\tilde{\mathcal{A}}_{\text{inter}}$ are pruned. A proportion $\beta$ of edges with the smallest weights are removed. The criterion for edge $(u,v)$ at round $t$, represented by weight $w_{uv,t}$, is:

$$\text{DropEdge}(u,v,t) \iff \text{rank}(w_{uv,t}) \leq \beta \times M$$

where $M$ is the total number of potential edges.
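The sparsity side of this stage can be sketched as below. This is a simplified fragment, not the paper's full training loop: it assumes the performance term has already been handled by the policy-gradient updates shown earlier, and isolates the nuclear-norm penalty (whose subgradient at $A = U\Sigma V^{\top}$ is $UV^{\top}$) and the final $\beta$-fraction pruning; the matrix size, $\lambda$, and $\beta$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(0.3, 0.7, size=(5, 5))   # one trained weighted adjacency matrix
lam, beta = 0.01, 0.25                   # sparsity weight, fraction of edges pruned

def nuclear_norm_grad(A):
    """Subgradient of the nuclear norm ||A||_* (convex surrogate for rank)."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

# Gradient descent on the penalty term lam * ||A||_* alone
# (in the full objective this is combined with the policy-gradient step).
for _ in range(20):
    A = np.clip(A - lam * nuclear_norm_grad(A), 0.0, 1.0)

# Edge elimination: drop the beta fraction of edges with the smallest weights.
w = np.sort(A.flatten())
m = int(beta * A.size)                   # number of edges to remove
threshold = w[m]                         # weights ranked <= beta*M fall below this
A_pruned = np.where(A < threshold, 0.0, A)
```

The nuclear norm pushes the weight matrix toward low rank, which concentrates communication on a few influential links and makes the subsequent threshold pruning less damaging to performance.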
- Final Graph Sampling: The resulting doubly-pruned weighted adjacency matrices define the probability distribution for sampling the final communication graph $\hat{G}$ during inference using the DAGSample algorithm. This algorithm ensures the sampled graph is a Directed Acyclic Graph (DAG), preventing cyclical dependencies.
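A simple way to enforce the DAG constraint during sampling is sketched below. This is a generic acyclicity-preserving scheme, not necessarily the paper's exact DAGSample procedure: it draws a random topological order and keeps only Bernoulli-sampled edges that point "forward" in that order, which guarantees the result has no cycles.

```python
import numpy as np

rng = np.random.default_rng(2)

def dag_sample(A):
    """Sample a DAG from edge-probability matrix A by keeping forward edges only."""
    n = A.shape[0]
    order = rng.permutation(n)                     # random topological order
    pos = np.empty(n, dtype=int)
    pos[order] = np.arange(n)                      # node -> position in the order
    G = (rng.random(A.shape) < A).astype(int)      # Bernoulli edge sampling
    forward = pos[:, None] < pos[None, :]          # u precedes v in the order
    return G * forward                             # drop backward edges => acyclic

A = np.full((4, 4), 0.8)                           # illustrative edge probabilities
G = dag_sample(A)
```

Acyclicity can be checked by noting that the adjacency matrix of a DAG on $n$ nodes is nilpotent: $G^{n} = 0$, since no path can be longer than $n-1$ edges.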
Experimental Results
AgentDropout was evaluated on reasoning (MMLU), mathematics (GSM8K, AQuA, MultiArith, SVAMP), and code generation (HumanEval) tasks using Llama3-8B, Qwen2.5-72B, and Deepseek-V3-671B as base LLMs.
- Task Performance: AgentDropout demonstrated consistent performance improvements over baselines including single-LLM inference, Chain-of-Thought (CoT), standard multi-round MAS (MAS_T), and AgentPrune (a state-of-the-art edge-pruning method). With Llama3-8B, AgentDropout achieved an average performance gain of 1.14 points across benchmarks compared to AgentPrune. The method also improved performance stability, particularly with the smaller Llama3-8B model.
- Token Consumption: Significant reductions in both prompt and completion tokens were observed. Compared to AgentPrune, AgentDropout achieved average reductions of 21.6% in prompt tokens and 18.4% in completion tokens. Specific figures for Llama3-8B (averaged across tasks) show AgentDropout using 3.3M prompt tokens and 839K completion tokens, compared to AgentPrune's 4.2M and 1.0M, respectively. Similar reductions were observed for larger models.
Robustness and Transferability
- Structure Robustness: The effectiveness of AgentDropout was shown to be robust to variations in the initial communication graph structure (e.g., fully connected, layered, random). Optimized topologies derived from different initial structures yielded comparable performance and efficiency gains.
- Domain Transferability: The communication topology learned by AgentDropout on one dataset (e.g., AQuA) exhibited strong transfer performance when applied to other datasets within the same domain (e.g., GSM8K, MultiArith, SVAMP). This suggests the learned pruning strategies capture generalizable collaborative patterns relevant to the task type (mathematical reasoning), reducing the need for extensive tuning on every new dataset.
- Ablation Studies: Ablations confirmed the necessity of both Node Dropout and Edge Dropout stages. Applying only one stage resulted in inferior performance or efficiency compared to the full AgentDropout method. Furthermore, the learned dropout strategy significantly outperformed random node/edge dropout, validating the effectiveness of the optimization process.
In conclusion, AgentDropout presents a novel approach for optimizing LLM-based MAS by dynamically eliminating both redundant agents and communication links based on learned contributions across different stages of problem-solving. The method yields substantial improvements in token efficiency and notable gains in task performance, demonstrating robustness and transferability across tasks and initial structures. This technique offers a practical way to enhance the feasibility and effectiveness of multi-agent collaboration using LLMs.