MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

Published 16 Apr 2026 in cs.AI and cs.CL | (2604.14564v1)

Abstract: Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, which remains constrained by the single-agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS$^2$ (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. MARS$^2$ models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS$^2$ consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.


Authors (10)

Summary

The paper introduces MARS², a unified multi-agent RL framework that scales tree search to overcome exploration limits in code generation.
The paper leverages adaptive Thompson sampling and hierarchical reward shaping to enhance both exploration diversity and training stability.
The paper demonstrates robust performance gains, with up to +8.0 Pass@1 improvement on benchmarks compared to single-agent baselines.

MARS $^2$: Multi-Agent Reinforced Tree Search for Code Generation

Motivation and Framework

The paper introduces MARS $^2$ (Multi-Agent Reinforced Tree Search Scaling), a unified reinforcement learning (RL) framework designed to address two critical bottlenecks in RL-based code generation: (1) the exploration limitations imposed by single-policy training, which lead to local optima and restricted trajectory diversity, and (2) the lack of principled integration between multi-agent collaboration and structured search mechanisms. Traditional approaches using Monte Carlo Tree Search (MCTS)-augmented RL typically employ a single policy to guide the search tree, inherently narrowing the exploration scope as learning progresses. In contrast, prior multi-agent RL (MARL) methods for LLM reasoning often decouple agent interactions from structured, informed search processes, relying primarily on non-hierarchical or dialogue-based coordination.

MARS $^2$ proposes a structured tree-search environment shared among multiple independently-optimized policies. Each agent’s trajectory propagates through the search tree via agent-responsive Thompson sampling over agent-node pairs. Reward assignment is performed with a novel path-level, tree-consistent reward shaping mechanism that distributes credit not only at the global tree level but also along hierarchical (parent, child, and sibling) relationships, encouraging both vertical solution refinement and horizontal diversity.
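The Thompson sampling step over agent-node pairs can be illustrated with a minimal sketch. It assumes a Beta-Bernoulli posterior per pair and rewards in [0, 1]; the summary does not specify the posterior family, so `BetaArm`, `select_pair`, and the agent and node names below are hypothetical and are not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaArm:
    """Beta posterior over the success rate of one (agent, node) pair (assumed form)."""
    successes: float = 1.0  # alpha; 1.0 corresponds to a uniform prior
    failures: float = 1.0   # beta

    def sample(self, rng):
        # Draw one plausible success rate from the current posterior.
        return rng.betavariate(self.successes, self.failures)

    def update(self, reward):
        # Treat a reward in [0, 1] as a fractional success.
        self.successes += reward
        self.failures += 1.0 - reward


def select_pair(arms, rng):
    """Thompson sampling: draw once from every arm and expand the argmax pair."""
    return max(arms, key=lambda pair: arms[pair].sample(rng))


# Hypothetical usage: two agents, two candidate nodes, uniform priors.
rng = random.Random(0)
arms = {(agent, node): BetaArm()
        for agent in ("agent_a", "agent_b")
        for node in ("root", "child_0")}
agent, node = select_pair(arms, rng)
arms[(agent, node)].update(1.0)  # e.g. the expanded node passed its tests
```

Because each pair keeps its own posterior, an agent that tends to succeed when refining a particular node is selected for that node more often, while pairs with few observations still receive occasional draws.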

Figure 1: Overview of the MARS $^2$ framework, showing multi-agent expansion of a shared search tree and hierarchical reward shaping.

Methodological Contributions

The core methodological ingredients of MARS $^2$ are:

Collaborative Multi-Agent Tree Search: Agents cooperate in a shared, learnable tree-structured environment. Tree expansion is controlled by Thompson sampling applied both to agent and node selection among the eligible set, allowing for adaptive exploration that reflects the strengths of heterogeneous agents.
Path-Level Group Advantage and Hierarchical Reward Shaping: The reward for each node is adjusted by a mixed baseline derived from the parent’s reward and the average reward of sibling nodes, with a tunable parameter $\lambda$ determining the vertical and horizontal contribution. The agent’s training objective extends the classic Group Relative Policy Optimization (GRPO) advantage estimator with hierarchical credit signals, increasing stability and effectiveness of policy optimization.
Independent but Coordinated Optimization: Each agent’s policy parameters are optimized using rewards from nodes it generates and refines. The sampling and update protocol is designed to avoid sample imbalance, even under stochastic rollout allocation, facilitated by an asynchronous buffer-based update mechanism.
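The reward shaping and group advantage described above can be sketched as follows. The exact formulas are not given in this summary, so the form used here is an assumption: the baseline is a convex combination of the parent reward and the mean sibling reward weighted by $\lambda$, and the advantage is the usual GRPO-style group standardization. The function names, the fallback for nodes without siblings, and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev


def shaped_reward(reward, parent_reward, sibling_rewards, lam=0.5):
    """Subtract a mixed baseline from a node's raw reward (assumed form).

    lam weights the vertical (parent) term; 1 - lam weights the
    horizontal (sibling) term.
    """
    # Assumption: a node with no siblings falls back to its parent's reward.
    sibling_mean = mean(sibling_rewards) if sibling_rewards else parent_reward
    baseline = lam * parent_reward + (1.0 - lam) * sibling_mean
    return reward - baseline


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical usage: a child that improves on its parent and on its siblings.
r = shaped_reward(1.0, parent_reward=0.5, sibling_rewards=[0.0, 1.0], lam=0.5)
advantages = group_advantages([r, -0.5, 0.0])
```

Under this form, setting `lam` close to 1 rewards a node mainly for improving on its parent, while setting it close to 0 rewards it mainly for outperforming its siblings.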

Empirical Evaluation on Code Generation

MARS $^2$ is evaluated on LiveCodeBench v6, using open-source code-centric and general LLMs of 8B and 14B parameters, covering both homogeneous and heterogeneous agent ensembles. Baselines include:

Vanilla GRPO (standard RL policy optimization without structured tree search)
RS $^2$ (single-agent reinforced tree search with tree-level credit assignment but no multi-agent diversity)
Various combinations of MARS $^2$ (with paired agents and with an intentionally weaker agent to test robustness).

All methods are trained and evaluated under controlled data and compute budgets, using the same MCTS-based inference protocol with a fixed node budget.

Results demonstrate:

Across all models and scales, MARS $^2$ produces strong absolute Pass@1 gains (up to +8.0 points) over base models, exceeding both GRPO and RS $^2$ by several points.
System-level collaboration (ensemble inference) sees robust improvements in both Pass@1(MCTS) and Pass@N, indicating enhanced exploration and solution diversity not attainable with single-agent counterparts.
Single-agent RS $^2$ slightly lags behind MARS $^2$, plateauing as its search-based trajectories become increasingly concentrated (exploration saturation).
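For readers unfamiliar with the Pass@k family of metrics reported above, the following sketch shows the commonly used unbiased estimator computed from n sampled solutions of which c are correct. This summary does not state how Pass@1(MCTS) or Pass@N are computed in the paper, so the estimator and the name `pass_at_k` are given only as an assumed reference point.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n generated solutions with c correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical usage: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n.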

Figure 2: Pass@1 accuracy over training steps for MARS $^2$ training on Qwen-8B + AReaL-8B and Qwen-14B + AReaL-14B model pairs.

Robustness to Agent Heterogeneity and Diversity Analysis

Ablation studies probe the effect of injecting a substantially weaker agent (DeepCoder-14B) into a 14B-scale ensemble. The addition of a weak trajectory source reduces the individual Pass@1 gains for strong agents, but system-level performance (Pass@1(MCTS), Pass@N) remains robust, indicating MARS $^2$'s capacity to aggregate complementary exploratory signals and benefit from agent diversity despite skill imbalance.

Figure 3: Impact of adding a weaker agent (DeepCoder-14B) into a 14B-scale MARS $^2$ ensemble.

Diversity-centric evaluation uses a comprehensive suite of metrics (e.g., Average Embedding Clusters, DA@K, Effective Algorithms, G-Vendi) to measure solution space coverage and structural/algorithmic diversity. MARS $^2$ consistently leads across most diversity measures, supporting the claim that its performance gains stem from richer, more effective exploration rather than repeated exploitation of high-reward trajectories.

Role of Reward Shaping and Training Stability

MARS $^2$'s tree-consistent reward shaping is empirically shown to substantially enhance training stability and accelerate convergence. Without reward shaping, models exhibit delayed and less stable improvements, whereas structured credit propagation yields dense and aligned signal flow, improving both convergence rates and final performance. Sensitivity analysis of the mixing parameter $\lambda$ confirms an optimal regime where both vertical and horizontal signals are balanced.

Figure 4: Reward shaping in MARS $^2$ stabilizes training and yields consistently superior Pass@1 curves.

Generalization to Other Reasoning Tasks

MARS $^2$'s core innovations are not domain-specific; single-agent RS $^2$ transfers successfully to mathematical reasoning benchmarks (e.g., MATH dataset), yielding consistent Pass@1(MCTS) improvements over both base and GRPO models, with analogous properties observed in exploration gain and reward scaling.

Implications, Limitations, and Future Directions

MARS $^2$ advances the methodology of RL-based reasoning in LLMs by integrating multi-agent exploration and structured credit assignment within a unified tree search environment. This approach directly addresses key limitations of single-policy and unstructured multi-agent RL, offering:

Scalable mechanisms for improving solution diversity and search depth
Robustness to agent heterogeneity in practical settings where ensembles may be asymmetric
Improved training stability and convergence dynamics without reliance on increased data or compute budgets

However, the sequential and agent-interleaved nature of the collaborative tree search induces overhead in rollout parallelism, potentially increasing wall-clock training time compared to flat, fully parallel single-agent methods. While this is inherent to the approach's structured exploration design, future work may focus on further efficiency improvements or asynchronous/parallelized MCTS variants without loss of collaborative fidelity.

Theoretically, the framework generalizes to other settings where hierarchical decomposition of credit and diverse policy biases are both possible and beneficial. Extensions to broader classes of reasoning tasks, richer agent architectures, and harder search environments (including adversarial or constrained optimization regimes) are immediate directions.

Conclusion

MARS $^2$ provides a formal, empirically validated pathway for coupling structured multi-agent search with reinforcement learning in code generation tasks. Experimental results on diverse LLMs substantiate its superiority over both single-agent search-augmented RL and previous MARL approaches. The integration of path-level group advantage and tree-consistent shaping marks a significant methodological development in the design of scalable, robust, and diverse reasoning systems. The framework's promise extends to practical augmentation of complex code and reasoning AI, with general applicability to a wide range of structured, multi-step decision-making problems.