- The paper introduces MARS², a unified multi-agent RL framework that scales tree search to overcome exploration limits in code generation.
- The paper leverages adaptive Thompson sampling and hierarchical reward shaping to enhance both exploration diversity and training stability.
- The paper demonstrates robust performance gains, with up to +8.0 Pass@1 points over base models and consistent improvements over single-agent baselines.
MARS2: Multi-Agent Reinforced Tree Search for Code Generation
Motivation and Framework
The paper introduces MARS2 (Multi-Agent Reinforced Tree Search Scaling), a unified reinforcement learning (RL) framework designed to address two critical bottlenecks in RL-based code generation: (1) the exploration limitations imposed by single-policy training, which lead to local optima and restricted trajectory diversity, and (2) the lack of principled integration between multi-agent collaboration and structured search mechanisms. Traditional approaches using Monte Carlo Tree Search (MCTS)-augmented RL typically employ a single policy to guide the search tree, inherently narrowing the exploration scope as learning progresses. In contrast, prior multi-agent RL (MARL) methods for LLM reasoning often decouple agent interactions from structured, informed search processes, relying primarily on non-hierarchical or dialogue-based coordination.
MARS2 proposes a structured tree-search environment shared among multiple independently optimized policies. Each agent’s trajectory propagates through the search tree via agent-responsive Thompson sampling over agent-node pairs. Reward assignment is performed with a novel path-level, tree-consistent reward shaping mechanism that distributes credit not only at the global tree level but also along hierarchical (parent, child, and sibling) relationships, encouraging both vertical solution refinement and horizontal diversity.
Figure 1: Overview of the MARS2 framework, showing multi-agent expansion of a shared search tree and hierarchical reward shaping.
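The agent-node selection step described above can be sketched as Thompson sampling with one Beta posterior per (agent, node) pair. This is a minimal illustration rather than the paper's implementation: the Beta-Bernoulli model, the `PairStats` class, and the `select_pair` and `update` helpers are assumptions introduced here, and rewards are assumed to lie in [0, 1].

```python
import random
from collections import defaultdict


class PairStats:
    """Beta(alpha, beta) posterior for one (agent, node) pair (hypothetical model)."""

    def __init__(self):
        self.alpha = 1.0  # prior pseudo-count of successes
        self.beta = 1.0   # prior pseudo-count of failures


def select_pair(stats, agents, nodes, rng):
    """Draw one sample per (agent, node) pair and return the pair with the largest draw."""
    best_pair, best_draw = None, -1.0
    for agent in agents:
        for node in nodes:
            s = stats[(agent, node)]
            draw = rng.betavariate(s.alpha, s.beta)
            if draw > best_draw:
                best_pair, best_draw = (agent, node), draw
    return best_pair


def update(stats, pair, reward):
    """Fold a reward in [0, 1] into the pair's posterior (assumed reward range)."""
    s = stats[pair]
    s.alpha += reward
    s.beta += 1.0 - reward
```

A caller would keep `stats = defaultdict(PairStats)`, call `select_pair` to decide which agent expands which eligible node, and call `update` once the resulting child has been scored. Pairs with few observations keep wide posteriors, so they are still sampled occasionally, which is the exploration behavior the summary attributes to Thompson sampling.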
Methodological Contributions
The core methodological ingredients of MARS2 are:
- Collaborative Multi-Agent Tree Search: Agents cooperate in a shared, learnable tree-structured environment. Tree expansion is controlled by Thompson sampling applied both to agent and node selection among the eligible set, allowing for adaptive exploration that reflects the strengths of heterogeneous agents.
- Path-Level Group Advantage and Hierarchical Reward Shaping: The reward for each node is adjusted by a mixed baseline derived from the parent’s reward and the average reward of sibling nodes, with a tunable parameter λ determining the vertical and horizontal contribution. The agent’s training objective extends the classic Group Relative Policy Optimization (GRPO) advantage estimator with hierarchical credit signals, increasing stability and effectiveness of policy optimization.
- Independent but Coordinated Optimization: Each agent’s policy parameters are optimized using rewards from nodes it generates and refines. The sampling and update protocol is designed to avoid sample imbalance, even under stochastic rollout allocation, facilitated by an asynchronous buffer-based update mechanism.
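The mixed baseline and the group-relative advantage described in the list above can be sketched as follows. The summary does not give the exact formula, so several choices here are assumptions: that λ weights the parent term and (1 - λ) the sibling term, that siblings exclude the node itself, that a node without siblings falls back to the parent reward, and that the advantage uses the usual GRPO-style mean and standard deviation normalization. The function names are hypothetical.

```python
from statistics import mean, pstdev


def shaped_reward(reward, parent_reward, sibling_rewards, lam=0.5):
    """Subtract a mixed baseline: lam * parent (vertical) + (1 - lam) * sibling mean (horizontal)."""
    # Assumption: with no siblings, the horizontal term falls back to the parent reward.
    sibling_mean = mean(sibling_rewards) if sibling_rewards else parent_reward
    baseline = lam * parent_reward + (1.0 - lam) * sibling_mean
    return reward - baseline


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center by the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, a node is credited for improving on its parent (vertical refinement) and for outperforming its siblings (horizontal diversity), and `group_advantages` would then be applied to the shaped rewards collected along an agent's paths.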
Empirical Evaluation on Code Generation
MARS2 is evaluated on LiveCodeBench v6, using open-source code-centric and general LLMs of 8B and 14B parameters, covering both homogeneous and heterogeneous agent ensembles. Baselines include:
- Vanilla GRPO (standard RL policy optimization without structured tree search)
- RS2 (single-agent reinforced tree search with tree-level credit assignment but no multi-agent diversity)
- MARS2 variants (paired agents, plus an ensemble that includes an intentionally weaker agent to test robustness)
All methods are trained and evaluated under controlled data and compute budgets, using the same MCTS-based inference protocol with a fixed node budget.
Results demonstrate:
- Across all models and scales, MARS2 produces strong absolute Pass@1 gains (up to +8.0 points) over base models, exceeding both GRPO and RS2 by several points.
- System-level collaboration (ensemble inference) sees robust improvements in both Pass@1(MCTS) and Pass@N, indicating enhanced exploration and solution diversity not attainable with single-agent counterparts.
- Single-agent RS2 slightly lags behind MARS2, plateauing as its search-based trajectories become increasingly concentrated (exploration saturation).
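For readers unfamiliar with the Pass@k family of metrics used in these results, the sketch below shows the widely used unbiased pass@k estimator computed from n sampled solutions of which c pass all tests. The summary does not state how Pass@1(MCTS) or Pass@N are computed, so this is a general reference point and an assumption, not the paper's evaluation code; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes, given c of n passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimator reduces to the fraction of passing samples, c / n.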



Figure 2: Pass@1 accuracy over training steps for MARS2 training on Qwen-8B + AReaL-8B and Qwen-14B + AReaL-14B model pairs.
Robustness to Agent Heterogeneity and Diversity Analysis
Ablation studies probe the effect of injecting a substantially weaker agent (DeepCoder-14B) into a 14B-scale ensemble. The addition of a weak trajectory source reduces the individual Pass@1 gains for strong agents, but system-level performance (Pass@1(MCTS), Pass@N) remains robust, indicating MARS2's capacity to aggregate complementary exploratory signals and benefit from agent diversity despite skill imbalance.
Figure 3: Impact of adding a weaker agent (DeepCoder-14B) into a 14B-scale MARS2 ensemble.
Diversity-centric evaluation uses a comprehensive suite of metrics (e.g., Average Embedding Clusters, DA@K, Effective Algorithms, G-Vendi) to measure solution space coverage and structural/algorithmic diversity. MARS2 consistently leads across most diversity measures, supporting the claim that its performance gains stem from richer, more effective exploration rather than repeated exploitation of high-reward trajectories.
Role of Reward Shaping and Training Stability
MARS2's tree-consistent reward shaping is empirically shown to substantially enhance training stability and accelerate convergence. Without reward shaping, models exhibit delayed and less stable improvements, whereas structured credit propagation yields dense and aligned signal flow, improving both convergence rates and final performance. Sensitivity analysis of the mixing parameter λ confirms an optimal regime where both vertical and horizontal signals are balanced.
Figure 4: Reward shaping in MARS2 stabilizes training and yields consistently superior Pass@1 curves.
Generalization to Other Reasoning Tasks
MARS2's core innovations are not domain-specific; single-agent RS2 transfers successfully to mathematical reasoning benchmarks (e.g., MATH dataset), yielding consistent Pass@1(MCTS) improvements over both base and GRPO models, with analogous properties observed in exploration gain and reward scaling.
Implications, Limitations, and Future Directions
MARS2 advances the methodology of RL-based reasoning in LLMs by integrating multi-agent exploration and structured credit assignment within a unified tree search environment. This approach directly addresses key limitations of single-policy and unstructured multi-agent RL, offering:
- Scalable mechanisms for improving solution diversity and search depth
- Robustness to agent heterogeneity in practical settings where ensembles may be asymmetric
- Improved training stability and convergence dynamics without reliance on increased data or compute budgets
However, the sequential and agent-interleaved nature of the collaborative tree search induces overhead in rollout parallelism, potentially increasing wall-clock training time compared to flat, fully parallel single-agent methods. While this is inherent to the approach's structured exploration design, future work may focus on further efficiency improvements or asynchronous/parallelized MCTS variants without loss of collaborative fidelity.
Theoretically, the framework generalizes to other settings where hierarchical decomposition of credit and diverse policy biases are both possible and beneficial. Extensions to broader classes of reasoning tasks, richer agent architectures, and harder search environments (including adversarial or constrained optimization regimes) are immediate directions.
Conclusion
MARS2 provides a formal, empirically validated pathway for coupling structured multi-agent search with reinforcement learning in code generation tasks. Experimental results on diverse LLMs substantiate its superiority over both single-agent search-augmented RL and previous MARL approaches. The integration of path-level group advantage and tree-consistent shaping marks a significant methodological development in the design of scalable, robust, and diverse reasoning systems. The framework's promise extends to practical augmentation of complex code and reasoning AI, with general applicability to a wide range of structured, multi-step decision-making problems.