Collective Monte Carlo Tree Search (CoMCTS)

Updated 2 January 2026
  • Collective MCTS (CoMCTS) is an advanced method that extends traditional MCTS by aggregating contributions from multiple agents, models, or belief states to enable cooperative planning.
  • It integrates collective expansion, simulation, and backpropagation by combining diverse evaluations and policies, enhancing exploration and decision-making efficiency.
  • CoMCTS has been successfully applied in multimodal LLM reasoning, multi-robot planning, and cooperative pathfinding, demonstrating state-of-the-art performance under uncertainty.

Collective Monte Carlo Tree Search (CoMCTS) denotes a family of extensions to classical Monte Carlo Tree Search (MCTS) that enable collaborative reasoning, planning, or search over a combinatorial space by leveraging information, policies, or value estimates from multiple agents, models, or belief states. While the term "collective" can index heterogeneous mechanisms (e.g., multi-agent planning, belief aggregation, multimodal model pooling), the unifying feature is that the tree search—its expansions, evaluations, or updates—explicitly aggregates contributions from several sources, contrasting with the single-agent, single-model nature of conventional MCTS. CoMCTS has achieved state-of-the-art performance in diverse domains, including multi-model stepwise reasoning for large language and multimodal models, multi-robot coverage path planning, cooperative trajectory planning in uncertain environments, and multi-agent pathfinding (Yao et al., 2024, Kurzer et al., 2018, Stegmaier et al., 2022, Hyatt et al., 2020, Pitanov et al., 2023).

1. Canonical Workflow

CoMCTS generalizes the classic four-phase MCTS loop—Selection, Expansion, Simulation, and Backpropagation—by enabling each search operation to incorporate input from multiple agents, models, or sampled scenarios. The canonical workflow is characterized by:

  • Collective Expansion: At every expansion step, candidate actions, reasoning steps, or trajectories are generated by multiple policies or models, fostering diversity and broader search.
  • Collective Simulation/Assessment: Rollouts or evaluations are performed in parallel or as aggregates, allowing errors to be located and branches to be pruned collectively.
  • Collective Backpropagation: Node statistics (visit counts, value estimates) are updated using evaluations aggregated across agents, models, or scenarios.
  • Collective Selection: Tree traversal uses value or uncertainty statistics reflecting the composite evidence of the group, enabling robust exploration/exploitation.

The structural form of the aggregation can be nontrivial—ranging from averaging correctness probabilities across models (Yao et al., 2024) to constructing empirical return distributions from belief samples (Stegmaier et al., 2022), to decentralized UCT updates in multi-agent Dec-MDPs (Kurzer et al., 2018).

2. Mathematical Formalism and Algorithms

While MCTS operates on a single agent's policy $\pi$ and value function $V$, CoMCTS introduces a set of $K$ policies or agents $\{\pi_1, \dots, \pi_K\}$, or a finite ensemble of sampled root states in belief-space variants. The generic formalism is as follows:

For model-pooling CoMCTS, as in MLLM reasoning (Yao et al., 2024):

  • Each node $s$ in the reasoning tree stores:

    • Visit count $N(s)$,
    • Estimated value $V(s)$,
    • A reward based on model pooling:

    $R(s) = \frac{1}{K} \sum_{k=1}^{K} \pi_k\bigl(\text{``Is this step correct?''} \mid Q,\ \text{path to } s\bigr)$

  • UCB selection, with $\hat{s}$ the parent of $s$:

    $\mathrm{UCB}(s) = V(s) + c\sqrt{\frac{\ln N(\hat{s})}{1 + N(s)}}$
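The two model-pooling statistics can be computed directly. Below is a small sketch assuming toy "judge" callables in place of the $K$ policy models; the function names are illustrative, not from the paper.

```python
# Hedged sketch of the model-pooling reward and UCB score above. The judge
# lambdas are toy stand-ins for K policy models scoring step correctness.
import math

def pooled_reward(judges, question, path):
    """R(s) = (1/K) * sum_k pi_k('Is this step correct?' | Q, path to s)."""
    scores = [judge(question, path) for judge in judges]
    return sum(scores) / len(scores)

def ucb(value, n_node, n_parent, c=1.4):
    """UCB(s) = V(s) + c * sqrt(ln N(parent) / (1 + N(s)))."""
    return value + c * math.sqrt(math.log(n_parent) / (1 + n_node))

# Example: three mock judges scoring one reasoning step.
judges = [lambda q, p: 0.9, lambda q, p: 0.7, lambda q, p: 0.8]
r = pooled_reward(judges, "Q", ["step1"])   # (0.9 + 0.7 + 0.8) / 3 = 0.8
```

Averaging across judges means a step must look plausible to the whole pool, which is what drives the collective pruning of weak branches.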

For multi-agent, decentralized settings (Kurzer et al., 2018):

  • Each agent $i$ maintains $Q^i(s, a^i)$ and $N(s, a^i)$, while all agents share the same reward and transition model.
  • Selection at node $s$ for agent $i$:

    $a^i = \arg\max_{a \in A^i(s)} \left[ Q^i(s,a) + C\sqrt{\frac{\ln N(s)}{N(s,a)}} \right]$

  • The cooperative reward can be parameterized by a cooperation factor $\lambda^i$.
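The decoupled selection rule above is straightforward to implement per agent; the following minimal sketch uses dictionaries for the marginal statistics (names and the driving-style actions are illustrative assumptions).

```python
# Sketch of decoupled-UCT selection: one agent picks its own action from its
# marginal Q^i and N statistics. Data structures here are illustrative.
import math

def decoupled_uct(Q, N_sa, N_s, actions, C=1.0):
    """argmax_a [ Q^i(s,a) + C * sqrt(ln N(s) / N(s,a)) ] for one agent."""
    def score(a):
        if N_sa[a] == 0:
            return float("inf")         # force at least one visit per action
        return Q[a] + C * math.sqrt(math.log(N_s) / N_sa[a])
    return max(actions, key=score)

# Example: the under-visited action wins despite its lower value estimate.
choice = decoupled_uct({"keep": 0.6, "yield": 0.4},
                       {"keep": 10, "yield": 2}, 12, ["keep", "yield"])
```

Each agent runs this rule on its own marginals, while the shared reward from joint rollouts couples the agents' value estimates.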

For belief-space/planning under uncertainty (Stegmaier et al., 2022):

  • A root belief $\mathcal{B}_0 = \mathcal{N}(\mu, \Sigma)$ over the state.
  • $k$ start states are sampled from the belief, each spawning a separate tree.
  • Final action selection fuses the per-tree return distributions using kernel regression.
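The fusion step can be illustrated with Nadaraya-Watson kernel regression over pooled (action, return) samples from the per-tree searches; the Gaussian kernel, bandwidth, and sample data below are assumptions for illustration.

```python
# Illustrative sketch of fusing per-tree return samples with kernel regression:
# a Gaussian kernel weights each tree's (action, return) pair by its distance
# to the queried action. Kernel choice and bandwidth are assumed, not sourced.
import math

def kernel_regression(samples, a_query, bandwidth=0.5):
    """Estimate the return of action a_query from pooled (action, return) pairs."""
    num = den = 0.0
    for a, ret in samples:
        w = math.exp(-((a - a_query) ** 2) / (2 * bandwidth ** 2))
        num += w * ret
        den += w
    return num / den if den > 0 else 0.0
```

Nearby actions borrow strength from each other's samples, producing the continuous empirical return estimates that the risk metrics then score.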

Algorithmic variants differ in their strategies for collective action selection, reward construction, progressive widening, rollouts, kernel-based updates, and risk-aware selection metrics.

3. Major Domains and Mechanisms

Multimodal LLM Reasoning

CoMCTS is central to the Mulberry MLLMs (Yao et al., 2024), where $K$ heterogeneous policy models expand candidate reasoning chains in parallel, jointly judge substep correctness, and prune suboptimal paths, yielding efficient and diverse exploration of reasoning routes. Each step is scored by the average model probability of correctness, and backpropagation aggregates these scores up the tree; selection uses the standard UCB criterion over the collectively updated statistics. This procedure enabled construction of the Mulberry-260k dataset and subsequent supervised fine-tuning (CoSFT), leading to empirical gains over both direct prompting and single-model MCTS baselines.

Multi-Agent and Multi-Robot Planning

In cooperative robotics (Hyatt et al., 2020), each robot executes its own MCTS instance but simulates other robots along their latest best trajectories, enabling distributed yet coordinated planning. The joint effects are realized by factoring in other agents' predicted moves during rollouts, without explicit enumeration of the joint action space. Branching is controlled by per-agent decision trees, and performance matches or exceeds established approaches under various objectives.
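The coordination mechanism above—evaluating one's own candidate path while replaying teammates' latest committed plans—can be sketched as a simple coverage reward; the grid-cell domain and function name are illustrative assumptions.

```python
# Minimal sketch of per-robot evaluation with teammates' plans replayed: cells
# already claimed by other robots' latest best paths earn no reward, so each
# robot's MCTS is steered toward complementary coverage. Domain is illustrative.
def coverage_reward(own_path, others_best_paths):
    claimed = set()
    for path in others_best_paths:       # replay teammates' committed plans
        claimed.update(path)
    new_cells = {c for c in own_path if c not in claimed}
    return len(new_cells)                # reward only newly covered cells
```

No joint action space is ever enumerated: each robot searches alone, but its rollout rewards already account for the others.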

In cooperative pathfinding (Pitanov et al., 2023), a depth-$n$ decomposition over agents in the MCTS tree drastically reduces branching (from $|A|^n$ per joint node to $|A|$ per level, for $n$ agents), while subgoal rewards and joint action simulation yield improved cooperative completion rates compared to baseline A* or naive joint MCTS.
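The branching reduction above is just arithmetic, but it is worth making concrete: decomposing one joint decision into $n$ sequential per-agent levels trades one node with $|A|^n$ children for $n$ levels of $|A|$ children each.

```python
# Arithmetic check of the branching reduction from agent-wise decomposition.
def joint_branching(num_actions, n_agents):
    return num_actions ** n_agents       # children of one naive joint-action node

def decomposed_branching(num_actions, n_agents):
    return num_actions * n_agents        # |A| choices at each of n tree levels
```

For example, 4 agents with 5 actions each give 625 joint successors versus 20 nodes across the decomposed levels.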

Cooperative Planning under Uncertainty

CoMCTS for trajectory planning in uncertain environments (Stegmaier et al., 2022) relies on sampling root states from a belief distribution, then growing parallel MCTS trees from these roots. Kernel regression fuses action return estimates across trees into continuous empirical distributions, which are then scored by risk-sensitive functionals (KRLCB, CVaR). This method improves robustness and safety in automated vehicle planning under sensor and intent uncertainty.
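A risk-sensitive final-selection functional like the CVaR scoring mentioned above can be sketched as follows; the discrete empirical-sample form and all names are illustrative assumptions (KRLCB is analogous but uses a kernel-regression lower confidence bound).

```python
# Sketch of risk-sensitive action selection over fused empirical return
# distributions: CVaR scores each action by the mean of its worst
# alpha-fraction of returns, so rare catastrophic outcomes dominate.
def cvar(returns, alpha=0.2):
    """Mean of the worst alpha-fraction of an empirical return sample."""
    worst = sorted(returns)[: max(1, int(len(returns) * alpha))]
    return sum(worst) / len(worst)

def select_action(per_action_returns, alpha=0.2):
    """Pick the action whose fused return distribution has the best CVaR."""
    return max(per_action_returns, key=lambda a: cvar(per_action_returns[a], alpha))
```

A mostly-rewarding but occasionally catastrophic action loses to a modest, safe one under this criterion, which is the intended behavior for automated-vehicle planning.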

Decentralized Model-based Planning

In decentralized settings (Kurzer et al., 2018), CoMCTS (with Decoupled-UCT) enables each agent to locally optimize over actions using its marginal value estimates, but global outcomes reflect the interdependent evolution from joint rollouts and shared rewards. Practical enhancements include progressive widening in continuous spaces and kernel-based value smoothing.

4. Representative Algorithms and Pseudocode Structures

Across applications, the CoMCTS algorithm retains the core four-phase loop, but augments Expansion, Simulation, and Backpropagation with aggregation logic. The following table provides a stylized comparison of fundamental algorithmic steps in selected CoMCTS instantiations:

| Application Domain | Aggregation Mechanism | Expansion/Simulation Strategy |
| --- | --- | --- |
| MLLM Reasoning (Yao et al., 2024) | Average correctness over $K$ policies | Parallel expansion, pruning by mean score |
| Multi-Robot Coverage (Hyatt et al., 2020) | Replay best paths of other agents | Per-agent trees, rollouts with others' latest plans |
| Traj. Planning under Uncertainty (Stegmaier et al., 2022) | Kernel regression fusion over trees | Multiple trees from belief, risk-based selection |
| Decentralized Vehicles (Kurzer et al., 2018) | Decoupled UCT, shared reward | Progressive widening, local grouping |

Algorithmic pseudocode examples are presented in (Yao et al., 2024, Hyatt et al., 2020, Stegmaier et al., 2022, Kurzer et al., 2018), consistently structuring collective decision-making at expansion and evaluation phases.

5. Empirical Performance and Evaluation

CoMCTS consistently outperforms or matches baseline methods in key domains, notably achieving:

  • Multimodal Reasoning (Yao et al., 2024): On benchmarks such as MathVista and MMMU, the Mulberry models using CoMCTS achieved consistent gains (e.g., +4.2 pp and +7.5 pp over base models) and the highest search success rates (e.g., 80.2% versus 58.2–66.2% for baselines), with reduced average iterations.
  • Multi-Robot Coverage (Hyatt et al., 2020): Comparable or better completion times versus Boustrophedon planners across varying team sizes, with Pareto-efficient tradeoffs (coverage time vs turn minimization).
  • Uncertain Trajectory Planning (Stegmaier et al., 2022): Robust success rates near 100% in simple and complex scenarios when employing risk-sensitive final selection, even under noisy sensor conditions (baseline performance drops significantly without collective handling).
  • Multi-Agent Pathfinding (Pitanov et al., 2023): Subgoal-based CoMCTS achieves high agent and episode success rates (e.g., ISR 1.00 for 4 agents, 0.90 for 16); computation time for 16 agents is ~4.2 s per move versus 12.1 s for naive joint MCTS.

6. Strengths, Limitations, and Future Directions

CoMCTS strengths include:

  • Diversity and Robustness: Pooling across models or agents mitigates local minima and single-agent biases, increasing solution diversity and error correction (Yao et al., 2024).
  • Efficiency: Collective pruning and value-sharing concentrate computational effort on promising branches (Yao et al., 2024, Stegmaier et al., 2022).
  • Reflective/Corrective Learning: Negative sibling nodes and risk-aware selection support self-correcting, robust reasoning.
  • Scalable Extension to Uncertainty: Belief-based variants directly incorporate uncertainty quantification and risk metrics (Stegmaier et al., 2022).

Limitations and open problems include:

  • Computational Demands: Running $K$ models or maintaining multiple search trees increases hardware and latency requirements (Yao et al., 2024, Stegmaier et al., 2022).
  • Dependence on Model Quality: If the ensemble is homogeneously weak or biased, collective aggregation does not improve solution quality (Yao et al., 2024).
  • Handling Heterogeneous or Dynamic Ensembles: Most formulations use a fixed $K$; adapting to dynamically varying model pools or agent counts requires further study (Yao et al., 2024).
  • Safety Guarantees: CoMCTS does not by itself ensure formal guarantees of safety, especially critical in autonomous systems (Kurzer et al., 2018).
  • Hyperparameter Sensitivity: Kernel bandwidths, progressive widening exponents, and risk-aversion coefficients require domain-specific tuning (Stegmaier et al., 2022, Kurzer et al., 2018).

Active directions include integrating symbolic experts or hybrid policies, dynamic model selection, further multimodal generalization, and reinforcement learning policy distillation on CoMCTS trajectories (Yao et al., 2024).

While the term "CoMCTS" is not universally applied, cooperative or collective versions of MCTS—under alternate names such as MAMCTS, Decoupled-UCT, or belief-ensemble MCTS—have independently emerged in combinatorial search, robotics, multi-agent planning, and learning-to-reason research (Yao et al., 2024, Hyatt et al., 2020, Pitanov et al., 2023, Stegmaier et al., 2022, Kurzer et al., 2018). Shared characteristics include decomposition of action selection for branching control, aggregation of values/statistics, and variable coupling of agent policies. CoMCTS is distinguished from classical central or purely decentralized tree search by explicit, formal aggregation of information across models, agents, or scenarios at every critical phase of the search.


References

  • "Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search" (Yao et al., 2024)
  • "A Versatile Multi-Robot Monte Carlo Tree Search Planner for On-Line Coverage Path Planning" (Hyatt et al., 2020)
  • "Monte-Carlo Tree Search for Multi-Agent Pathfinding: Preliminary Results" (Pitanov et al., 2023)
  • "Cooperative Trajectory Planning in Uncertain Environments with Monte Carlo Tree Search and Risk Metrics" (Stegmaier et al., 2022)
  • "Decentralized Cooperative Planning for Automated Vehicles with Continuous Monte Carlo Tree Search" (Kurzer et al., 2018)
