Choosing between MCTS and expansive tree-based sampling in MALT

Determine whether Monte Carlo Tree Search (MCTS) or expansive tree-based sampling should be used as the trajectory generation method within the MALT multi-agent LLM training pipeline, and identify the conditions under which each approach is more effective and efficient for synthetic data generation and search over the generator–verifier–refiner sequence.

Background

MALT generates synthetic data by expanding a tree of reasoning trajectories with a fixed branching factor at each of three heterogeneous agents (generator, verifier, and refinement model). Values are propagated from leaf nodes (final answers) back through intermediate nodes to label branches for preference training. The authors currently employ an expansive tree-based sampling strategy because the tree depth is limited and data collection is performed offline.
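A minimal sketch of this expansive sampling and value propagation is given below. The interfaces are assumptions for illustration: `generator`, `verifier`, `refiner`, and `score_answer` are hypothetical callables standing in for the three agents and the reward on final answers, and the branching factor is illustrative. Leaf values are averaged upward to label intermediate nodes, roughly as described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str                      # output produced by an agent at this step
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0             # propagated value used to label the branch


def expansive_sample(
    question: str,
    generator: Callable[[str], str],
    verifier: Callable[[str, str], str],
    refiner: Callable[[str, str, str], str],
    score_answer: Callable[[str], float],
    branching: int = 3,
) -> Node:
    """Expand the full generator -> verifier -> refiner tree (depth 3)."""
    root = Node(text=question)
    for _ in range(branching):                      # branch over generations
        gen = Node(text=generator(question))
        for _ in range(branching):                  # branch over critiques
            ver = Node(text=verifier(question, gen.text))
            for _ in range(branching):              # branch over refinements
                answer = refiner(question, gen.text, ver.text)
                ver.children.append(Node(text=answer, value=score_answer(answer)))
            # Propagate leaf values upward by averaging over children.
            ver.value = sum(c.value for c in ver.children) / len(ver.children)
            gen.children.append(ver)
        gen.value = sum(c.value for c in gen.children) / len(gen.children)
        root.children.append(gen)
    root.value = sum(c.value for c in root.children) / len(root.children)
    return root
```

Note that the number of leaves grows as the cube of the branching factor, which is affordable only because the tree depth is fixed at three agent steps.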

In the discussion of design choices, the authors note that alternative search strategies such as Monte Carlo Tree Search (MCTS) may offer different trade-offs in exploration, computational cost, and sample efficiency. They explicitly state that deciding between MCTS and expansive sampling remains unresolved, highlighting the need for a principled criterion to select the search method for trajectory generation in multi-agent LLM training.
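For contrast, the core of a standard MCTS loop is a UCT-style selection step, sketched below. This is a generic sketch, not the authors' method: `MCTSNode`, `select_child`, `backpropagate`, and the exploration constant `c_uct` are assumed names for illustration. Unlike the exhaustive expansion above, visit counts and accumulated values steer compute toward promising branches, trading broader coverage for sample efficiency.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MCTSNode:
    text: str
    children: List["MCTSNode"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0


def select_child(node: MCTSNode, c_uct: float = 1.4) -> MCTSNode:
    """Pick the child maximizing the UCT score."""
    def uct(child: MCTSNode) -> float:
        if child.visits == 0:
            return float("inf")          # expand unvisited children first
        exploit = child.total_value / child.visits
        explore = c_uct * math.sqrt(math.log(max(node.visits, 1)) / child.visits)
        return exploit + explore
    return max(node.children, key=uct)


def backpropagate(path: List[MCTSNode], leaf_value: float) -> None:
    """Update visit counts and values along the selected root-to-leaf path."""
    for n in path:
        n.visits += 1
        n.total_value += leaf_value
```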

References

We also leave the choice between MCTS and an expansive tree-based sampling strategy as an open problem.

Motwani et al., "MALT: Improving Reasoning with Multi-Agent LLM Training," arXiv:2412.01928, 2 Dec 2024, Section 6 (Discussion).