Multiple Policy Value MCTS (MPV-MCTS)
- MPV-MCTS is an extension of PV-MCTS that integrates two policy-value networks to jointly enhance search breadth and evaluative accuracy.
- It interleaves rapid simulations from a fast 'small net' with selective high-fidelity evaluations from a 'large net' using a shared tree framework.
- Empirical studies on 9×9 NoGo and AlphaZero self-play show significant Elo gains and a training acceleration of approximately 2× over conventional methods.
Multiple Policy Value Monte Carlo Tree Search (MPV-MCTS) is an extension of policy value Monte Carlo Tree Search (PV-MCTS) that interleaves simulations using multiple policy-value neural networks (PV-NNs) of differing computational cost and predictive accuracy. In this approach, two PV-NNs, a “small net” $f_S$ and a “large net” $f_L$, are jointly leveraged within a shared tree expansion and backup mechanism. This framework enables the agent to benefit from both high simulation throughput and improved policy/value estimation, offering a principled balance between search breadth and evaluative accuracy. Empirical studies on the 9×9 NoGo domain demonstrate statistically significant gains over conventional PV-MCTS and enhanced efficiency in AlphaZero-style self-play training (Lan et al., 2019).
1. Network Architectures, Training, and Intuition
MPV-MCTS employs two distinct PV-NNs, each grounded in AlphaZero-style residual towers:
- $f_S$ (“small net”):
- Supervised training: 200K NoGo self-play games.
- AlphaZero (AZ) training: 800 simulations per move, PUCT exploration, replay buffer of 100K positions, 500K SGD steps on 60 self-play GPUs + 4 training GPUs.
- $f_L$ (“large net”):
- Architecture: 128 filters, 10 residual blocks.
- Cost: one $f_L$ evaluation defines one unit of normalized simulation budget.
- Training: identical data and AZ regimen, with a separate weight schedule.
The design rationale is that $f_S$ enables rapid rollouts and broader search, while $f_L$ provides more reliable policy priors and state-value estimates. MPV-MCTS fuses their outputs through coordinated search and aggregation.
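To make the cost asymmetry concrete, here is a minimal sketch of the two network configurations. The source specifies 128 filters / 10 blocks only for $f_L$; the small-net size and the `blocks * filters^2` cost model below are illustrative assumptions, not figures from the paper.

```python
from dataclasses import dataclass

@dataclass
class PVNetConfig:
    """Residual-tower PV-NN configuration (filters per conv layer, residual blocks)."""
    filters: int
    blocks: int

    def relative_cost(self, baseline: "PVNetConfig") -> float:
        # Rough compute ratio: conv-tower cost scales ~ blocks * filters^2.
        # This scaling law is an assumption for illustration only.
        return (self.blocks * self.filters ** 2) / (baseline.blocks * baseline.filters ** 2)

f_large = PVNetConfig(filters=128, blocks=10)  # size given in the source
f_small = PVNetConfig(filters=64, blocks=5)    # assumed small-net size
```

Under this rough model, one small-net evaluation costs a small fraction of a large-net evaluation, which is what buys the extra simulation throughput.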
2. Mathematical Fusion and Shared-Tree Framework
MPV-MCTS grows two parallel trees, $T_S$ and $T_L$, sharing priors and value estimates at co-visited nodes. Given a state $s$, the networks output policy priors and values $f_S(s) = (p_S(s), v_S(s))$ and $f_L(s) = (p_L(s), v_L(s))$.
The combined policy and value used in selection and backup are

$$p(s) = w_S\, p_S(s) + w_L\, p_L(s), \qquad v(s) = w_S\, v_S(s) + w_L\, v_L(s),$$

where $w_S, w_L \ge 0$ and $w_S + w_L = 1$.
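The weighted fusion of the two networks' outputs can be sketched as follows; the equal mixing weight and the toy policy/value numbers are illustrative assumptions, not values from the source.

```python
def fuse(p_small, v_small, p_large, v_large, w_small=0.5):
    """Blend policy priors and value estimates from the two PV-NNs.

    w_small is an illustrative mixing weight (w_small + w_large = 1);
    policies are dicts mapping action -> prior probability.
    """
    w_large = 1.0 - w_small
    policy = {a: w_small * p_small[a] + w_large * p_large[a] for a in p_small}
    value = w_small * v_small + w_large * v_large
    return policy, value

# Toy example for a 3-action state: p, v are the fused prior and value.
p, v = fuse({0: 0.6, 1: 0.3, 2: 0.1}, 0.2,
            {0: 0.4, 1: 0.4, 2: 0.2}, 0.4)
```

Because each input policy is a distribution and the weights sum to one, the fused policy remains a valid distribution.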
Selection in each tree follows the PUCT formula:

$$a^* = \arg\max_{a} \left[ Q(s,a) + c_{\mathrm{puct}}\, p(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right]$$
Upon evaluation of a leaf node $s_\ell$ with result $v(s_\ell)$, a standard MCTS backup is performed along the selection path:

$$N(s,a) \leftarrow N(s,a) + 1, \qquad Q(s,a) \leftarrow Q(s,a) + \frac{v(s_\ell) - Q(s,a)}{N(s,a)}$$

Updates from $T_S$ and $T_L$ reinforce each other through the shared $N$ and $Q$ values.
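The selection and backup steps can be sketched in a few lines; the node representation (a dict of per-action statistics) and the exploration constant are illustrative choices.

```python
import math

def puct_select(node, c_puct=1.5):
    """Pick the action maximizing Q + c * p * sqrt(N_parent) / (1 + N_child).

    node maps action -> {"q": mean value, "n": visit count, "p": prior};
    c_puct is an illustrative exploration constant.
    """
    total_n = sum(child["n"] for child in node.values())
    def score(child):
        return child["q"] + c_puct * child["p"] * math.sqrt(total_n) / (1 + child["n"])
    return max(node, key=lambda a: score(node[a]))

def backup(path, value):
    """Standard MCTS backup: increment visits, move Q toward the running mean."""
    for child in path:
        child["n"] += 1
        child["q"] += (value - child["q"]) / child["n"]

# Two actions with equal priors: the unvisited one wins on the exploration term.
root = {0: {"q": 0.0, "n": 1, "p": 0.5},
        1: {"q": 0.0, "n": 0, "p": 0.5}}
best = puct_select(root)
backup([root[best]], value=1.0)
```

Incremental-mean backup is what makes the shared $N$ and $Q$ statistics cheap to update from either tree.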
3. Simulation Scheduling and Algorithmic Details
Given simulation budgets $b_S$ and $b_L$ for $f_S$ and $f_L$ respectively, the algorithm interleaves $b_S$ simulations with $f_S$ and $b_L$ simulations with $f_L$. The default scheduling samples a $b_S/(b_S+b_L)$ fraction of the iterations uniformly at random for the small net, the remainder for the large net.
For $f_S$ simulations:
- SELECT in $T_S$ via PUCT to obtain a leaf $s_\ell$.
- EXPAND & EVALUATE: $(p_S(s_\ell), v_S(s_\ell)) = f_S(s_\ell)$.
- BACKUP in $T_S$ and update shared statistics.
For $f_L$ simulations:
- Select the unevaluated leaf $s_\ell$ in $T_L$ with the highest $N_S(s_\ell)$ (visit count from $T_S$); fall back to PUCT if no such node exists.
- EXPAND & EVALUATE: $(p_L(s_\ell), v_L(s_\ell)) = f_L(s_\ell)$.
- BACKUP in $T_L$ and update shared statistics.
The normalized action distribution for play is $\pi(a \mid s_0) \propto N(s_0, a)^{1/\tau}$, for root state $s_0$ and temperature $\tau$.
Alternative scheduling (e.g., round-robin, or front-loading the $f_S$ simulations) is permitted, provided budget constraints are respected.
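The default random schedule and the play distribution above can be sketched as follows; the function names, the budget-proportional sampling of remaining simulations, and the default temperature are illustrative assumptions.

```python
import random

def mpv_schedule(b_small, b_large, seed=0):
    """Draw an interleaving of small-/large-net simulations.

    Each step is given to the small net with probability proportional to
    its remaining budget, mirroring the uniform-random default schedule.
    Returns the sequence of net labels ("S" or "L").
    """
    rng = random.Random(seed)
    order = []
    while b_small or b_large:
        if b_small and (not b_large or rng.random() < b_small / (b_small + b_large)):
            b_small -= 1
            order.append("S")
        else:
            b_large -= 1
            order.append("L")
    return order

def play_distribution(visit_counts, tau=1.0):
    """Normalized action distribution: pi(a) proportional to N(a)^(1/tau)."""
    powered = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    z = sum(powered.values())
    return {a: x / z for a, x in powered.items()}
```

With `tau -> 0` the distribution sharpens toward the most-visited action; `tau = 1` plays proportionally to visit counts.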
4. Balancing Exploration and High-Fidelity Estimation
Breadth of the search is attributed to $f_S$: its low computational cost allows extensive rollouts, constructing a large tree $T_S$. Accuracy, in contrast, comes from $f_L$ via selective, more costly evaluations targeted at the most promising regions, as indicated by $T_S$ through the $N_S$ priority.
Budget allocation is guided by a parameter $\lambda \in [0,1]$: $b_S = \lambda b$ and $b_L = (1-\lambda) b$ for total budget $b$. Empirically, an intermediate $\lambda$ yielded the best performance in NoGo. The mixing weights $w_S, w_L$ can be tuned to favor stronger networks (e.g., increasing $w_L$ to bias toward $f_L$).
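A worked example of the budget split, in cost-normalized units: one large-net evaluation defines the unit, while the small-net per-simulation cost of 0.25 below is an assumption for illustration, not a figure from the source.

```python
def split_budget(total_units, lam, small_cost=0.25, large_cost=1.0):
    """Convert a normalized budget into per-net simulation counts.

    lam is the allocation parameter; per-simulation costs are illustrative
    (one large-net evaluation defines the unit of budget).
    """
    b_small = int(lam * total_units / small_cost)        # many cheap simulations
    b_large = int((1 - lam) * total_units / large_cost)  # fewer costly ones
    return b_small, b_large
```

For example, an even split of 100 units buys 200 small-net simulations but only 50 large-net ones, which is exactly the breadth-versus-fidelity trade the section describes.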
5. Experimental Results: NoGo and AlphaZero Self-Play
Performance was validated using NoGo and AlphaZero self-play training protocols:
a) Supervised NoGo, 9×9:
- Budgets measured in normalized units (one $f_L$ evaluation = one unit).
- $f_S$ alone: lower peak Elo.
- $f_L$ alone: higher peak Elo than $f_S$, but below the combined search.
- MPV-MCTS ($f_S$ + $f_L$): highest peak Elo, a clear gain over large-only. Outperforms all intermediate-sized nets.
b) AZ-trained PV-NNs:
- $800$ self-play simulations per move, $100$K replay buffer.
- At eight training checkpoints, evaluate $f_S$ alone, $f_L$ alone, and MPV-MCTS.
- MPV-MCTS consistently exceeds both constituent nets by tens of Elo across all test budgets.
c) AlphaZero training with MPV-MCTS:
- Self-play: fixed simulation budget per move.
- Baselines: $f_L$-only with $200$, $400$, $800$ sims/move.
- Equalized “generated-game” budget: one $f_L$ simulation costs $1$ unit, one $f_S$ simulation a fraction of a unit, and one MPV-MCTS move a matched combined cost.
- After $2$M games:
- Best $f_L$-only baseline: peak Elo under both the $200$-sim and $800$-sim test settings.
- MPV-MCTS: higher Elo under both test settings, a consistent margin over the baseline.
- MPV-MCTS trained for $N$ games outperforms $f_L$-only trained for $2N$ games, with 51.2–56.6% win rates, approximately a $2\times$ acceleration in training.
6. Generalization, Extensions, and Practical Implications
MPV-MCTS is a general multi-teacher extension of PV-MCTS; any set of PV-NNs with differing accuracy/computational cost can be utilized. The scheme permits:
- A curriculum of incrementally larger “support” nets, with meta-learned mixing weights and simulation scheduling.
- Refinement of the priority function, such as PUCT-based selection, discounted parent visits, or hybrid heuristics.
- Deployment of “micro-nets” (e.g., 32 filters) under tight constraints; ablation indicates that the tree search driven by $f_S$ is critical regardless of individual network strength.
- Applicability to any turn-based game (e.g., chess, shogi, Hex) or continuous-action RL task, leveraging a fast “sampler” network alongside a slower “accurate” one.
- Ensemble and knowledge-distillation approaches: smaller nets may acquire high-quality targets from the Q-values of larger nets, potentially accelerating neural convergence.
In sum, MPV-MCTS realizes a theoretically principled and empirically validated division of labor between rapid search and high-fidelity evaluation, substantially enhancing both online performance and the efficiency of reinforcement learning in complex domains (Lan et al., 2019).