Multiple Policy Value MCTS (MPV-MCTS)
- MPV-MCTS is an extension of PV-MCTS that integrates two policy-value networks to jointly enhance search breadth and evaluative accuracy.
- It interleaves rapid simulations from a fast 'small net' with selective high-fidelity evaluations from a 'large net' using a shared tree framework.
- Empirical studies on 9×9 NoGo and AlphaZero self-play show significant Elo gains and a training acceleration of approximately 2× over conventional methods.
Multiple Policy Value Monte Carlo Tree Search (MPV-MCTS) is an extension of policy value Monte Carlo Tree Search (PV-MCTS) that interleaves simulations using multiple policy-value neural networks (PV-NNs) of differing computational cost and predictive accuracy. In this approach, two PV-NNs, a “small net” $f_S$ and a “large net” $f_L$, are jointly leveraged within a shared tree expansion and backup mechanism. This framework enables the agent to benefit from both high simulation throughput and improved policy/value estimation, offering a principled balance between search breadth and evaluative accuracy. Empirical studies on the 9×9 NoGo domain demonstrate statistically significant gains over conventional PV-MCTS and enhanced efficiency in AlphaZero-style self-play training (Lan et al., 2019).
1. Network Architectures, Training, and Intuition
MPV-MCTS employs two distinct PV-NNs, each grounded in AlphaZero-style residual towers:
- $f_S$ (“small net”):
- Supervised training: 200K NoGo self-play games.
- AlphaZero (AZ) training: 800 simulations per move, PUCT exploration, replay buffer of 100K positions, 500K SGD steps on 60 self-play GPUs + 4 training GPUs.
- $f_L$ (“large net”):
- Architecture: 128 filters, 10 residual blocks.
- Cost: one $f_L$ evaluation defines one unit of normalized simulation budget.
- Training: identical data and AZ regimen, with a separate weight schedule.
The design rationale is that $f_S$ enables rapid rollouts and broader search, while $f_L$ provides more reliable policy priors and state-value estimates. MPV-MCTS fuses their outputs through coordinated search and aggregation.
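To make the cost asymmetry concrete, here is a minimal sketch of the two network configurations. The source specifies 128 filters / 10 blocks only for $f_L$; the small-net size and the `blocks * filters^2` cost model below are illustrative assumptions, not figures from the paper.

```python
from dataclasses import dataclass

@dataclass
class PVNetConfig:
    """Residual-tower PV-NN configuration (filters per conv layer, residual blocks)."""
    filters: int
    blocks: int

    def relative_cost(self, baseline: "PVNetConfig") -> float:
        # Rough compute ratio: conv-tower cost scales ~ blocks * filters^2.
        # This scaling law is an assumption for illustration only.
        return (self.blocks * self.filters ** 2) / (baseline.blocks * baseline.filters ** 2)

f_large = PVNetConfig(filters=128, blocks=10)  # size given in the source
f_small = PVNetConfig(filters=64, blocks=5)    # assumed small-net size
```

Under this rough model, one small-net evaluation costs a small fraction of a large-net evaluation, which is what buys the extra simulation throughput.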
2. Mathematical Fusion and Shared-Tree Framework
MPV-MCTS grows two parallel trees, $T_S$ and $T_L$, sharing priors and value estimates at co-visited nodes. Given a state $s$, the networks output policy priors and values $f_S(s) = (p_S(s), v_S(s))$ and $f_L(s) = (p_L(s), v_L(s))$.
The combined policy and value used in selection and backup are

$$p(s) = w_S\, p_S(s) + w_L\, p_L(s), \qquad v(s) = w_S\, v_S(s) + w_L\, v_L(s),$$

where $w_S, w_L \ge 0$ and $w_S + w_L = 1$.
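The weighted fusion of the two networks' outputs can be sketched as follows; the equal mixing weight and the toy policy/value numbers are illustrative assumptions, not values from the source.

```python
def fuse(p_small, v_small, p_large, v_large, w_small=0.5):
    """Blend policy priors and value estimates from the two PV-NNs.

    w_small is an illustrative mixing weight (w_small + w_large = 1);
    policies are dicts mapping action -> prior probability.
    """
    w_large = 1.0 - w_small
    policy = {a: w_small * p_small[a] + w_large * p_large[a] for a in p_small}
    value = w_small * v_small + w_large * v_large
    return policy, value

# Toy example for a 3-action state: p, v are the fused prior and value.
p, v = fuse({0: 0.6, 1: 0.3, 2: 0.1}, 0.2,
            {0: 0.4, 1: 0.4, 2: 0.2}, 0.4)
```

Because each input policy is a distribution and the weights sum to one, the fused policy remains a valid distribution.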
Selection in each tree follows the PUCT formula:

$$a^* = \arg\max_{a} \left[ Q(s,a) + c_{\mathrm{puct}}\, p(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right]$$
Upon evaluation of a leaf node $s_\ell$ with result $v(s_\ell)$, a standard MCTS backup is performed along the selection path:

$$N(s,a) \leftarrow N(s,a) + 1, \qquad Q(s,a) \leftarrow Q(s,a) + \frac{v(s_\ell) - Q(s,a)}{N(s,a)}$$

Updates from $T_S$ and $T_L$ reinforce each other through the shared $N$ and $Q$ values.
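The selection and backup steps can be sketched in a few lines; the node representation (a dict of per-action statistics) and the exploration constant are illustrative choices.

```python
import math

def puct_select(node, c_puct=1.5):
    """Pick the action maximizing Q + c * p * sqrt(N_parent) / (1 + N_child).

    node maps action -> {"q": mean value, "n": visit count, "p": prior};
    c_puct is an illustrative exploration constant.
    """
    total_n = sum(child["n"] for child in node.values())
    def score(child):
        return child["q"] + c_puct * child["p"] * math.sqrt(total_n) / (1 + child["n"])
    return max(node, key=lambda a: score(node[a]))

def backup(path, value):
    """Standard MCTS backup: increment visits, move Q toward the running mean."""
    for child in path:
        child["n"] += 1
        child["q"] += (value - child["q"]) / child["n"]

# Two actions with equal priors: the unvisited one wins on the exploration term.
root = {0: {"q": 0.0, "n": 1, "p": 0.5},
        1: {"q": 0.0, "n": 0, "p": 0.5}}
best = puct_select(root)
backup([root[best]], value=1.0)
```

Incremental-mean backup is what makes the shared $N$ and $Q$ statistics cheap to update from either tree.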
3. Simulation Scheduling and Algorithmic Details
Given simulation budgets $b_S$ and $b_L$ for $f_S$ and $f_L$ respectively, the algorithm interleaves $b_S$ simulations with $f_S$ and $b_L$ simulations with $f_L$. The default scheduling samples a $b_S/(b_S+b_L)$ fraction of the iterations uniformly at random for the small net, the remainder for the large net.
For $f_S$ simulations:
- SELECT in $T_S$ via PUCT to obtain a leaf $s_\ell$.
- EXPAND & EVALUATE: $(p_S(s_\ell), v_S(s_\ell)) = f_S(s_\ell)$.
- BACKUP in $T_S$ and update shared statistics.
For $f_L$ simulations:
- Select the unevaluated leaf $s_\ell$ in $T_L$ with the highest $N_S(s_\ell)$ (visit count from $T_S$); fall back to PUCT if no such node exists.
- EXPAND & EVALUATE: $(p_L(s_\ell), v_L(s_\ell)) = f_L(s_\ell)$.
- BACKUP in $T_L$ and update shared statistics.
The normalized action distribution for play is $\pi(a \mid s_0) \propto N(s_0, a)^{1/\tau}$, for root state $s_0$ and temperature $\tau$.
Alternative scheduling (e.g., round-robin, or front-loading the $f_S$ simulations) is permitted, provided budget constraints are respected.
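The default random schedule and the play distribution above can be sketched as follows; the function names, the budget-proportional sampling of remaining simulations, and the default temperature are illustrative assumptions.

```python
import random

def mpv_schedule(b_small, b_large, seed=0):
    """Draw an interleaving of small-/large-net simulations.

    Each step is given to the small net with probability proportional to
    its remaining budget, mirroring the uniform-random default schedule.
    Returns the sequence of net labels ("S" or "L").
    """
    rng = random.Random(seed)
    order = []
    while b_small or b_large:
        if b_small and (not b_large or rng.random() < b_small / (b_small + b_large)):
            b_small -= 1
            order.append("S")
        else:
            b_large -= 1
            order.append("L")
    return order

def play_distribution(visit_counts, tau=1.0):
    """Normalized action distribution: pi(a) proportional to N(a)^(1/tau)."""
    powered = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    z = sum(powered.values())
    return {a: x / z for a, x in powered.items()}
```

With `tau -> 0` the distribution sharpens toward the most-visited action; `tau = 1` plays proportionally to visit counts.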
4. Balancing Exploration and High-Fidelity Estimation
Breadth of the search is attributed to $f_S$: its low computational cost allows extensive rollouts, constructing a large tree $T_S$. Accuracy, in contrast, comes from $f_L$ via selective, more costly evaluations targeted at the most promising regions, as indicated by $T_S$ through the $N_S$ priority.
Budget allocation is guided by a parameter $\lambda \in [0,1]$: $b_S = \lambda b$ and $b_L = (1-\lambda) b$ for total budget $b$. Empirically, an intermediate $\lambda$ yielded the best performance in NoGo. The mixing weights $w_S, w_L$ can be tuned to favor stronger networks (e.g., increasing $w_L$ to bias toward $f_L$).
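A worked example of the budget split, in cost-normalized units: one large-net evaluation defines the unit, while the small-net per-simulation cost of 0.25 below is an assumption for illustration, not a figure from the source.

```python
def split_budget(total_units, lam, small_cost=0.25, large_cost=1.0):
    """Convert a normalized budget into per-net simulation counts.

    lam is the allocation parameter; per-simulation costs are illustrative
    (one large-net evaluation defines the unit of budget).
    """
    b_small = int(lam * total_units / small_cost)        # many cheap simulations
    b_large = int((1 - lam) * total_units / large_cost)  # fewer costly ones
    return b_small, b_large
```

For example, an even split of 100 units buys 200 small-net simulations but only 50 large-net ones, which is exactly the breadth-versus-fidelity trade the section describes.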
5. Experimental Results: NoGo and AlphaZero Self-Play
Performance was validated using NoGo and AlphaZero self-play training protocols:
a) Supervised NoGo, 9×9:
- Budgets measured in normalized units (one $f_L$ evaluation = one unit).
- $f_S$ alone: lower peak Elo.
- $f_L$ alone: higher peak Elo than $f_S$, but below the combined search.
- MPV-MCTS ($f_S$ + $f_L$): highest peak Elo, a clear gain over large-only. Outperforms all intermediate-sized nets.
b) AZ-trained PV-NNs:
- $800$ self-play simulations per move, $100$K replay buffer.
- At eight training checkpoints, evaluate $f_S$ alone, $f_L$ alone, and MPV-MCTS.
- MPV-MCTS consistently exceeds both constituent nets by tens of Elo across all test budgets.
c) AlphaZero training with MPV-MCTS:
- Self-play: fixed simulation budget per move.
- Baselines: $f_L$-only with $200$, $400$, $800$ sims/move.
- Equalized “generated-game” budget: one $f_L$ simulation costs $1$ unit, one $f_S$ simulation a fraction of a unit, and one MPV-MCTS move a matched combined cost.
- After $2$M games:
- Best $f_L$-only baseline: peak Elo under both the $200$-sim and $800$-sim test settings.
- MPV-MCTS: higher Elo under both test settings, a consistent margin over the baseline.
- MPV-MCTS trained for $N$ games outperforms $f_L$-only trained for $2N$ games, with 51.2–56.6% win rates, approximately a $2\times$ acceleration in training.
6. Generalization, Extensions, and Practical Implications
MPV-MCTS is a general multi-teacher extension of PV-MCTS; any set of PV-NNs with differing accuracy/computational cost can be utilized. The scheme permits:
- A curriculum of incrementally larger “support” nets, with meta-learned mixing weights and simulation scheduling.
- Refinement of the priority function, such as PUCT-based selection, discounted parent visits, or hybrid heuristics.
- Deployment of “micro-nets” (e.g., 32 filters) under tight constraints; ablation indicates that the tree search driven by $f_S$ is critical regardless of individual network strength.
- Applicability to any turn-based game (e.g., chess, shogi, Hex) or continuous-action RL task, leveraging a fast “sampler” network alongside a slower “accurate” one.
- Ensemble and knowledge-distillation approaches: smaller nets may acquire high-quality targets from the Q-values of larger nets, potentially accelerating neural convergence.
In sum, MPV-MCTS realizes a theoretically principled and empirically validated division of labor between rapid search and high-fidelity evaluation, substantially enhancing both online performance and the efficiency of reinforcement learning in complex domains (Lan et al., 2019).