
Multiple Policy Value MCTS (MPV-MCTS)

Updated 19 February 2026
  • MPV-MCTS is an extension of PV-MCTS that integrates two policy-value networks to jointly enhance search breadth and evaluative accuracy.
  • It interleaves rapid simulations from a fast 'small net' with selective high-fidelity evaluations from a 'large net' using a shared tree framework.
  • Empirical studies on 9×9 NoGo and AlphaZero self-play show significant Elo gains and a training acceleration of approximately 2× over conventional methods.

Multiple Policy Value Monte Carlo Tree Search (MPV-MCTS) is an extension of policy value Monte Carlo Tree Search (PV-MCTS) that interleaves simulations using multiple policy-value neural networks (PV-NNs) of differing computational cost and predictive accuracy. In this approach, two PV-NNs, a "small net" f_S and a "large net" f_L, are jointly leveraged within a shared tree expansion and backup mechanism. This framework lets the agent benefit from both high simulation throughput and improved policy/value estimation, offering a principled balance between search breadth and evaluative accuracy. Empirical studies on the 9×9 NoGo domain demonstrate statistically significant gains over conventional PV-MCTS and improved efficiency in AlphaZero-style self-play training (Lan et al., 2019).

1. Network Architectures, Training, and Intuition

MPV-MCTS employs two distinct PV-NNs, each grounded in AlphaZero-style residual towers:

  • f_S ("small net"):
    • Architecture: 64 filters, 5 residual blocks (f_{64,5}).
    • Supervised: 200K NoGo self-play games (≈10^7 positions).
    • AlphaZero (AZ): 800 simulations, PUCT c = 1.5, replay buffer 100K, 500K SGD steps on 60 self-play GPUs + 4 training GPUs (≈2M games).
  • f_L ("large net"):
    • Architecture: 128 filters, 10 residual blocks (f_{128,10}); its evaluation cost serves as the baseline for one unit of normalized budget.
    • Training: identical data and AZ regimen with a separate weight schedule.

The design rationale is that f_S enables rapid rollouts and broader search, while f_L provides more reliable policy priors and state-value estimates. MPV-MCTS fuses their outputs through coordinated search and aggregation.

2. Mathematical Fusion and Shared-Tree Framework

MPV-MCTS grows two parallel trees, T_S and T_L, sharing priors and value estimates at co-visited nodes. Given state s:

  • f_S(s) → (p_S(·|s), V_S(s))
  • f_L(s) → (p_L(·|s), V_L(s))

The combined policy P and value V used in selection and backup are:

P(s,a) = β·p_S(a|s) + (1−β)·p_L(a|s)

V(s) = α·V_S(s) + (1−α)·V_L(s)

where α, β ∈ [0, 1]. In the reported experiments, α = 0.5 and β = 0 (i.e., P = p_L and V = (V_S + V_L)/2).
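
The fusion above is a simple convex combination, which can be sketched in a few lines of Python (function and variable names are ours, not the paper's):

```python
import numpy as np

def fuse(p_S, p_L, v_S, v_L, alpha=0.5, beta=0.0):
    """Blend the two nets' outputs; alpha/beta follow the equations above."""
    P = beta * p_S + (1.0 - beta) * p_L    # combined policy prior
    V = alpha * v_S + (1.0 - alpha) * v_L  # combined value estimate
    return P, V

# Reported setting (alpha=0.5, beta=0): P is exactly the large net's
# prior, and V is the plain average of the two value heads.
p_S = np.array([0.7, 0.2, 0.1])
p_L = np.array([0.4, 0.4, 0.2])
P, V = fuse(p_S, p_L, v_S=0.3, v_L=0.5)  # → P == p_L, V == 0.4
```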

Selection in each tree follows the PUCT formula:

a* = argmax_a [Q(s,a) + u(s,a)]

u(s,a) = c_puct · P(s,a) · √N(s) / (1 + N(s,a))

Upon evaluation of a leaf node s_leaf, with result v (the evaluating net's value estimate of s_leaf), standard MCTS backup is performed:

N(s,a) ← N(s,a) + 1
W(s,a) ← W(s,a) + v
Q(s,a) ← W(s,a) / N(s,a)

Updates from T_S and T_L reinforce each other through the shared P and Q values.
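
Both steps can be sketched with a minimal edge-statistics structure (a dict per node with N, W, Q, P fields; an illustrative layout of ours, not the authors' code):

```python
import math

def puct_select(edges, c_puct=1.5):
    """Return the action maximizing Q(s,a) + u(s,a) over a node's edges.
    Each edge carries a visit count N, total value W, and prior P."""
    sqrt_N = math.sqrt(sum(e["N"] for e in edges.values())) or 1.0
    def score(e):
        q = e["W"] / e["N"] if e["N"] > 0 else 0.0
        u = c_puct * e["P"] * sqrt_N / (1.0 + e["N"])
        return q + u
    return max(edges, key=lambda a: score(edges[a]))

def backup(path, v):
    """Standard MCTS backup along the selected edges."""
    for e in path:
        e["N"] += 1
        e["W"] += v
        e["Q"] = e["W"] / e["N"]
```

On a fresh node all Q terms are zero, so selection is driven purely by the priors P, as in the PUCT formula above.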

3. Simulation Scheduling and Algorithmic Details

Given simulation budgets b_S ≥ b_L for f_S and f_L respectively, the algorithm interleaves b_S simulations with f_S and b_L with f_L. The default schedule samples b_S of the b_S + b_L iterations uniformly at random for the small net and assigns the remainder to the large net.

For f_S simulations:

  1. SELECT in T_S via PUCT to obtain a leaf s_leaf.
  2. EXPAND & EVALUATE: (p, v) = f_S(s_leaf).
  3. BACKUP in T_S and update shared statistics.

For f_L simulations:

  1. SELECT the unevaluated leaf in T_L with the highest N_S(s) (visit count from T_S); fall back to PUCT if no such node exists.
  2. EXPAND & EVALUATE: (p, v) = f_L(s_leaf).
  3. BACKUP in T_L and update shared statistics.

The normalized action distribution for play is π(a) ∝ N_S(root, a)^{1/τ}.
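
The play rule is just a power law over visit counts; a one-function sketch (names ours):

```python
import numpy as np

def play_distribution(root_visits_S, tau=1.0):
    """pi(a) proportional to N_S(root, a)^(1/tau), using the small-net
    tree's root visit counts as described above."""
    n = np.asarray(root_visits_S, dtype=float) ** (1.0 / tau)
    return n / n.sum()

# tau = 1 reduces to plain visit-count normalization:
pi = play_distribution([800, 150, 50])  # → [0.8, 0.15, 0.05]
```

Lower temperatures (τ → 0) sharpen the distribution toward the most-visited move.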

Alternative schedules (e.g., round-robin, front-loading f_L) are permitted, provided the budget constraints are respected.
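
The default schedule described above amounts to a random interleaving of the two budgets; a minimal sketch (the "S"/"L" dispatch tags are our own convention):

```python
import random

def make_schedule(b_S, b_L, rng=None):
    """Assign b_S of the b_S + b_L iterations to the small net uniformly
    at random; the rest go to the large net. Round-robin or front-loaded
    variants would simply build this list differently."""
    rng = rng or random.Random(0)
    slots = ["S"] * b_S + ["L"] * b_L
    rng.shuffle(slots)
    return slots

schedule = make_schedule(800, 100)
# Each "S" slot runs one f_S simulation (SELECT/EXPAND/BACKUP in T_S);
# each "L" slot runs one f_L simulation with the N_S-priority leaf choice.
```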

4. Balancing Exploration and High-Fidelity Estimation

Search breadth is attributed to f_S: its low computational cost permits extensive rollouts (b_S ≫ b_L), constructing a large T_S. Accuracy, in contrast, comes from f_L via fewer, more costly evaluations (b_L), targeted at the most promising regions as indicated by T_S through the N_S priority.

Budget allocation is guided by a parameter r ∈ [0, 1]: b_L = rB and b_S = (1−r)B for a total budget B. Empirically, r ≈ 0.5 yielded the best performance in NoGo. The mixing weights (α, β) can be tuned to favor the stronger network (e.g., (0.2, 0) biases V toward V_L).
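
The split is simple arithmetic; a sketch under the paper's normalized-budget convention (function name ours):

```python
def split_budget(B, r=0.5):
    """b_L = r*B for the large net, b_S = (1 - r)*B for the small net."""
    b_L = int(r * B)
    return B - b_L, b_L  # (b_S, b_L)

# r = 0.5 on a budget of 1600 units gives an even 800/800 split.
assert split_budget(1600) == (800, 800)
```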

5. Experimental Results: NoGo and AlphaZero Self-Play

Performance was validated using NoGo and AlphaZero self-play training protocols:

a) Supervised NoGo, 9×9:

  • Budgets B ∈ {100, 200, 400, 800, 1600} normalized units.
  • f_{64,5} alone: peak ~323 Elo (B = 1600).
  • f_{128,10} alone: peak ~472 Elo.
  • MPV-MCTS (r = 0.5): peak ~527 Elo (+55 over large-only); outperforms all intermediate-sized nets.

b) AZ-trained PV-NNs:

  • 800 self-play simulations per move, c = 1.5, 100K replay buffer, ~2M games.
  • At eight checkpoints, each of f_S, f_L, and MPV-MCTS is evaluated.
  • MPV-MCTS consistently exceeds both constituent nets by tens of Elo across all test budgets.

c) AlphaZero training with MPV-MCTS:

  • Self-play: (b_S = 800, b_L = 100) per move.
  • Baselines: f_L-only with 200, 400, and 800 simulations/move.
  • Equalized "generated-game" budget: f_L@200 sims → 1 unit, f_L@800 sims → 0.25 unit, MPV → 1 unit.
  • After 2M games:
    • Best f_L@800: +53 Elo (200-sim test), +118 Elo (800-sim test).
    • MPV: +279 Elo / +370 Elo (+226 / +252 over baseline).
    • MPV trained for N games outperforms f_L@800 trained for 2N games at 51.2–56.6% win rates, approximately a 2× acceleration in training.

6. Generalization, Extensions, and Practical Implications

MPV-MCTS is a general multi-teacher extension of PV-MCTS; any set of PV-NNs with differing accuracy/computational cost can be utilized. The scheme permits:

  • A curriculum of incrementally larger “support” nets, with meta-learned mixing weights and simulation scheduling.
  • Refinement of the f_L priority function, such as PUCT-based selection, discounted parent visits, or hybrid heuristics.
  • Deployment of "micro-nets" (e.g., 32 filters) under tight constraints; ablation indicates the tree search driven by T_S is critical regardless of individual network strength.
  • Applicability to any turn-based game (e.g., chess, shogi, Hex) or continuous-action RL tasks, pairing a fast "sampler" network with a slower "accurate" one.
  • Facilitation of ensemble approaches and knowledge distillation: smaller nets may acquire high-quality targets from the Q-values of larger nets, potentially accelerating convergence.

In sum, MPV-MCTS realizes a theoretically principled and empirically validated division of labor between rapid search and high-fidelity evaluation, substantially enhancing both online performance and the efficiency of reinforcement learning in complex domains (Lan et al., 2019).

