Progressive MCGS: Efficient Graph Search
- Progressive MCGS is a DAG-based extension of MCTS that merges transposition-equivalent states to share information and improve search efficiency.
- It employs refined simulation, expansion, and backpropagation techniques, reducing memory usage by 30–70% and NN evaluations by 25–40%.
- Empirical results in chess, Crazyhouse, and POMDP benchmarks demonstrate significant performance gains and effective offline planning.
Progressive Monte Carlo Graph Search (MCGS) is a family of algorithms that generalize the classical Monte Carlo Tree Search (MCTS) framework to operate over directed acyclic graphs (DAGs) rather than trees. By merging transposition-equivalent states or beliefs, MCGS variants share information across different trajectories, enable substantial reductions in memory and computational cost, and support improved search efficacy in both fully and partially observable domains. This approach has been rigorously formulated for deterministic discrete domains such as Chess and Crazyhouse, as well as for large-scale offline planning in partially observable Markov decision processes (POMDPs) (Czech et al., 2020, You et al., 28 Jul 2025).
1. DAG-Based Search: Core Structure and Rationale
In Progressive MCGS, each unique state (in deterministic settings) or belief (in POMDPs) is represented as a node in a DAG rather than a tree. For deterministic games, nodes correspond to board positions augmented with side-to-move and a step counter, while in POMDPs, nodes represent particle-based beliefs over the underlying state space. Edges encode action- or observation-conditioned transitions.
The critical feature is "transposition detection" or "belief merging": when a new state (or belief) is encountered, the algorithm uses a hash key or a distance metric (for beliefs, a distance over belief particles compared against a merging threshold) to determine whether an equivalent node already exists. On a match, a new incoming edge is added to the existing node; otherwise, a new node is created. This folding of equivalent states or beliefs ensures that information from every trajectory reaching a node is shared, providing both memory savings and faster learning of value estimates (Czech et al., 2020, You et al., 28 Jul 2025).
| Feature | MCTS Tree | Progressive MCGS DAG |
|---|---|---|
| State duplication | Yes | No (merged via key or metric) |
| Memory efficiency | Lower | Higher (30–70% reduction) |
| Cross-trajectory learning | No | Yes |
This suggests that as transposition density or belief redundancy increases, gains from DAG-based merging become more significant.
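The lookup-or-create step behind transposition detection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `Node`, `Graph`, and `get_or_create`, and the `(position_hash, step)` key layout, are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    key: tuple                     # (position_hash, step): hypothetical key layout
    visits: int = 0
    value_sum: float = 0.0
    parents: list = field(default_factory=list)  # incoming edges as (parent_key, action)


class Graph:
    def __init__(self):
        self.nodes = {}            # key -> Node, one entry per unique state

    def get_or_create(self, position_hash, step, parent_key=None, action=None):
        """Return (node, is_new). A transposition only adds an incoming edge."""
        key = (position_hash, step)
        node = self.nodes.get(key)
        is_new = node is None
        if is_new:
            node = Node(key)
            self.nodes[key] = node
        if parent_key is not None:
            node.parents.append((parent_key, action))
        return node, is_new
```

Two different move orders that reach the same position at the same step resolve to one `Node` with two entries in `parents`, so statistics stored on that node are shared by both paths.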
2. Simulation, Expansion, and Backpropagation Mechanisms
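Selection and expansion in MCGS extend the standard PUCT approach: at each node $s$, the child action $a$ is selected to maximize $Q(s,a) + U(s,a)$, with $U(s,a)$ defined via policy priors $P(s,a)$, visit counts $N(s,a)$, and a PUCT constant $c_{\text{puct}}$,
$$U(s,a) = c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)},$$
where $c_{\text{puct}}$ is scheduled according to total child visits (Czech et al., 2020).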
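A PUCT-style selection step can be sketched in a few lines. The logarithmic schedule and the default values `c_init` and `c_base` below follow the commonly used AlphaZero-style form and are assumptions for illustration, as are the function names; the source only states that the constant depends on total child visits.

```python
import math


def cpuct_schedule(total_visits, c_init=2.5, c_base=19652.0):
    """Exploration constant that grows slowly with total child visits (assumed form)."""
    return math.log((total_visits + c_base + 1.0) / c_base) + c_init


def select_puct(q, n, p):
    """Return the action index maximising Q + U for parallel lists q, n, p."""
    total = sum(n)
    c = cpuct_schedule(total)

    def score(a):
        u = c * p[a] * math.sqrt(total) / (1 + n[a])
        return q[a] + u

    return max(range(len(q)), key=score)
```

With equal value estimates and priors, the less-visited action receives the larger exploration bonus and is selected; with equal visit counts and priors, the action with the higher value estimate wins.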
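In POMDP settings, selection uses UCB over action values at each FSC (finite-state controller) node $n$, which in the standard UCB1 form reads:
$$a^{*} = \arg\max_a \left[ Q(n,a) + c \sqrt{\frac{\ln N(n)}{N(n,a)}} \right]$$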
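For comparison with PUCT, a UCB-based choice at a single node might look like the sketch below. It assumes the textbook UCB1 rule with exploration constant `c` and tries unvisited actions first; the function name and the default `c=1.0` are illustrative choices, not values from the source.

```python
import math


def select_ucb1(q, n, c=1.0):
    """Pick an action index by UCB1; untried actions are taken first."""
    for a, count in enumerate(n):
        if count == 0:
            return a
    total = sum(n)
    return max(
        range(len(q)),
        key=lambda a: q[a] + c * math.sqrt(math.log(total) / n[a]),
    )
```

Unlike PUCT, this rule uses no policy prior, so exploration is driven purely by visit counts.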
Simulation proceeds along graph edges until expansion (new node or first action at node), at which point new NN outputs are stored (deterministic) or a particle-based belief update is performed (POMDPs).
Backpropagation in both cases updates edge- and node-level statistics (mean value and visit counts), but in DAGs special care is needed to handle transposition-induced "value leaks." If an already expanded transposition node is reached and the disparity between the incoming edge's Q-value and the node's own value estimate exceeds a small threshold, local correction and early backpropagation are performed: a corrected target value is computed for the edge so that its Q-value moves toward the node's value estimate, the target is clipped to the valid value range, and the simulation concludes, propagating the correction along the visited path (Czech et al., 2020). This avoids redundant neural network evaluations and accelerates convergence.
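The correction at a transposition can be illustrated with a small sketch. It assumes that the edge estimate is a running mean over `n` visits and that both values are expressed from the same player's perspective; under that assumption, the unclipped backup value that would move the mean exactly onto the node's value is `v_star - n * (q - v_star)`. The function names, the threshold `eps=0.01`, and the value range `[-1, 1]` are illustrative and are not claimed to match the paper's exact formulation.

```python
def correction_target(q, n, v_star, eps=0.01, v_min=-1.0, v_max=1.0):
    """Return a clipped backup value, or None if the estimates already agree.

    Assumes q is a running mean over n visits, expressed from the same
    player's perspective as v_star.
    """
    delta = q - v_star
    if abs(delta) <= eps:
        return None                      # no leak: continue the simulation normally
    target = v_star - n * delta          # unclipped value that would move the mean onto v_star
    return max(v_min, min(target, v_max))


def backup(q, n, value):
    """Fold one backed-up value into a running mean; returns (new_q, new_n)."""
    return (q * n + value) / (n + 1), n + 1
```

When the clipped target equals the unclipped one, a single backup aligns the edge with the node; when clipping applies, the edge still moves toward the node's value and later visits close the remaining gap.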
3. Progressive Graph Growth and Efficiency
A defining attribute is the dynamic, "progressive" expansion of the DAG: each simulation either inserts a new node (when a novel state or belief is seen) or adds a new incoming edge to an existing node. Since nodes are uniquely keyed (by position hash and step counter for games; by belief distance in POMDPs), acyclicity is preserved.
Key mechanisms include:
- Statistic sharing: Visit counts and value estimates are aggregated over all edges to a node.
- Memory and compute efficiency: Only one copy of each unique node and its NN or value estimate is stored, reducing memory usage by 30–70% and neural network calls by 25–40% in practice (Czech et al., 2020).
- Early termination: Value correction at transpositions allows for immediate simulation cutoff and correction propagation, thus improving wall-clock efficiency.
In POMDP domains, beliefs are merged using a nearest-neighbor search under a belief distance metric with a merging threshold, and K-means clustering is used to limit the explosion of observation branches in continuous or high-cardinality settings (You et al., 28 Jul 2025).
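A threshold-based belief merge can be sketched as below. Several simplifications are assumed for illustration: beliefs are normalised histograms over discrete state particles, the distance is L1, and the nearest neighbor is found by a linear scan rather than a dedicated index structure. The function names are hypothetical.

```python
from collections import Counter


def to_histogram(particles):
    """Normalise a list of hashable state particles into {state: weight}."""
    counts = Counter(particles)
    total = len(particles)
    return {s: c / total for s, c in counts.items()}


def l1_distance(b1, b2):
    """L1 distance between two histogram beliefs (assumed metric)."""
    return sum(abs(b1.get(s, 0.0) - b2.get(s, 0.0)) for s in set(b1) | set(b2))


def merge_or_add(beliefs, particles, threshold):
    """Return (index, merged): reuse the nearest stored belief if close enough."""
    new = to_histogram(particles)
    best_i, best_d = None, float("inf")
    for i, b in enumerate(beliefs):
        d = l1_distance(new, b)
        if d < best_d:
            best_i, best_d = i, d
    if best_i is not None and best_d <= threshold:
        return best_i, True
    beliefs.append(new)
    return len(beliefs) - 1, False
```

Raising `threshold` merges more aggressively and keeps the graph small; lowering it preserves finer distinctions between beliefs at the cost of more nodes.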
4. Extensions: Exploration, Domain Knowledge, and Solver Enhancements
Progressive MCGS admits several extensions:
- ε-Greedy Exploration: With probability $\epsilon$, a new simulation begins with a randomly selected, so-far-unexpanded action at a randomly chosen depth. These "disconnected" trajectories foster exploration and help the search escape local optima. In POMDP applications, progressive widening is also applied to grow candidate action sets gradually as node visit counts increase (Czech et al., 2020, You et al., 28 Jul 2025).
- Revised Terminal Solver: Node states are classified as WIN, LOSS, DRAW, or their tablebase variants. End-ply lengths (e.g., forced mate sequences) are stored per node, and specific backpropagation rules determine node resolution and possible search pruning (e.g., upon TB_LOSS or resolved LOSS) (Czech et al., 2020).
- Domain-Knowledge Constraints: At selected expansion points, all forcing moves (e.g., checks in chess) are explored before others, independent of policy prior. This ensures critical tactical lines are not overlooked due to policy network bias. Like ε-greedy branches, these constraint-invoking expansions are isolated above the branching point (Czech et al., 2020).
- Observation Clustering and Belief Merging (POMDPs): In continuous observation spaces, K-means clustering on observation samples allows the algorithm to group similar observations, associate them with a single merged belief, and avoid unmanageable graph growth (You et al., 28 Jul 2025).
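The observation-clustering idea from the last item can be illustrated with a plain Lloyd's-algorithm K-means over sampled observations. This is a generic sketch, not the clustering code used in the cited work; the function name, the iteration count, and the fixed seed are assumptions.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each observation goes to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, labels
```

Each cluster label can then stand in for an observation branch, so a node has at most `k` observation-conditioned successors instead of one per distinct sample.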
5. Empirical Performance and Scalability
Progressive MCGS has been systematically evaluated in both deterministic domains (chess, Crazyhouse) and large-scale POMDP benchmarks.
Deterministic Domains (Czech et al., 2020):
- Experiments using the CrazyAra engine demonstrated that MCGS yields an advantage of +110 Elo in Crazyhouse (5 s/move), and additive gains (totaling +310 Elo) when all extensions are combined, compared to AlphaZero* using standard PUCT/MCTS with a transposition hash.
- In chess (5 s/move), MCGS provides a combined gain of +69 Elo, with individual extensions contributing +10–+40 Elo.
- Memory usage was reduced by 30–70%, and neural network evaluations by 25–40% due to early termination and node merging.
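To put the Elo gains above in perspective, they can be converted to an expected score per game using the standard logistic Elo model. The helper below is a generic conversion and is not taken from the cited evaluation.

```python
def expected_score(elo_diff):
    """Expected score of the stronger side under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


for gain in (69, 110, 310):
    print(gain, round(expected_score(gain), 3))
```

Under this model, a +110 Elo advantage corresponds to an expected score of roughly 65%, and +310 Elo to roughly 86%.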
POMDP Domains (You et al., 28 Jul 2025):
- On RockSample benchmarks (including large configurations such as RS(15,15)), POMCGS matches or exceeds offline SARSOP and is competitive with online planners such as DESPOT and AdaOPS.
- On continuous observation domains such as Light-Dark, POMCGS matches AdaOPS when observation dimensionality is modest and K-means clustering suffices. On high-dimensional observation problems (Laser Tag, 8-D), finding effective policies remains challenging, suggesting observation clustering as a current limitation.
- Storage and computation scale with the size of the graph, with up to thousands of graph nodes manageable before convergence. The number of simulations per graph update step and the number of particles per belief are the main budgets to tune.
Parameter sensitivity studies show that excessive belief merging (a merging threshold set too high) degrades solution quality, while insufficient merging or clustering leads to computational blowup. Moderate values for the action progressive widening (APW) parameters are robust.
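The role of the APW parameters can be seen in a small sketch of the widening test. It assumes the common power-law rule, in which a node visited $N$ times may hold about $k N^{\alpha}$ candidate actions; the rounding, the function names, and the defaults `k=2.0` and `alpha=0.5` are illustrative and are not the values reported in the study.

```python
import math


def allowed_actions(visits, k=2.0, alpha=0.5):
    """Number of candidate actions a node may hold after `visits` visits."""
    return max(1, math.ceil(k * visits ** alpha))


def should_widen(num_actions, visits, k=2.0, alpha=0.5):
    """True if the node may add another candidate action."""
    return num_actions < allowed_actions(visits, k, alpha)
```

Larger `k` or `alpha` admits new actions sooner and widens the graph faster; smaller values concentrate simulations on the actions already present.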
6. Practical Implementation and Pitfalls
For deterministic discrete domains:
- State hashing (augmented with move count) is used for unique node identification.
- All value and visit statistics are stored per DAG node or edge; neural network outputs are cached and reused across all incoming trajectories.
- Value correction and early termination at transpositions require careful backpropagation to avoid overcounting or oscillation.
In partially observable domains:
- Each node maintains a particle-based belief; belief merging utilizes a cover-tree for efficient nearest-neighbor search.
- K-means clustering on observations avoids graph explosion for continuous or complex observation spaces, but becomes ineffective as observation dimensionality increases.
- The accuracy of the fully observed MDP heuristic used for rollouts is critical.
Common failure cases include overly aggressive merging, which collapses useful distinctions between beliefs (yielding weak policies), and insufficient merging or clustering, which causes exponential resource consumption. The quality of the MDP heuristic and the appropriateness of APW and K-means parameters are essential for scalability and convergence.
7. Theoretical and Practical Significance
Progressive Monte Carlo Graph Search generalizes PUCT-style MCTS frameworks to enable information-efficient search in domains with significant state or belief redundancy. By extending to DAGs, introducing dynamic edge and node management, and integrating enhancement mechanisms such as ε-greedy exploration, refined terminal solving, and domain knowledge integration, MCGS achieves major gains in empirical strength, memory efficiency, and compute performance.
In deterministic board games, these methods outperform baseline AlphaZero-style MCTS both in Elo and computational footprint without requiring neural network retraining (Czech et al., 2020). In partially observable domains, MCGS enables the derivation of finite-state policy controllers for large POMDPs, providing a viable offline alternative to online planning in previously intractable settings (You et al., 28 Jul 2025).
Limitations remain in scaling to very high-dimensional observation spaces and in efficient graph merging at scale, particularly in POMDPs with rich dynamics. A plausible implication is that future advances in observation clustering and heuristic design will further increase the reach and efficacy of DAG-based Monte Carlo search methods.