Multi-Stage Zero-Variance Elimination
- Multi-Stage Zero-Variance Elimination is a variance reduction technique that uses stage-wise predictive baselines to yield deterministic continuation value estimates.
- It partitions game trajectories into multiple stages to compute unbiased, zero-variance estimators, thereby accelerating convergence compared to standard MCCFR.
- Empirical results in complex games like Leduc poker demonstrate 5–10× speedups and improved scalability with reduced sampling noise.
Multi-Stage Zero-Variance Elimination is a variance reduction technique for sampling-based algorithms in extensive-form games (EFGs) that leverages predictive baselines at multiple depth segments of the game tree. By systematically constructing and applying stage-wise predictive baselines along sampled trajectories, this method achieves deterministic value estimates per iteration, thereby eliminating the high variance present in standard Monte Carlo Counterfactual Regret Minimization (MCCFR) and leading to significantly improved empirical convergence rates (Davis et al., 2019).
1. Extensive-Form Games and Variance in Regret Minimization
Extensive-form games provide a general model for multi-agent interactions with imperfect information, formalized by:
- Players: finite set $N$ (typically $N = \{1, 2\}$ for two-player games).
- Histories $h \in H$: prefixes of action sequences; terminal histories $z \in Z$.
- Player-to-move function $P(h) \in N \cup \{c\}$, where $c$ denotes chance, acting with a fixed distribution $\sigma_c(h, \cdot)$.
- Information sets $I \in \mathcal{I}_i$: partitions of player $i$'s move histories.
- Behavioral strategies $\sigma_i$ assign distributions $\sigma_i(I) \in \Delta(A(I))$ to actions at each infoset.
- Utilities $u_i : Z \to \mathbb{R}$, with $u_2 = -u_1$ for two-player zero-sum.
- Expected utility: $u_i(\sigma) = \sum_{z \in Z} \pi^\sigma(z)\, u_i(z)$, where $\pi^\sigma(z)$ is the reach probability of $z$ under $\sigma$.
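As a concrete illustration of the expected-utility definition above, the following sketch enumerates the terminal histories of a toy one-shot matching-pennies game; the game, payoffs, and strategies are illustrative assumptions, not taken from the source:

```python
# Minimal sketch (assumed toy game): expected utility as the sum over
# terminal histories z of reach probability times payoff,
#   u_i(sigma) = sum_z pi^sigma(z) * u_i(z).
from itertools import product

# Behavioral strategies: each player independently picks H or T.
sigma = {1: {"H": 0.5, "T": 0.5}, 2: {"H": 0.6, "T": 0.4}}

def u1(a1, a2):
    # Player 1 wins +1 on a match, loses 1 otherwise (zero-sum: u2 = -u1).
    return 1.0 if a1 == a2 else -1.0

def expected_utility():
    # Sum reach probability times payoff over all four terminal histories.
    return sum(sigma[1][a1] * sigma[2][a2] * u1(a1, a2)
               for a1, a2 in product("HT", "HT"))

print(expected_utility())  # approximately 0 for these strategies
```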
Counterfactual regret minimization (CFR) algorithms update regrets and strategies by traversing the complete tree, but this is infeasible in large games. MCCFR overcomes this by sampling single trajectories but suffers from high variance in its estimators, dramatically slowing convergence (Davis et al., 2019).
2. Baseline-Corrected Value Estimation Framework
Variance in MCCFR is addressed through control variates, or baselines:
- For each history-action pair $(h, a)$, a baseline $b(h, a)$ approximates $\mathbb{E}_{z}[\hat{u}_i(ha \mid z)]$, the continuation value.
- The baseline-corrected sampled utility is
  $$\hat{u}_i^{b}(h, a \mid z) = b(h, a) + \frac{\delta(ha \sqsubseteq z)}{q(a \mid h)}\left(\hat{u}_i(ha \mid z) - b(h, a)\right),$$
  with $q(a \mid h)$ the sampling probability of action $a$ at $h$ and $\delta(ha \sqsubseteq z)$ the indicator that $ha$ is a prefix of the sampled terminal history $z$.
- Aggregated along the sampled sequence, this yields the recursive trajectory estimator
  $$\hat{u}_i(h \mid z) = \sum_{a \in A(h)} \sigma(h, a)\, \hat{u}_i^{b}(h, a \mid z),$$
  where $\hat{u}_i(z \mid z) = u_i(z)$ at terminal histories.
These estimators are unbiased for the expected utilities and preserve CFR convergence guarantees.
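The unbiasedness claim can be checked by exact enumeration at a single decision node. In this sketch the action values, sampling distribution, and candidate baselines are all assumed for illustration; the mean is unchanged by the baseline while the variance shrinks as the baseline approaches the true value:

```python
# Sketch (assumed one-node example): the baseline-corrected estimate
#   u_hat(a) = b(a) + 1[a sampled] / q(a) * (v(a) - b(a))
# is unbiased for v(a) for ANY baseline b; better baselines cut variance.
v = {0: 1.0, 1: -2.0}   # true continuation values (assumed)
q = {0: 0.3, 1: 0.7}    # sampling distribution over actions (assumed)

def estimate(a, sampled, b):
    # Baseline-corrected, importance-weighted sample value for action a.
    ind = 1.0 if sampled == a else 0.0
    return b[a] + (ind / q[a]) * (v[a] - b[a])

def mean_and_var(a, b):
    # Exact expectation and variance by enumerating the sampled action.
    mean = sum(q[s] * estimate(a, s, b) for s in q)
    var = sum(q[s] * (estimate(a, s, b) - mean) ** 2 for s in q)
    return mean, var

# No baseline, a rough baseline, and the oracle baseline b = v.
for b in ({0: 0.0, 1: 0.0}, {0: 0.9, 1: -1.5}, dict(v)):
    print(mean_and_var(0, b))  # mean is always v[0] = 1.0
```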
3. Predictive Baselines and Zero-Variance Propagation
If the baseline is set to the exact continuation value (termed the oracle or predictive baseline), then each term in the estimator becomes deterministic, as its variance collapses to zero:
$$\hat{u}_i^{b}(h, a \mid z) = b(h, a) + \frac{\delta(ha \sqsubseteq z)}{q(a \mid h)}\left(\hat{u}_i(ha \mid z) - b(h, a)\right) = b(h, a).$$
Although impractical globally, MCCFR's depth-first walk allows for on-the-fly recursive construction of predictive baselines along sampled trajectories. After each iteration, one can recompute the same trajectory under the updated strategy, setting
$$b^{t+1}(h, a) = \hat{u}_i(ha \mid z).$$
By induction, these baselines remain unbiased for the current strategy's continuation values. When applied, they guarantee the resulting sample estimate $\hat{u}_i(h \mid z)$ equals the exact continuation value deterministically on every sampled trajectory, conditional on the predictive baselines (Davis et al., 2019).
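The zero-variance propagation can be made concrete on an assumed two-level toy tree (values and probabilities below are illustrative, not from the source). With oracle baselines equal to the exact continuation values at both depths, the recursive baseline-corrected root estimate is identical for every sampled trajectory:

```python
# Sketch: with predictive (oracle) baselines, every correction term
# vanishes and the root estimate equals v_root on all trajectories.
sigma = {"a": 0.25, "b": 0.75}    # player strategy at both levels (assumed)
q = {"a": 0.5, "b": 0.5}          # sampling policy at both levels (assumed)
leaf = {("a", "a"): 4.0, ("a", "b"): -1.0,
        ("b", "a"): 0.0, ("b", "b"): 2.0}

# Exact continuation values: these serve as the predictive baselines.
v_child = {x: sum(sigma[y] * leaf[(x, y)] for y in sigma) for x in sigma}
v_root = sum(sigma[x] * v_child[x] for x in sigma)

def root_estimate(z):
    a1, a2 = z  # the sampled trajectory: one action per level
    # Depth 2: corrected child values, baseline = exact leaf utility.
    child_est = {}
    for x in sigma:
        total = 0.0
        for y in sigma:
            b = leaf[(x, y)]  # oracle baseline at depth 2
            ind = 1.0 if (x == a1 and y == a2) else 0.0
            total += sigma[y] * (b + (ind / q[y]) * (leaf[(x, y)] - b))
        child_est[x] = total
    # Depth 1: corrected root value, baseline = exact child value.
    total = 0.0
    for x in sigma:
        b = v_child[x]  # oracle baseline at depth 1
        ind = 1.0 if x == a1 else 0.0
        total += sigma[x] * (b + (ind / q[x]) * (child_est[x] - b))
    return total

# Every sampled trajectory yields exactly the same estimate v_root.
print({z: root_estimate(z) for z in leaf})
```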
4. Multi-Stage Zero-Variance Elimination Methodology
For deep trees, traversing and updating baselines for each node along the entire trajectory can be costly. Multi-Stage Zero-Variance Elimination overcomes this by partitioning the trajectory into $K$ consecutive stages, determined by increasing depth thresholds $0 = d_0 < d_1 < \cdots < d_K$, where $d_K$ is the terminal depth. For stage $k$:
- A separate predictive baseline $b_k(h, a)$ is maintained for depths $d_{k-1}$ through $d_k$.
- The trajectory is divided as $z = z^{(1)} z^{(2)} \cdots z^{(K)}$, with $z^{(k)}$ denoting the stage-$k$ segment.
- At each iteration, zero-variance estimators are recursively computed for each stage.
The overall sample-value estimate is accumulated as
$$\hat{u}_i(z) = \sum_{k=1}^{K} \hat{V}^{(k)}\big(h_{d_{k-1}}\big),$$
where $\hat{V}^{(k)}$ is the stage-$k$ baseline-corrected value computed on $z^{(k)}$ and $h_{d_{k-1}}$ is the trajectory node at depth $d_{k-1}$.
Each stage's value is exact (zero variance), hence the total estimator is deterministic given the predictive baselines.
Algorithm (plain-text pseudocode)
```
initialize regrets; baselines b_k^1(h,a) = 0 for all h, a and k = 1…K
for t = 1…T do
    sample a single terminal trajectory z ~ σ^t
    for k = 1 to K do
        let h_start ← node at depth d_{k-1} on z
        recursively compute baseline-corrected values V̂^{(k)} using b_k^t on z^{(k)}
        update regret sums at all infosets in stage k by
            r_t(I,a) ← π_{-i}^t(h) / π_{-i}^t(z) · [V̂^{(k)}(h,a) − V̂^{(k)}(h)]
    end for
    update each baseline b_k^{t+1}(h,a)
    perform regret-matching to obtain σ^{t+1}
end for
```
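The regret-matching step on the final line of the loop can be sketched as follows; this is standard regret matching (choose each action in proportion to its positive cumulative regret, uniform if none is positive), and the action names are purely illustrative:

```python
# Sketch of regret matching: sigma^{t+1}(a) proportional to max(R(a), 0).
def regret_matching(regrets):
    # Clip cumulative regrets at zero.
    pos = {a: max(r, 0.0) for a, r in regrets.items()}
    total = sum(pos.values())
    if total <= 0.0:
        # No positive regret: fall back to the uniform strategy.
        n = len(regrets)
        return {a: 1.0 / n for a in regrets}
    return {a: p / total for a, p in pos.items()}

print(regret_matching({"fold": -1.0, "call": 2.0, "raise": 6.0}))
# {'fold': 0.0, 'call': 0.25, 'raise': 0.75}
```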
5. Theoretical Results
The multi-stage scheme inherits all theoretical properties of the baseline-corrected MCCFR:
- Unbiasedness: At every infoset, the stage-aggregated estimator is unbiased for the true counterfactual value under the current strategy.
- Zero variance: If predictive baselines are updated to be exact at each stage, then for any trajectory the estimate $\hat{u}_i(z)$ is deterministic with variance zero.
- Regret minimization: The average regret converges to zero at the standard rate $O(1/\sqrt{T})$. In the absence of variance, empirical convergence accelerates, approaching the rate of full-tree CFR traversal (Davis et al., 2019).
6. Empirical Validation and Implementation
Empirical evaluation in Leduc poker and large hold'em variants yields:
- Use of predictive baselines in single-stage settings reduces estimator variance to zero (after full coverage), delivering convergence rates comparable to full-tree CFR.
- Multi-stage elimination matches the accuracy of CFR's full trajectory evaluation, while sampling only a single path, producing 5–10× speedups in wall-clock time.
- Baseline learning from trajectories (using history-dependent baselines) provides an order-of-magnitude improvement in convergence over no baseline.
- In practical terms, the per-iteration cost grows with the length of the sampled trajectory and the number of stages $K$, which remains acceptable since $K$ is typically small. Memory requirements scale with the number of visited node-action pairs per stage. For large domains, transposition tables or function approximation (e.g., neural networks) can be employed to generalize and share baselines (Davis et al., 2019).
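The per-stage storage described above can be sketched with sparse tables keyed by node-action pairs, so memory tracks only what is visited. The `StageBaselines` class, the `alpha` parameter, and the exponential-average update rule below are illustrative assumptions, not the source's specification:

```python
# Sketch (assumed update rule): one sparse baseline table per stage,
# updated with an exponential average b <- (1 - alpha)*b + alpha*v_hat.
from collections import defaultdict

class StageBaselines:
    def __init__(self, num_stages, alpha=0.1):
        self.alpha = alpha
        # Unvisited (h, a) pairs implicitly default to a zero baseline.
        self.tables = [defaultdict(float) for _ in range(num_stages)]

    def get(self, k, h, a):
        # Baseline b_k(h, a) for stage k.
        return self.tables[k][(h, a)]

    def update(self, k, h, a, v_hat):
        # Move the stored baseline toward the observed corrected value.
        b = self.tables[k][(h, a)]
        self.tables[k][(h, a)] = (1 - self.alpha) * b + self.alpha * v_hat

bl = StageBaselines(num_stages=3, alpha=0.5)
bl.update(0, "h0", "call", 4.0)
bl.update(0, "h0", "call", 4.0)
print(bl.get(0, "h0", "call"))  # 3.0: 0.0 -> 2.0 -> 3.0
```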
7. Practical Implications and Future Directions
The deployment of Multi-Stage Zero-Variance Elimination establishes a practical paradigm for fast, low-variance learning in extensive-form games with large or deep game trees. A plausible implication is heightened scalability to domains previously limited by sampling noise and regret convergence speed. While the only residual stochasticity arises from which infosets are chosen for update at each iteration, sophisticated sampling schemes may further diminish this effect. Advances in large-scale function approximation for baseline storage and transfer suggest continued applicability to real-world imperfect-information games, contributing to faster and more robust equilibrium finding (Davis et al., 2019).