
Multi-Stage Zero-Variance Elimination

Updated 9 December 2025
  • Multi-Stage Zero-Variance Elimination is a variance reduction technique that uses stage-wise predictive baselines to yield deterministic continuation value estimates.
  • It partitions game trajectories into multiple stages to compute unbiased, zero-variance estimators, thereby accelerating convergence compared to standard MCCFR.
  • Empirical results in complex games like Leduc poker demonstrate 5–10× speedups and improved scalability with reduced sampling noise.

Multi-Stage Zero-Variance Elimination is a variance reduction technique for sampling-based algorithms in extensive-form games (EFGs) that leverages predictive baselines at multiple depth segments of the game tree. By systematically constructing and applying stage-wise predictive baselines along sampled trajectories, this method achieves deterministic value estimates per iteration, thereby eliminating the high variance present in standard Monte Carlo Counterfactual Regret Minimization (MCCFR) and leading to significantly improved empirical convergence rates (Davis et al., 2019).

1. Extensive-Form Games and Variance in Regret Minimization

Extensive-form games provide a general model for multi-agent interactions with imperfect information, formalized by:

  • Players: a finite set N (typically N = \{1, 2\} for two-player games).
  • Histories H: prefixes of action sequences, with terminal histories Z \subseteq H.
  • Player-to-move function P : H \setminus Z \rightarrow N \cup \{c\}, where c denotes chance, which acts according to a fixed distribution \rho(h) \in \Delta_{A(h)}.
  • Information sets \mathcal{I}_i: partitions of the histories at which player i moves.
  • Behavioral strategies \sigma = \{\sigma_i \mid i \in N\} assign a distribution over actions at each infoset.
  • Utilities u_i : Z \to \mathbb{R}, with u_1(z) + u_2(z) = 0 in the two-player zero-sum case.
  • Expected utility: U_i(\sigma) = \sum_{z \in Z} \pi^\sigma(z) u_i(z), where \pi^\sigma(h) denotes the reach probability of h under \sigma.

Counterfactual regret minimization (CFR) algorithms update regrets and strategies by traversing the complete tree, but this is infeasible in large games. MCCFR overcomes this by sampling single trajectories but suffers from high variance in its estimators, dramatically slowing convergence (Davis et al., 2019).
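
To make the variance issue concrete, the following minimal Python sketch (a toy three-outcome game with illustrative numbers, not from the paper) enumerates U_i(\sigma) exactly and then forms repeated single-trajectory, importance-weighted estimates of the kind outcome-sampling MCCFR relies on; the sampled estimator is unbiased but carries substantial variance.

# Toy illustration: exact expected utility vs. outcome-sampled estimates.
import random

outcomes = {"z1": 4.0, "z2": -1.0, "z3": 0.5}   # u_1(z) at each terminal history
reach    = {"z1": 0.2, "z2": 0.5, "z3": 0.3}    # pi^sigma(z), the reach probabilities

# Exact value: U_1(sigma) = sum_z pi^sigma(z) * u_1(z)
exact_U1 = sum(reach[z] * outcomes[z] for z in outcomes)

def sampled_estimate(q):
    """One outcome-sampling estimate: pi^sigma(z) * u_1(z) / q(z) for a sampled z."""
    z = random.choices(list(outcomes), weights=[q[k] for k in outcomes])[0]
    return reach[z] * outcomes[z] / q[z]

samples = [sampled_estimate(reach) for _ in range(10_000)]   # on-policy sampling
mean = sum(samples) / len(samples)
var  = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"exact U_1 = {exact_U1:.3f}, sampled mean = {mean:.3f}, variance = {var:.3f}")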

2. Baseline-Corrected Value Estimation Framework

Variance in MCCFR is addressed through control variates, or baselines:

  • For each history-action pair (h, a), a baseline b(h,a) approximates the continuation value V_i(\sigma; ha).
  • The baseline-corrected sampled utility is:

\hat{V}_i(h,a; z; b) = \mathbb{1}_{(ha)\sqsubseteq z} \, \frac{u_i(z) - b(h,a)}{\Pr[(ha)\sqsubseteq z]} + b(h,a)

with \Pr[(ha)\sqsubseteq z] = \sigma(h,a) \cdot \Pr[z \sqsupseteq ha].

  • Aggregated over the sampled trajectory, this leads to the trajectory-sum estimator:

\hat{v}_i(\sigma, z; b) = \sum_{t=0}^{T-1} \frac{\pi_{-i}(h_t)}{\pi_{-i}(z)} \left[ u_i(z) - b(h_t, a_t) \right] + B(z)

where B(z) = \sum_{t=0}^{T-1} \frac{\pi_{-i}(h_t)}{\pi_{-i}(z)} b(h_t, a_t).

These estimators are unbiased for the expected utilities and preserve CFR convergence guarantees.
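
The following sketch (a toy one-decision setting with illustrative values, not the paper's code) instantiates the single-action estimator above: any fixed b keeps the estimate unbiased, while choosing b near the continuation value V_i(\sigma; ha) shrinks the variance; it vanishes entirely here because the post-action payoff in this toy case is deterministic.

# Toy control-variate check for the baseline-corrected estimator.
import random

p_a  = 0.5    # Pr[(ha) ⊑ z]: probability the sampled trajectory passes through (h, a)
V_ha = 2.0    # continuation value V_i(sigma; ha); deterministic after a in this toy case

def corrected_estimate(b):
    """V_hat = 1[(ha) ⊑ z] (u_i(z) - b) / Pr[(ha) ⊑ z] + b, for one sampled z."""
    if random.random() < p_a:                 # the trajectory went through (h, a)
        return (V_ha - b) / p_a + b
    return b                                  # otherwise only the baseline term remains

def mean_and_var(b, n=20_000):
    xs = [corrected_estimate(b) for _ in range(n)]
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / n

print("b = 0 (no baseline): mean %.3f, var %.3f" % mean_and_var(0.0))
print("b = V_i (oracle):    mean %.3f, var %.3f" % mean_and_var(V_ha))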

3. Predictive Baselines and Zero-Variance Propagation

If the baseline b^*(h,a) is set to the exact continuation value V_i(\sigma; ha) (termed the oracle or predictive baseline), then each term in the estimator becomes deterministic, as its variance collapses to zero:

\operatorname{Var}\left[\hat{v}_i(z; b^*)\right] = 0

Although maintaining exact oracle baselines globally is impractical, MCCFR's depth-first walk allows predictive baselines to be constructed recursively, on the fly, along the sampled trajectory. After each iteration, the same trajectory can be re-evaluated under the updated strategy, setting:

b^{t+1}(h,a) \leftarrow \text{sampled } V_i(\sigma^{t+1}; h, a)

By induction, these baselines remain unbiased estimates of the current strategy's continuation values. When applied, they guarantee that the resulting sample value equals U_i(\sigma) deterministically on every sampled trajectory, conditional on the predictive baselines (Davis et al., 2019).
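
One concrete way to realize this construction is sketched below (a toy two-level, single-player tree with on-policy sampling; the names and the recursive child-estimate form of the corrected value are illustrative assumptions, not the paper's code): corrected values are backed up recursively, and each visited baseline is overwritten with the sampled continuation value. Once every (h, a) has been visited under a fixed strategy, the root estimate equals U_i(\sigma) on every subsequent trajectory.

# Toy recursive baseline-corrected backup with predictive baseline updates.
import random

# Two decision levels, actions "l"/"r" at each; leaves hold the payoffs u_i(z).
payoff = {("l", "l"): 1.0, ("l", "r"): -2.0, ("r", "l"): 0.5, ("r", "r"): 3.0}
sigma  = {"l": 0.3, "r": 0.7}          # same action distribution at both levels
baseline = {}                          # b(h, a), keyed by (history prefix, action)

def backup(prefix=()):
    """Sample one action, recurse, and return the baseline-corrected value of `prefix`."""
    if len(prefix) == 2:               # terminal history: return u_i(z)
        return payoff[prefix]
    a_sampled = random.choices(["l", "r"], weights=[sigma["l"], sigma["r"]])[0]
    child_hat = backup(prefix + (a_sampled,))
    value = 0.0
    for a in ("l", "r"):
        b = baseline.get((prefix, a), 0.0)
        hat_a = b + (child_hat - b) / sigma[a] if a == a_sampled else b
        value += sigma[a] * hat_a
    # Predictive update: remember the sampled continuation value for the next iteration.
    baseline[(prefix, a_sampled)] = child_hat
    return value

true_value = sum(sigma[a1] * sigma[a2] * payoff[(a1, a2)]
                 for a1 in ("l", "r") for a2 in ("l", "r"))
samples = [backup() for _ in range(50)]
print("U_i(sigma) =", round(true_value, 4))
# Later estimates become exact once every (h, a) has been visited at least once.
print("last corrected root estimates:", [round(s, 3) for s in samples[-5:]])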

4. Multi-Stage Zero-Variance Elimination Methodology

For deep trees, traversing and updating baselines for each node along the entire trajectory can be costly. Multi-Stage Zero-Variance Elimination overcomes this by partitioning the trajectory into K consecutive stages, determined by increasing depth thresholds 0 = d_0 < d_1 < \ldots < d_K, with d_K equal to the terminal depth. For stage k:

  • A separate predictive baseline b_k(h,a) is maintained for depths d_{k-1} through d_k.
  • The trajectory z is divided as z = (h_0, a_0, \ldots, h_{d_1}, a_{d_1}, \ldots, h_{d_2}, \ldots, z), with z^{(k)} denoting the stage-k suffix.
  • At each iteration, zero-variance estimators \hat{V}_i^{(k)}(h_{d_{k-1}}; z; b_k^t) are computed recursively for each stage.

The overall sample-value estimate is accumulated as:

\hat{v}_i(z) = \sum_{k=1}^{K} \hat{V}_i^{(k)}(h_{d_{k-1}}; z)

Each stage's value is exact (zero variance), hence the total estimator is deterministic given the predictive baselines.

Algorithm (plain-text pseudocode)

initialize regrets and baselines: b_k^1(h,a) = 0 for all (h,a) and k = 1..K
for t = 1..T do
  sample a single terminal trajectory z ~ σ^t
  for k = 1 to K do
    let h_start ← node at depth d_{k−1} on z
    recursively compute baseline-corrected values V̂^{(k)} using b_k^t on z^{(k)}
    update regret sums at all infosets in stage k by
      r_t(I,a) ← π_{−i}^t(h) / π_{−i}^t(z) ⋅ [V̂^{(k)}(h,a) − V̂^{(k)}(h)]
  end for
  update each baseline b_k^{t+1}(h,a) toward the sampled continuation values on z^{(k)}
  perform regret-matching to obtain σ^{t+1}
end for
(Davis et al., 2019)
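
As a complement to the pseudocode, the sketch below (hypothetical depth thresholds and placeholder history labels) shows only the stage bookkeeping: how the thresholds d_0 < d_1 < \ldots < d_K cut a sampled trajectory into its segments z^{(k)}, and how each stage keeps its own baseline table. The per-stage corrected backup itself would follow the recursion sketched in Section 3.

# Stage bookkeeping for a multi-stage trajectory split (illustrative only).
from collections import defaultdict

d = [0, 3, 6, 9]                                            # depth thresholds d_0 .. d_K (here K = 3)
K = len(d) - 1
stage_baselines = [defaultdict(float) for _ in range(K)]    # one b_k(h, a) table per stage

def stage_of(depth):
    """Index k of the stage covering a node at this depth (depths d_{k-1} .. d_k - 1)."""
    for k in range(1, K + 1):
        if depth < d[k]:
            return k
    raise ValueError("depth beyond the terminal depth")

def stage_segments(trajectory):
    """Split a sampled trajectory [(h_0, a_0), (h_1, a_1), ...] into its stage pieces z^(k)."""
    segments = [[] for _ in range(K)]
    for depth, (h, a) in enumerate(trajectory):
        segments[stage_of(depth) - 1].append((h, a))
    return segments

# Example: a depth-9 trajectory with placeholder history/action labels.
traj = [(f"h{t}", f"a{t}") for t in range(d[-1])]
for k, segment in enumerate(stage_segments(traj), start=1):
    print(f"stage {k}: histories {[h for h, _ in segment]}")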

5. Theoretical Results

The multi-stage scheme inherits all theoretical properties of the baseline-corrected MCCFR:

  • Unbiasedness: At every infoset, the stage-aggregated estimator \hat{v}_i is unbiased for U_i(\sigma^t).
  • Zero-variance: If predictive baselines are updated to be exact at each stage, then for any trajectory, \hat{v}_i(z) is deterministic with variance zero.
  • Regret minimization: The average regret converges to zero at rate O(1/\sqrt{T}). In the absence of variance, empirical convergence accelerates, approaching the rate of full-tree CFR traversal (Davis et al., 2019); the standard bound behind this rate is restated after this list for orientation.
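
For orientation, the O(1/\sqrt{T}) claim corresponds to the standard CFR-style average-regret bound, shown here in its full-traversal form (with \Delta the payoff range and |A_i| the largest number of actions at player i's infosets); sampling variants satisfy the same rate up to high-probability terms:

\frac{R_i^T}{T} \le \frac{\Delta \, |\mathcal{I}_i| \, \sqrt{|A_i|}}{\sqrt{T}}

Eliminating estimator variance does not improve the exponent on T, but it removes the sampling-noise contribution that dominates MCCFR's empirical behavior, which is why convergence approaches the full-traversal curve.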

6. Empirical Validation and Implementation

Empirical evaluation in Leduc poker and large hold'em variants yields:

  • Use of predictive baselines in single-stage settings reduces estimator variance to zero (after full coverage), delivering convergence rates comparable to full-tree CFR.
  • Multi-stage elimination matches the accuracy of CFR's full trajectory evaluation, while sampling only a single path, producing 5–10× speedups in wall-clock time.
  • Baseline learning from trajectories (using history-dependent baselines) provides an order-of-magnitude improvement in convergence over no baseline.
  • In practical terms, the per-iteration cost is O(K \cdot \text{depth} \cdot \max |A|), which remains acceptable since typically K \ll \text{depth}. Memory requirements scale with the number of visited node-action pairs per stage. For large domains, transposition tables or function approximation (e.g., neural networks) can be employed to generalize and share baselines (Davis et al., 2019); a simple table-based sketch follows this list.
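
A minimal sketch of the table-based option (illustrative only; the keying scheme is an assumption, not the paper's interface): a transposition-table-style store keyed by a caller-supplied abstraction of (h, a), updated with an exponential moving average so transposed or similar histories share and smooth a single baseline entry.

# Illustrative shared baseline store (not the paper's code).
from collections import defaultdict

class BaselineTable:
    def __init__(self, decay=0.1):
        self.decay = decay                      # step size of the moving-average update
        self.values = defaultdict(float)        # b(h, a), initialized to 0 as in the pseudocode

    def get(self, key):
        """Current baseline for an abstracted (history, action) key."""
        return self.values[key]

    def update(self, key, sampled_continuation_value):
        """Move the stored baseline toward the newly sampled continuation value."""
        self.values[key] += self.decay * (sampled_continuation_value - self.values[key])

# Usage: key by public information plus the action so transposed private states share
# one entry (the bucketing shown here is a stand-in, not prescribed by the paper).
table = BaselineTable(decay=0.2)
table.update(("pot=6,board=KsQh", "bet"), 1.5)
print(table.get(("pot=6,board=KsQh", "bet")))   # 0.3 after one update from 0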

7. Practical Implications and Future Directions

The deployment of Multi-Stage Zero-Variance Elimination establishes a practical paradigm for fast, low-variance learning in extensive-form games with large or deep game trees. A plausible implication is improved scalability to domains previously limited by sampling noise and slow regret convergence. The only residual stochasticity arises from which infosets are selected for update at each iteration, and more sophisticated sampling schemes may further diminish this effect. Advances in large-scale function approximation for baseline storage and transfer suggest continued applicability to real-world imperfect-information games, contributing to faster and more robust equilibrium finding (Davis et al., 2019).
