DoubleRegretNet: Neural CFR Framework
- DoubleRegretNet is a double neural network approach that approximates cumulative regret and average strategies using RSN and ASN, respectively.
- It leverages mini-batch MCCFR, robust sampling, and MCCFR+ to approach tabular CFR+ performance with significantly fewer parameters.
- The framework achieves fast convergence and scalability in imperfect information games, effectively bridging classical CFR with modern deep learning.
DoubleRegretNet is a double neural network framework for counterfactual regret minimization in imperfect information games. It utilizes two distinct neural networks—RegretSumNetwork (RSN) and AvgStrategyNetwork (ASN)—to approximate the cumulative regret and average strategy profile, respectively, thereby overcoming the scalability limitations of classical tabular CFR and enabling continual improvement from non-optimal initializations. The architecture incorporates robust sampling, mini-batch Monte Carlo Counterfactual Regret Minimization (MCCFR), and MCCFR+, yielding near tabular CFR+ performance with two orders of magnitude fewer parameters (1812.10607).
1. Double Neural Architecture
DoubleRegretNet consists of two parameterized networks, each mapping information sets and legal actions to scalar outputs:
- RegretSumNetwork (RSN): Approximates the cumulative regret via a function $\mathcal{R}(I, a \mid \theta_R)$.
- AvgStrategyNetwork (ASN): Approximates the cumulative numerator of the average strategy via $\mathcal{S}(I, a \mid \theta_S)$.
Both networks embed a variable-length history prefix using a one-layer LSTM, followed by an attention mechanism over the LSTM outputs to yield a context vector, which then passes through a small feed-forward “value head” (ReLU to linear). The input encoding includes the private card (one-hot), public cards (one-hot, or a zero-vector if unrevealed), and the last action (one-hot, including a “cumulative spent” feature). Each network requires only on the order of $10^3$ parameters (1048 or 2608 in the reported configurations), enabling significant compression relative to tabular representations.
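As a concrete illustration of this input encoding, here is a minimal pure-Python sketch; the deck size, action names, and the scalar treatment of “cumulative spent” are hypothetical placeholders, not the paper's exact values:

```python
# Sketch of the per-step input encoding fed to the LSTM. Deck size,
# action set, and the scalar "cumulative spent" feature are illustrative
# assumptions, not the paper's exact configuration.
NUM_CARDS = 6                      # hypothetical deck size (Leduc-like)
ACTIONS = ["fold", "call", "raise"]

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def encode_step(private_card, public_card, last_action, cumulative_spent):
    """Encode one element of the history prefix."""
    feat = one_hot(private_card, NUM_CARDS)
    # Public card may be unrevealed: use a zero-vector in that case.
    feat += one_hot(public_card, NUM_CARDS) if public_card is not None else [0.0] * NUM_CARDS
    feat += one_hot(ACTIONS.index(last_action), len(ACTIONS))
    feat.append(float(cumulative_spent))    # scalar "cumulative spent"
    return feat

# A history prefix is a sequence of such vectors, one per action taken,
# which the LSTM consumes step by step.
prefix = [
    encode_step(2, None, "call", 1),
    encode_step(2, 4, "raise", 3),
]
```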
2. Mathematical Formalization
Standard CFR Definitions
Let $v_i^{\sigma}(I)$ denote the counterfactual value of player $i$ at information set $I$, $r_i^t(I,a) = v_i^{\sigma^t}(I,a) - v_i^{\sigma^t}(I)$ the instantaneous regret of action $a$, and $R_i^T(I,a)$ the cumulative regret after $T$ iterations:

$$R_i^T(I,a) = \sum_{t=1}^{T} r_i^t(I,a), \qquad S^T(I,a) = \sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\,\sigma^t(I,a),$$

where $S^T(I,a)$ is the cumulative numerator for the average strategy $\bar{\sigma}^T(I,a) = S^T(I,a) / \sum_{a'} S^T(I,a')$ and $\pi_i^{\sigma^t}(I)$ is player $i$'s reach probability of $I$ under $\sigma^t$.
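The regret-matching rule that maps cumulative regrets to the next strategy can be illustrated with a small tabular example (the regret values below are arbitrary):

```python
def regret_matching(cum_regrets):
    """Current strategy from cumulative regrets: sigma(a) is proportional
    to the positive part of R(I, a); uniform if no regret is positive."""
    positives = [max(r, 0.0) for r in cum_regrets]
    total = sum(positives)
    if total > 0:
        return [p / total for p in positives]
    return [1.0 / len(cum_regrets)] * len(cum_regrets)

# Three actions with cumulative regrets 3, -1, 1: play the two
# positive-regret actions in proportion 3:1.
print(regret_matching([3.0, -1.0, 1.0]))    # [0.75, 0.0, 0.25]
print(regret_matching([-2.0, -1.0, -5.0]))  # uniform fallback
```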
Neural Approximations
Regret and average strategy values are stored in network weights instead of large tables:
- $R_i^T(I,a) \approx \mathcal{R}(I,a \mid \theta_R)$ and $S^T(I,a) \approx \mathcal{S}(I,a \mid \theta_S)$, where $\theta_R$ and $\theta_S$ denote the RSN and ASN weights.
Learning proceeds by storing sampled regrets and strategy numerators in memory buffers $M_R$ and $M_S$, then minimizing the squared losses

$$L(\theta_R) = \sum_{(I,a,\tilde r) \in M_R} \big(\mathcal{R}(I,a \mid \theta_R) - \tilde r\big)^2, \qquad L(\theta_S) = \sum_{(I,a,\tilde s) \in M_S} \big(\mathcal{S}(I,a \mid \theta_S) - \tilde s\big)^2.$$
Regret matching is applied to the RSN output to obtain the current policy, and the normalized ASN output yields the average strategy.
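A toy version of the fitting step, with a linear model and plain gradient descent standing in for the RSN and Adam (the feature map, learning rate, and epoch count are illustrative, not the paper's settings):

```python
def fit_regret_model(buffer, dim, lr=0.1, epochs=200):
    """Minimize sum of (w . x - r)^2 over sampled (feature, regret) pairs
    by plain gradient descent; a stand-in for training the RSN with Adam."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, target in buffer:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - target
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Buffer of (one-hot info-set/action feature, sampled regret) pairs.
M_R = [([1.0, 0.0], 2.0), ([0.0, 1.0], -0.5)]
w = fit_regret_model(M_R, dim=2)
# With orthogonal features the regression recovers the targets exactly.
```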
Monte Carlo Sampling Modifications
Mini-batch MCCFR is employed, sampling $b$ independent trajectories per iteration rather than a single one. Robust sampling selects $\min(k, |A(I)|)$ actions uniformly at each info-set of the traversing player, while other players sample according to the current strategy. Special cases are external sampling ($k = \max_I |A(I)|$) and outcome sampling ($k = 1$). Mini-batch MCCFR+ applies a positive regret update, flooring the cumulative regret at zero: $R^{t+1}(I,a) = \big(R^{t}(I,a) + \tilde r^{t}(I,a)\big)^{+}$.
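A minimal sketch of robust-sampling action selection and the positive (CFR+-style) regret update, under the interpretation above:

```python
import random

def robust_sample(actions, k, rng=random):
    """Uniformly sample min(k, |A(I)|) distinct actions at an info set.
    k = 1 corresponds to outcome sampling; k >= |A(I)| to external sampling."""
    return rng.sample(actions, min(k, len(actions)))

def plus_update(cum_regret, sampled_regret):
    """Mini-batch MCCFR+ update: clip cumulative regret at zero."""
    return max(cum_regret + sampled_regret, 0.0)

acts = ["fold", "call", "raise"]
assert len(robust_sample(acts, 1)) == 1                  # outcome-sampling case
assert sorted(robust_sample(acts, 99)) == sorted(acts)   # external-sampling case
assert plus_update(0.5, -2.0) == 0.0                     # regret floored at zero
```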
3. Iterative Algorithm and Numerical Procedures
A DoubleRegretNet iteration proceeds as follows:
- Initialization: If warm-starting, the networks approximate tabular regret/strategy via batch regression; otherwise, parameters are random.
- Sampling: Mini-batch MCCFR is executed under the current strategy profile, collecting regret and strategy samples in memory buffers $M_R$, $M_S$.
- Optimization: Neural parameters are updated via squared-loss regression with Adam optimizer, learning rate scheduling, gradient clipping, and early stopping.
- Recursion: A single-player recursive MCCFR-NN routine handles trajectory sampling, regret/strategy computation, sample storage, and return value propagation.
- Termination: After $T$ iterations, the final networks encode the learned policies.
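The steps above can be summarized as an outer-loop skeleton; `sample_minibatch` and `fit` are placeholders for the MCCFR-NN recursion and the regression step, not the paper's code:

```python
def train(T, b, sample_minibatch, fit):
    """Skeleton of a DoubleRegretNet run: T iterations, b trajectories each.
    Buffers are refilled every iteration; network weights persist across
    iterations (enabling warm starts and continual improvement)."""
    theta_R, theta_S = {}, {}              # network weights (dicts as stand-ins)
    for t in range(T):
        M_R, M_S = [], []                  # per-iteration memory buffers
        for _ in range(b):
            sample_minibatch(theta_R, M_R, M_S)   # one MCCFR-NN trajectory
        theta_R = fit(theta_R, M_R)        # squared-loss regression (RSN)
        theta_S = fit(theta_S, M_S)        # squared-loss regression (ASN)
    return theta_R, theta_S

# Demo with trivial stand-ins: each trajectory adds one sample per buffer.
theta_R, theta_S = train(
    T=3, b=4,
    sample_minibatch=lambda th, MR, MS: (MR.append(0.0), MS.append(0.0)),
    fit=lambda th, M: {"num_samples": len(M)},
)
```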
4. Parameterization and Training Protocols
Hyperparameters adopted in experimental evaluations include:
- Adam optimizer with initial learning rate 1e-3, batch size 256.
- Scheduler halves the learning rate after 10/15 epochs without improvement.
- Minimum learning rate set to 1e-6.
- Early stopping when the loss falls below 1e-4 (RSN) or 1e-5 (ASN).
- Maximum epochs: 2000; gradient-norm clipping applied.
- LSTM hidden sizes yielding 1048 or 2608 total network parameters.
- Mini-batch MCCFR with $b$ trajectories sampled per iteration.
- Robust sampling parameter $k$ kept small in large games; $k = \max_I |A(I)|$ recovers full external sampling.
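The plateau-halving schedule with a learning-rate floor and a loss-threshold early stop can be sketched as follows; the patience value and halving factor are illustrative assumptions:

```python
def training_schedule(losses, lr=1e-3, min_lr=1e-6, patience=10,
                      stop_loss=1e-4, factor=0.5):
    """Replay a per-epoch loss history: halve lr after `patience` epochs
    without improvement (never below min_lr); stop once loss < stop_loss.
    Returns (epochs_run, final_lr)."""
    best, stale = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < stop_loss:
            return epoch, lr           # early-stopping threshold reached
        if loss < best:
            best, stale = loss, 0      # improvement: reset patience counter
        else:
            stale += 1
            if stale >= patience:
                lr = max(lr * factor, min_lr)   # halve, floored at min_lr
                stale = 0
    return len(losses), lr
```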
5. Empirical Performance and Compression
Experiments conducted on One-Card Poker (5 cards) and no-limit Leduc Hold'em (stacks 5, 10, and 15) use exploitability (the gap to a Nash equilibrium) as the primary metric. Principal findings:
- A range of mini-batch sizes $b$ yields nearly identical exploitability, so modest batches suffice.
- Robust sampling with small $k$ matches the efficacy of external sampling with greatly reduced memory consumption.
- Double neural MCCFR reaches low exploitability in far fewer iterations than XFP and NFSP, and matches tabular CFR+ exploitability by $1000$ iterations.
- Ablations show that using only RSN or only ASN slightly degrades performance but does not break parity with tabular methods.
- When warm-started from poor tabular strategies, DoubleRegretNet consistently reduces exploitability further.
- Generalization persists: with only 3–13% of information sets visited per iteration, exploitability falls below $0.1$ within $1000$ iterations.
- Compression: the parameter count (on the order of $10^3$) is orders of magnitude lower than the number of tabular entries required for comparable performance.
- Scalability: the method remains effective on the largest tested games despite sub-percent info-set visitation rates per iteration.
- Architectural choices: the LSTM-plus-attention encoder outperforms plain RNN or fully connected architectures in learning stability and final performance.
6. Contextual Impact and Significance
DoubleRegretNet combines the convergence guarantees of CFR with efficient generalization and memory compression via neural function approximation. It enables application to large imperfect information games that were previously out of reach for tabular CFR, and demonstrates significant improvement over deep RL-based self-play approaches (NFSP) in both convergence rate and exploitability. A plausible implication is that DoubleRegretNet enables continual refinement of strategies and model scaling in extensive-form games by leveraging the synergy between deep learning architectures and classical regret minimization principles (1812.10607).
7. Related Methodologies and Extensions
DoubleRegretNet’s robust sampling, mini-batch MCCFR, and MCCFR+ techniques are independently useful for scalable game-solving. The framework generalizes across outcome sampling, external sampling, and robust sampling regimes depending on the sampling parameter $k$. Its architecture—LSTM embedding with attention—offers stronger representational power for encoding variable-length game prefixes and is applicable to other large-scale imperfect-information game settings.
DoubleRegretNet bridges tabular CFR’s theoretical guarantees with the practical strengths of neural networks, contributing to state-of-the-art algorithms for imperfect information games and informing follow-on research in both neural game solvers and regret minimization paradigms.