DoubleRegretNet: Neural CFR Framework
- DoubleRegretNet is a double neural network approach that approximates cumulative regret and average strategies using RSN and ASN, respectively.
- It leverages mini-batch MCCFR, robust sampling, and MCCFR+ to approach tabular CFR+ performance with significantly fewer parameters.
- The framework achieves fast convergence and scalability in imperfect information games, effectively bridging classical CFR with modern deep learning.
DoubleRegretNet is a double neural network framework for counterfactual regret minimization in imperfect information games. It utilizes two distinct neural networks—RegretSumNetwork (RSN) and AvgStrategyNetwork (ASN)—to approximate the cumulative regret and average strategy profile, respectively, thereby overcoming the scalability limitations of classical tabular CFR and enabling continual improvement from non-optimal initializations. The architecture incorporates robust sampling, mini-batch Monte Carlo Counterfactual Regret Minimization (MCCFR), and MCCFR+, yielding near tabular CFR+ performance with two orders of magnitude fewer parameters (1812.10607).
1. Double Neural Architecture
DoubleRegretNet consists of two parameterized networks, each mapping information sets and legal actions to scalar outputs:
- RegretSumNetwork (RSN): Approximates the cumulative regret via a function $\mathcal{R}(I, a \mid \theta_R)$.
- AvgStrategyNetwork (ASN): Approximates the cumulative numerator of the average strategy via $\mathcal{S}(I, a \mid \theta_S)$.
Both networks embed a variable-length history prefix using a one-layer LSTM, followed by an attention mechanism over the LSTM outputs to yield a context vector, which then passes through a small feed-forward “value head” (ReLU to linear). The input encoding includes the private card (one-hot), public cards (one-hot, or a zero-vector if unrevealed), and the last action (one-hot, including a “cumulative spent” feature). Each network requires only on the order of $10^3$ parameters (1048 or 2608 in the reported configurations), enabling significant compression relative to tabular representations.
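As a concrete illustration of this input encoding, here is a minimal pure-Python sketch; the deck size, action names, and the scalar treatment of “cumulative spent” are hypothetical placeholders, not the paper's exact values:

```python
# Sketch of the per-step input encoding fed to the LSTM. Deck size,
# action set, and the scalar "cumulative spent" feature are illustrative
# assumptions, not the paper's exact configuration.
NUM_CARDS = 6                      # hypothetical deck size (Leduc-like)
ACTIONS = ["fold", "call", "raise"]

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def encode_step(private_card, public_card, last_action, cumulative_spent):
    """Encode one element of the history prefix."""
    feat = one_hot(private_card, NUM_CARDS)
    # Public card may be unrevealed: use a zero-vector in that case.
    feat += one_hot(public_card, NUM_CARDS) if public_card is not None else [0.0] * NUM_CARDS
    feat += one_hot(ACTIONS.index(last_action), len(ACTIONS))
    feat.append(float(cumulative_spent))    # scalar "cumulative spent"
    return feat

# A history prefix is a sequence of such vectors, one per action taken,
# which the LSTM consumes step by step.
prefix = [
    encode_step(2, None, "call", 1),
    encode_step(2, 4, "raise", 3),
]
```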
2. Mathematical Formalization
Standard CFR Definitions
Let $v_i^{\sigma}(I)$ denote the counterfactual value of player $i$ at information set $I$, $r_i^t(I,a) = v_i^{\sigma^t}(I,a) - v_i^{\sigma^t}(I)$ the instantaneous regret of action $a$, and $R_i^T(I,a)$ the cumulative regret after $T$ iterations:

$$R_i^T(I,a) = \sum_{t=1}^{T} r_i^t(I,a), \qquad S^T(I,a) = \sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\,\sigma^t(I,a),$$

where $S^T(I,a)$ is the cumulative numerator for the average strategy $\bar{\sigma}^T(I,a) = S^T(I,a) / \sum_{a'} S^T(I,a')$ and $\pi_i^{\sigma^t}(I)$ is player $i$'s reach probability of $I$ under $\sigma^t$.
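The regret-matching rule that maps cumulative regrets to the next strategy can be illustrated with a small tabular example (the regret values below are arbitrary):

```python
def regret_matching(cum_regrets):
    """Current strategy from cumulative regrets: sigma(a) is proportional
    to the positive part of R(I, a); uniform if no regret is positive."""
    positives = [max(r, 0.0) for r in cum_regrets]
    total = sum(positives)
    if total > 0:
        return [p / total for p in positives]
    return [1.0 / len(cum_regrets)] * len(cum_regrets)

# Three actions with cumulative regrets 3, -1, 1: play the two
# positive-regret actions in proportion 3:1.
print(regret_matching([3.0, -1.0, 1.0]))    # [0.75, 0.0, 0.25]
print(regret_matching([-2.0, -1.0, -5.0]))  # uniform fallback
```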
Neural Approximations
Regret and average strategy values are stored in network weights instead of large tables:
- $R_i^T(I,a) \approx \mathcal{R}(I,a \mid \theta_R)$ and $S^T(I,a) \approx \mathcal{S}(I,a \mid \theta_S)$, where $\theta_R$ and $\theta_S$ denote the RSN and ASN weights.
Learning proceeds by storing sampled regrets and strategy numerators in memory buffers $M_R$ and $M_S$, then minimizing the squared losses

$$L(\theta_R) = \sum_{(I,a,\tilde r) \in M_R} \big(\mathcal{R}(I,a \mid \theta_R) - \tilde r\big)^2, \qquad L(\theta_S) = \sum_{(I,a,\tilde s) \in M_S} \big(\mathcal{S}(I,a \mid \theta_S) - \tilde s\big)^2.$$
Regret matching is applied to the RSN output to obtain the current policy, and the normalized ASN output yields the average strategy.
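A toy version of the fitting step, with a linear model and plain gradient descent standing in for the RSN and Adam (the feature map, learning rate, and epoch count are illustrative, not the paper's settings):

```python
def fit_regret_model(buffer, dim, lr=0.1, epochs=200):
    """Minimize sum of (w . x - r)^2 over sampled (feature, regret) pairs
    by plain gradient descent; a stand-in for training the RSN with Adam."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, target in buffer:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - target
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Buffer of (one-hot info-set/action feature, sampled regret) pairs.
M_R = [([1.0, 0.0], 2.0), ([0.0, 1.0], -0.5)]
w = fit_regret_model(M_R, dim=2)
# With orthogonal features the regression recovers the targets exactly.
```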
Monte Carlo Sampling Modifications
Mini-batch MCCFR is employed, sampling $b$ independent trajectories per iteration rather than a single one. Robust sampling selects $\min(k, |A(I)|)$ actions uniformly at each info-set of the traversing player, while other players sample according to the current strategy. Special cases are external sampling ($k = \max_I |A(I)|$) and outcome sampling ($k = 1$). Mini-batch MCCFR+ applies a positive regret update, flooring the cumulative regret at zero: $R^{t+1}(I,a) = \big(R^{t}(I,a) + \tilde r^{t}(I,a)\big)^{+}$.
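A minimal sketch of robust-sampling action selection and the positive (CFR+-style) regret update, under the interpretation above:

```python
import random

def robust_sample(actions, k, rng=random):
    """Uniformly sample min(k, |A(I)|) distinct actions at an info set.
    k = 1 corresponds to outcome sampling; k >= |A(I)| to external sampling."""
    return rng.sample(actions, min(k, len(actions)))

def plus_update(cum_regret, sampled_regret):
    """Mini-batch MCCFR+ update: clip cumulative regret at zero."""
    return max(cum_regret + sampled_regret, 0.0)

acts = ["fold", "call", "raise"]
assert len(robust_sample(acts, 1)) == 1                  # outcome-sampling case
assert sorted(robust_sample(acts, 99)) == sorted(acts)   # external-sampling case
assert plus_update(0.5, -2.0) == 0.0                     # regret floored at zero
```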
3. Iterative Algorithm and Numerical Procedures
A DoubleRegretNet iteration proceeds as follows:
- Initialization: If warm-starting, the networks approximate tabular regret/strategy via batch regression; otherwise, parameters are random.
- Sampling: Mini-batch MCCFR is executed under the current strategy profile, collecting regret and strategy samples in memory buffers $M_R$, $M_S$.
- Optimization: Neural parameters are updated via squared-loss regression with Adam optimizer, learning rate scheduling, gradient clipping, and early stopping.
- Recursion: A single-player recursive MCCFR-NN routine handles trajectory sampling, regret/strategy computation, sample storage, and return value propagation.
- Termination: After $T$ iterations, the final networks encode the learned policies.
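The steps above can be summarized as an outer-loop skeleton; `sample_minibatch` and `fit` are placeholders for the MCCFR-NN recursion and the regression step, not the paper's code:

```python
def train(T, b, sample_minibatch, fit):
    """Skeleton of a DoubleRegretNet run: T iterations, b trajectories each.
    Buffers are refilled every iteration; network weights persist across
    iterations (enabling warm starts and continual improvement)."""
    theta_R, theta_S = {}, {}              # network weights (dicts as stand-ins)
    for t in range(T):
        M_R, M_S = [], []                  # per-iteration memory buffers
        for _ in range(b):
            sample_minibatch(theta_R, M_R, M_S)   # one MCCFR-NN trajectory
        theta_R = fit(theta_R, M_R)        # squared-loss regression (RSN)
        theta_S = fit(theta_S, M_S)        # squared-loss regression (ASN)
    return theta_R, theta_S

# Demo with trivial stand-ins: each trajectory adds one sample per buffer.
theta_R, theta_S = train(
    T=3, b=4,
    sample_minibatch=lambda th, MR, MS: (MR.append(0.0), MS.append(0.0)),
    fit=lambda th, M: {"num_samples": len(M)},
)
```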
4. Parameterization and Training Protocols
Hyperparameters adopted in experimental evaluations include:
- Adam optimizer with initial learning rate 1e-3, batch size 256.
- Scheduler halves the learning rate after 10/15 epochs without improvement.
- Minimum learning rate set to 1e-6.
- Early stopping when the loss falls below 1e-4 (RSN) or 1e-5 (ASN).
- Maximum epochs: 2000; gradient-norm clipping applied.
- LSTM hidden sizes yielding 1048 or 2608 total network parameters.
- Mini-batch MCCFR with $b$ trajectories sampled per iteration.
- Robust sampling parameter $k$ kept small in large games; $k = \max_I |A(I)|$ recovers full external sampling.
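The plateau-halving schedule with a learning-rate floor and a loss-threshold early stop can be sketched as follows; the patience value and halving factor are illustrative assumptions:

```python
def training_schedule(losses, lr=1e-3, min_lr=1e-6, patience=10,
                      stop_loss=1e-4, factor=0.5):
    """Replay a per-epoch loss history: halve lr after `patience` epochs
    without improvement (never below min_lr); stop once loss < stop_loss.
    Returns (epochs_run, final_lr)."""
    best, stale = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < stop_loss:
            return epoch, lr           # early-stopping threshold reached
        if loss < best:
            best, stale = loss, 0      # improvement: reset patience counter
        else:
            stale += 1
            if stale >= patience:
                lr = max(lr * factor, min_lr)   # halve, floored at min_lr
                stale = 0
    return len(losses), lr
```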
5. Empirical Performance and Compression
Experiments conducted on One-Card Poker (5 cards) and no-limit Leduc Hold'em (stacks 5, 10, and 15) use exploitability (the gap to a Nash equilibrium) as the primary metric. Principal findings:
- A range of mini-batch sizes $b$ yields nearly identical exploitability, so modest batches suffice.
- Robust sampling with small $k$ matches the efficacy of external sampling with greatly reduced memory consumption.
- Double neural MCCFR reaches low exploitability in far fewer iterations than XFP and NFSP, and matches tabular CFR+ exploitability by $1000$ iterations.
- Ablations show that using only RSN or only ASN slightly degrades performance but does not break parity with tabular methods.
- When warm-started from poor tabular strategies, DoubleRegretNet consistently reduces exploitability further.
- Generalization persists: with only 3–13% of information sets visited per iteration, exploitability falls below $0.1$ within $1000$ iterations.
- Compression: the parameter count (on the order of $10^3$) is orders of magnitude lower than the number of tabular entries required for comparable performance.
- Scalability: the method remains effective on the largest tested games despite sub-percent info-set visitation rates per iteration.
- Architectural choices: the LSTM-plus-attention encoder outperforms plain RNN or fully connected architectures in learning stability and final performance.
6. Contextual Impact and Significance
DoubleRegretNet combines the convergence guarantees of CFR with efficient generalization and memory compression via neural function approximation. It enables application to large imperfect information games that were previously out of reach for tabular CFR, and demonstrates significant improvement over deep RL-based self-play approaches (NFSP) in both convergence rate and exploitability. A plausible implication is that DoubleRegretNet enables continual refinement of strategies and model scaling in extensive-form games by leveraging the synergy between deep learning architectures and classical regret minimization principles (1812.10607).
7. Related Methodologies and Extensions
DoubleRegretNet’s robust sampling, mini-batch MCCFR, and MCCFR+ techniques are independently useful for scalable game-solving. The framework generalizes across outcome sampling, external sampling, and robust sampling regimes depending on the sampling parameter $k$. Its architecture—LSTM embedding with attention—offers stronger representational power for encoding variable-length game prefixes and is applicable to other large-scale imperfect-information game settings.
DoubleRegretNet bridges tabular CFR’s theoretical guarantees with the practical strengths of neural networks, contributing to state-of-the-art algorithms for imperfect information games and informing follow-on research in both neural game solvers and regret minimization paradigms.