DCFR: Discounted Counterfactual Regret Minimization

Updated 14 April 2026

DCFR is a family of algorithms that applies temporal discounting to both regret and strategy averaging to accelerate equilibrium convergence in extensive-form games.
It modulates the influence of historical regrets by selectively discounting positive and negative values, rapidly diminishing the impact of past mistakes.
Empirical results indicate that using parameters (1.5, 0, 2) can achieve 2-3× faster convergence than CFR+ in high-variance settings such as poker.

Discounted Counterfactual Regret Minimization (DCFR) is a family of algorithms designed to accelerate convergence in extensive-form imperfect-information games by introducing temporal discounting into both the regret and the strategy-averaging computations. DCFR directly generalizes Counterfactual Regret Minimization (CFR) and its variant CFR+, which are foundational for equilibrium computation in large extensive-form games such as poker. By modulating the influence of past iterations through discounting factors, DCFR lets large historical mistakes decay rapidly and attenuates early strategy contributions, which improves empirical performance over its predecessors across a range of benchmarks, particularly in high-variance game domains (Brown et al., 2018, Zhang et al., 2024, Xu et al., 2024, Xu et al., 11 Nov 2025).

1. Mathematical Formulation and Algorithmic Structure

DCFR parameterizes three discount exponents: $\alpha$ for positive regrets, $\beta$ for negative regrets, and $\gamma$ for strategy averaging. Let $R^t_i(I, a)$ denote the (possibly signed) cumulative regret for player $i$ at information set $I$ and action $a$ at iteration $t$. Let $r^t_i(I, a)$ be the instantaneous counterfactual regret, and $C^t_i(I, a)$ the cumulative strategy weight. With $R^0_i(I, a) = 0$ and $C^0_i(I, a) = 0$, the DCFR update equations are

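$$R^t_i(I, a) = \begin{cases} R^{t-1}_i(I, a)\,\dfrac{(t-1)^\alpha}{(t-1)^\alpha + 1} + r^t_i(I, a), & \text{if } R^{t-1}_i(I, a) > 0, \\[2ex] R^{t-1}_i(I, a)\,\dfrac{(t-1)^\beta}{(t-1)^\beta + 1} + r^t_i(I, a), & \text{otherwise,} \end{cases}$$

$$C^t_i(I, a) = C^{t-1}_i(I, a)\left(\frac{t-1}{t}\right)^{\gamma} + \pi^{\sigma^t}_i(I)\,\sigma^t_i(I, a).$$

Here, $\pi^{\sigma^t}_i(I)$ denotes the reach probability of $I$ for player $i$ at iteration $t$. Regret-matching is performed on the nonnegative part $[R^t_i(I, a)]^+ = \max(R^t_i(I, a), 0)$, and the next strategy $\sigma^{t+1}_i(I, \cdot)$ is set proportionally to it. Averaged strategies for output are weighted according to $C^t_i(I, a)$ (Brown et al., 2018, Zhang et al., 2024).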
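The regret update and the regret-matching step can be sketched in a few lines of Python. This is a minimal illustration for a single information set, not a reference implementation; the function names `discount`, `dcfr_regret_update`, and `regret_matching` are hypothetical, and the default exponents are the canonical $(1.5, 0, 2)$ values quoted in this article.

```python
def discount(t, exponent):
    """Factor t^e / (t^e + 1) applied to regrets accumulated through iteration t."""
    w = float(t) ** exponent
    return w / (w + 1.0)


def dcfr_regret_update(prev_regrets, inst_regrets, t, alpha=1.5, beta=0.0):
    """Return R^t given R^{t-1} and the instantaneous regrets r^t (t >= 1)."""
    updated = []
    for R, r in zip(prev_regrets, inst_regrets):
        if t > 1:
            # Positive regrets use alpha, all others use beta.
            R *= discount(t - 1, alpha if R > 0 else beta)
        # At t = 1 nothing has been accumulated yet (R^0 = 0), so no discount applies.
        updated.append(R + r)
    return updated


def regret_matching(regrets):
    """Play in proportion to positive regrets; fall back to uniform if none are positive."""
    positive = [max(R, 0.0) for R in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


if __name__ == "__main__":
    # Three actions at iteration t = 5: one positive, one negative, one zero regret.
    new_regrets = dcfr_regret_update([4.0, -2.0, 0.0], [1.0, 1.0, -1.0], t=5)
    print(new_regrets, regret_matching(new_regrets))
```

In the usage example the positive regret is scaled by $4^{1.5}/(4^{1.5}+1) = 8/9$ before the new regret is added, while the negative regret is halved because $\beta = 0$.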

A summary of key DCFR hyperparameters is provided below:

Symbol	Control target	Canonical value (Brown et al., 2018)
$\alpha$	Positive-regret discount	$1.5$
$\beta$	Negative-regret discount	$0$
$\gamma$	Averaging discount	$2$

The canonical parameter set for poker-like settings is $(\alpha, \beta, \gamma) = (1.5, 0, 2)$ (Brown et al., 2018).

2. Discounting Scheme and Regret-Weight Interaction

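Unlike CFR, which uniformly accumulates all historical regrets, DCFR modulates the persistence of old updates. After iteration $t$, positive cumulative regrets are multiplied by $t^\alpha/(t^\alpha + 1)$, a factor that approaches 1 as $t$ grows, so discounting is strongest in early iterations. Negative regrets are governed separately by $\beta$: a large $\beta$ decays them slowly and, in the limit $\beta \to \infty$, leaves them undiscounted, whereas the canonical $\beta = 0$ gives a constant factor of $1/2$ and halves them every iteration. This selective decay rapidly suppresses the influence of dominated or high-mistake actions, as old outlier regrets no longer dominate subsequent iterates (Xu et al., 2024, Brown et al., 2018).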

Strategy averaging is performed with weights decayed by $\left(\frac{t-1}{t}\right)^{\gamma}$ per iteration, so that the effective average after $T$ iterations is

$$\bar{\sigma}^T_i(I, a) = \frac{\sum_{t=1}^{T} t^{\gamma}\, \pi^{\sigma^t}_i(I)\, \sigma^t_i(I, a)}{\sum_{t=1}^{T} t^{\gamma}\, \pi^{\sigma^t}_i(I)}.$$

This polynomial weighting strongly emphasizes later iterations when $\gamma$ is large, in contrast to CFR's uniform averaging or CFR+'s linear weighting.
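A short numerical check can make the link between the per-iteration decay and the polynomial weights concrete. The sketch below, with hypothetical function names and made-up contribution values, accumulates a sequence with the recursive decay $((t-1)/t)^\gamma$ and compares it to the closed form $\sum_t (t/T)^\gamma x_t$; the two agree up to floating-point error, and the common factor $T^{-\gamma}$ cancels when the average is normalized.

```python
def accumulate_discounted(contributions, gamma):
    """Recursive form: C^t = C^{t-1} * ((t-1)/t)^gamma + x_t, with C^0 = 0."""
    total = 0.0
    for t, x in enumerate(contributions, start=1):
        total = total * ((t - 1) / t) ** gamma + x
    return total


def accumulate_closed_form(contributions, gamma):
    """Closed form: sum over t of (t/T)^gamma * x_t."""
    T = len(contributions)
    return sum((t / T) ** gamma * x for t, x in enumerate(contributions, start=1))


def normalized_weights(T, gamma):
    """Relative weight of each iteration in the average: t^gamma / sum_s s^gamma."""
    raw = [float(t) ** gamma for t in range(1, T + 1)]
    norm = sum(raw)
    return [w / norm for w in raw]


if __name__ == "__main__":
    xs = [0.2, 0.5, 0.1, 0.9]  # illustrative per-iteration contributions
    print(accumulate_discounted(xs, 2.0), accumulate_closed_form(xs, 2.0))
    print(normalized_weights(4, 2.0))  # proportional to 1, 4, 9, 16
```

With $\gamma = 2$ and $T = 4$ the last iteration carries $16/30$ of the total weight, compared with $1/4$ under uniform averaging.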

3. Theoretical Guarantees and Convergence Analysis

DCFR inherits the $O(1/\sqrt{T})$ convergence rate for $\epsilon$-Nash equilibrium in two-player zero-sum extensive-form games. Formally, if $\Delta$ is the maximal payoff range, $|\mathcal{I}|$ is the number of information sets, and $|A|$ is the maximal number of actions per infoset, then after $T$ iterations the weighted average strategy profile is an $\epsilon$-Nash equilibrium with (Brown et al., 2018, Zhang et al., 2024):

$$\epsilon = O\!\left(\frac{\Delta\,|\mathcal{I}|\,\sqrt{|A|}}{\sqrt{T}}\right).$$

Empirically, DCFR reaches substantially lower exploitability than CFR or CFR+ for the same number of iterations when applied to large-scale, high-variance benchmarks such as Hold'em subgames. The hyperparameterization $(\alpha, \beta, \gamma) = (1.5, 0, 2)$ delivers 2--3$\times$ faster convergence than CFR+ in these settings (Brown et al., 2018, Zhang et al., 2024, Xu et al., 2024).

4. Algorithmic Implementation and Pseudocode

A representative pseudocode outline for DCFR (Brown et al., 2018, Xu et al., 2024):

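Input: game, number of iterations $T$, exponents $(\alpha, \beta, \gamma)$
Initialize $R^0_i(I, a) = 0$ and $C^0_i(I, a) = 0$ for all $i$, $I$, $a$
For $t = 1, \dots, T$:
    For each player $i$ (alternating updates):
        For each information set $I$ of player $i$ reached while traversing the game tree:
            Set $\sigma^t_i(I, \cdot)$ by regret matching on $[R^{t-1}_i(I, \cdot)]^+$ (uniform if no entry is positive)
            Compute the instantaneous counterfactual regrets $r^t_i(I, a)$
            Multiply $R^{t-1}_i(I, a)$ by $\frac{(t-1)^\alpha}{(t-1)^\alpha + 1}$ if it is positive and by $\frac{(t-1)^\beta}{(t-1)^\beta + 1}$ otherwise, then add $r^t_i(I, a)$ to obtain $R^t_i(I, a)$
            Set $C^t_i(I, a) = C^{t-1}_i(I, a) \left(\frac{t-1}{t}\right)^{\gamma} + \pi^{\sigma^t}_i(I)\, \sigma^t_i(I, a)$
Return the average strategy $\bar{\sigma}^T_i(I, a) = C^T_i(I, a) / \sum_{a'} C^T_i(I, a')$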

Key implementation notes include in-place discounting to avoid numerical underflow, and the use of alternating updates for large-game efficiency (Brown et al., 2018).
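As a hedged illustration of the full loop, the sketch below runs DCFR on a zero-sum matrix game, the degenerate case with one information set per player and reach probability 1. It makes several simplifying assumptions that are not taken from the cited papers: it uses simultaneous rather than alternating updates, the example payoff matrix `BIASED_RPS` (rock-paper-scissors in which rock beating scissors pays double) is invented for illustration, and all function names are hypothetical. `nash_gap` reports the sum of both players' best-response gains.

```python
def discount(t, exponent):
    """Factor t^e / (t^e + 1) applied to regrets accumulated through iteration t."""
    w = float(t) ** exponent
    return w / (w + 1.0)


def regret_matching(regrets):
    """Play in proportion to positive regrets; uniform if none are positive."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def dcfr_matrix_game(payoff, iterations, alpha=1.5, beta=0.0, gamma=2.0):
    """Simultaneous-update DCFR sketch; payoff[a][b] is the row player's payoff."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    regrets = [[0.0] * n_rows, [0.0] * n_cols]
    cum_strategy = [[0.0] * n_rows, [0.0] * n_cols]
    for t in range(1, iterations + 1):
        row = regret_matching(regrets[0])
        col = regret_matching(regrets[1])
        # Value of each pure action against the opponent's current strategy.
        row_values = [sum(payoff[a][b] * col[b] for b in range(n_cols))
                      for a in range(n_rows)]
        col_values = [-sum(payoff[a][b] * row[a] for a in range(n_rows))
                      for b in range(n_cols)]
        for p, strategy, values in ((0, row, row_values), (1, col, col_values)):
            expected = sum(s * v for s, v in zip(strategy, values))
            for a, value in enumerate(values):
                R = regrets[p][a]
                if t > 1:  # R^0 = 0, so there is nothing to discount at t = 1
                    R *= discount(t - 1, alpha if R > 0 else beta)
                regrets[p][a] = R + (value - expected)
                # Reach probability is 1 in a one-shot game.
                cum_strategy[p][a] = (cum_strategy[p][a] * ((t - 1) / t) ** gamma
                                      + strategy[a])
    return [[c / sum(cs) for c in cs] for cs in cum_strategy]


def nash_gap(payoff, row, col):
    """Sum of both players' best-response gains (zero exactly at equilibrium)."""
    best_row = max(sum(payoff[a][b] * col[b] for b in range(len(col)))
                   for a in range(len(row)))
    best_col = min(sum(payoff[a][b] * row[a] for a in range(len(row)))
                   for b in range(len(col)))
    return best_row - best_col


# Hypothetical example game: rows and columns are rock, paper, scissors.
BIASED_RPS = [[0, -1, 2],
              [1, 0, -1],
              [-2, 1, 0]]

if __name__ == "__main__":
    avg_row, avg_col = dcfr_matrix_game(BIASED_RPS, 5000)
    print(avg_row, avg_col, nash_gap(BIASED_RPS, avg_row, avg_col))
```

For this payoff matrix the unique equilibrium mixes rock, paper, and scissors with probabilities $(1/4, 1/2, 1/4)$, so the printed averages can be compared against that target. An extensive-form implementation would replace the two value computations with a game-tree traversal and weight the strategy accumulation by the player's reach probability.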

5. Comparison to CFR, CFR+, and PCFR+; Advanced Extensions

DCFR generalizes CFR (no discounting: $\alpha = \beta = \infty$, $\gamma = 0$) and CFR+ (linear averaging, regret clipping). CFR+ clips regrets at zero and weights iteration $t$ by $t$ in the average, which corresponds to the limit $\alpha = \infty$, $\beta = -\infty$, $\gamma = 1$. DCFR introduces flexible polynomial decay and can be combined with regret clipping for variants referred to as DCFR+ (Xu et al., 2024, Xu et al., 11 Nov 2025).
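The limiting cases can be checked directly on the discount factor $t^e/(t^e+1)$. The sketch below is illustrative only: the `VARIANTS` table and function names are hypothetical, the limits are taken for $t \ge 2$ (at $t = 1$ the factor is $1/2$ for every finite exponent), and the infinite exponents are handled explicitly because evaluating the formula with floating-point infinity would produce NaN.

```python
import math


def discount_factor(t, exponent):
    """t^e / (t^e + 1) for t >= 2, with the limits e -> +inf and e -> -inf."""
    if exponent == math.inf:
        return 1.0  # regrets are carried over unchanged
    if exponent == -math.inf:
        return 0.0  # regrets are reset to zero
    w = float(t) ** exponent
    return w / (w + 1.0)


def averaging_weight(t, gamma):
    """Unnormalized weight t^gamma of iteration t in the average strategy."""
    return float(t) ** gamma


# (alpha, beta, gamma) settings discussed in the text.
VARIANTS = {
    "CFR": (math.inf, math.inf, 0.0),    # no discounting, uniform averaging
    "CFR+": (math.inf, -math.inf, 1.0),  # negative regrets zeroed, linear averaging
    "DCFR": (1.5, 0.0, 2.0),             # canonical discounted setting
}

if __name__ == "__main__":
    t = 10
    for name, (alpha, beta, gamma) in VARIANTS.items():
        print(name,
              discount_factor(t, alpha),
              discount_factor(t, beta),
              averaging_weight(t, gamma))
```

At $t = 10$ the canonical DCFR setting keeps most of a positive regret, halves a negative one, and gives the iteration a weight of 100 in the average, between the no-discount and full-reset extremes.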

Further, PDCFR+ integrates DCFR-style discounting with the predictive/optimistic step of PCFR+, performing a predictive update on top of the discounted regrets before strategy computation. This hybrid yields state-of-the-art convergence, particularly in the presence of highly uneven losses typical in large poker domains (Xu et al., 2024). In neural CFR, discounted and clipped cumulative advantages are bootstrapped in value networks to match the DCFR+ mechanism (Xu et al., 11 Nov 2025).

6. Empirical Performance, Hyperparameters, and Practical Guidelines

In poker-structured games and high-mistake normal-form games, DCFR and DCFR+ offer a marked empirical speedup over both CFR+ and PCFR+, especially in early- to mid-stage learning. Dominated actions, if initially selected, have their adverse contribution rapidly suppressed. In small non-poker games, DCFR and PCFR+ are competitive, but the predictive-discounted hybrid (PDCFR+) ultimately yields the best performance (Zhang et al., 2024, Xu et al., 2024).

Recommended parameters for general use are $(\alpha, \beta, \gamma) = (1.5, 0, 2)$. When pruning is desired, $\beta$ should be increased to ensure large negative regrets for suboptimal actions (e.g., $\beta = 0.5$). DCFR is also compatible with sampling/MCCFR methods, as periodic discounting at node-touched intervals preserves its acceleration properties (Brown et al., 2018).
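The effect of $\beta$ on pruning can be seen in a toy calculation. The sketch below follows the cumulative regret of a single action whose instantaneous regret is $-1$ on every iteration, an artificial assumption chosen only for illustration; the function name is hypothetical and no actual pruning rule is implemented.

```python
def negative_regret_after(beta, iterations, inst_regret=-1.0):
    """Cumulative regret of an action that earns inst_regret on every iteration."""
    R = 0.0
    for t in range(1, iterations + 1):
        if t > 1:
            w = float(t - 1) ** beta
            R *= w / (w + 1.0)  # DCFR discount for non-positive regrets
        R += inst_regret
    return R


if __name__ == "__main__":
    for beta in (0.0, 0.5):
        print(beta, negative_regret_after(beta, 10_000))
```

With $\beta = 0$ the regret is halved before each update and therefore saturates near $-2$, so a consistently bad action never accumulates a large negative regret. With $\beta = 0.5$ the discount factor $\sqrt{t}/(\sqrt{t}+1)$ approaches 1, and the regret keeps growing in magnitude, roughly like $\sqrt{t}$, which is what a regret-based pruning threshold needs.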

Recent developments, such as hyperparameter schedules (HS-DCFR), further accelerate game-solving by dynamically adapting the discounting rates, for example by starting with a large discounting exponent and decreasing it linearly over the iterations. This approach establishes a new state of the art in practical convergence speed, outperforming both fixed-parameter DCFR and RL-based dynamic DCFR by orders of magnitude on standard benchmarks (Zhang et al., 2024).

7. Extensions and Neural Implementations

Recent research has demonstrated that DCFR-style discounting can be efficiently integrated with neural policy/value approximation. At each update, the neural models are trained to reproduce DCFR's bootstrapped, clipped, and discounted advantage updates, and the cumulative strategy network uses polynomially discounted averaging that down-weights early iterations, as in the tabular version. In large-scale benchmarks, neural DCFR implementations exhibit improved convergence and adversarial robustness relative to vanilla neural CFR methods (Xu et al., 11 Nov 2025).

Integration with predictive updates, as in deep PDCFR+, combines both optimism and discounting, further improving stability and speed in settings with high variance and unbalanced payoff landscapes.

For comprehensive formal definitions, convergence theorems, and full algorithmic details see the original works: "Solving Imperfect-Information Games via Discounted Regret Minimization" (Brown et al., 2018), "Faster Game Solving via Hyperparameter Schedules" (Zhang et al., 2024), "Minimizing Weighted Counterfactual Regret with Optimistic Online Mirror Descent" (Xu et al., 2024), and "Deep (Predictive) Discounted Counterfactual Regret Minimization" (Xu et al., 11 Nov 2025).