Regret Matching⁺: Equilibrium and Convergence

Updated 27 November 2025
  • Regret Matching⁺ is a parameter-free online learning algorithm that uses only positive cumulative regrets to determine action selection in normal-form and extensive-form games.
  • Extensions like PRM⁺ and IREG-PRM⁺ incorporate predictive forecasts and scale-invariant techniques, accelerating convergence and achieving tighter regret bounds.
  • Smoothing, extragradient steps, and restarting strategies improve stability, making RM⁺ effective for robust equilibrium computation in large-scale, multi-role game scenarios.

Regret Matching⁺ (RM⁺) is a parameter-free, online learning algorithm for normal-form and extensive-form games that maintains and updates nonnegative cumulative regrets, using only their positive parts to determine the action selection distribution. RM⁺ and its predictive extensions form the core of state-of-the-art game-solving methods, particularly within counterfactual regret minimization (CFR) frameworks. Modern variants, such as predictive RM⁺ (PRM⁺) and scale-invariant extra-gradient versions (IREG-PRM⁺), address long-standing gaps between theoretical guarantees and practical convergence behavior for equilibrium computation in large-scale zero-sum and multi-role games.

1. Algorithmic Foundations of Regret Matching⁺

Regret Matching⁺ replaces the traditional regret-matching probability computation with a positive-truncation operator. At each round, for a player with action set $A$ and regret vector $r^{(t)} \in \mathbb{R}^{|A|}$, the RM⁺ next-step distribution is computed as

$$x_{a}^{(t+1)} = \begin{cases} \dfrac{[r_{a}^{(t)}]_{+}}{\sum_{b \in A} [r_{b}^{(t)}]_{+}} & \text{if } \sum_{b \in A} [r_{b}^{(t)}]_{+} > 0, \\[1ex] 1/|A| & \text{otherwise,} \end{cases}$$

where $[r_{a}]_{+} = \max\{r_{a}, 0\}$.

Cumulative regrets are updated by adding the instantaneous regret and truncating at zero:

$$r^{(t+1)} = \left[\, r^{(t)} + u^{(t+1)} - \langle u^{(t+1)}, x^{(t+1)} \rangle \mathbf{1} \,\right]_{+},$$

where $u^{(t+1)}$ is the action utility vector observed after playing $x^{(t+1)}$.
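For concreteness, the following is a minimal sketch of RM⁺ self-play in a two-player zero-sum matrix game, with both players running the update above simultaneously and averaging their strategies. The payoff matrix, iteration count, and simultaneous (rather than alternating) update order are illustrative choices, not taken from the cited papers.

```python
# Minimal sketch of simultaneous RM+ self-play in a two-player zero-sum matrix game.
import numpy as np

def rm_plus_strategy(regrets):
    """Normalize the positive part of the regrets; fall back to uniform."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full_like(pos, 1.0 / len(pos))

def rm_plus_update(regrets, utilities, strategy):
    """Add the instantaneous regret, then truncate at zero (the '+' step)."""
    instant = utilities - utilities @ strategy
    return np.maximum(regrets + instant, 0.0)

A = np.array([[2.0, -1.0], [-1.0, 1.0]])  # row player's payoffs (arbitrary example)
rx, ry = np.zeros(2), np.zeros(2)
avg_x, avg_y = np.zeros(2), np.zeros(2)
T = 100_000
for _ in range(T):
    x, y = rm_plus_strategy(rx), rm_plus_strategy(ry)
    rx = rm_plus_update(rx, A @ y, x)      # row player's action utilities
    ry = rm_plus_update(ry, -A.T @ x, y)   # column player's action utilities
    avg_x += x
    avg_y += y
avg_x /= T
avg_y /= T  # the average strategies approach a Nash equilibrium of A
```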

Predictive RM⁺ (PRM⁺) introduces a forecast $\hat r^{(t)}$ and uses this in the decision distribution, increasing convergence rates when system feedback is predictable (Farina et al., 2020). IREG-PRM⁺ further enforces scale-invariance by maintaining a nondecreasing $\ell_2$-norm of the regret vector, yielding optimal convergence properties, as formalized via RVU-type bounds (Zhang et al., 6 Oct 2025).
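As a hedged illustration of the predictive variant, the sketch below computes the PRM⁺ strategy from the positive part of the cumulative regrets plus a forecast. Re-using the previous round's instantaneous regret as the forecast is a common instantiation assumed here for concreteness; the precise predictor is a design choice.

```python
# Hedged sketch of a PRM+ round: the acting strategy is computed from the
# cumulative regrets plus a forecast of the next instantaneous regret.
import numpy as np

def prm_plus_strategy(regrets, forecast):
    """Normalize the positive part of (regrets + forecast); fall back to uniform."""
    pos = np.maximum(regrets + forecast, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full_like(pos, 1.0 / len(pos))

def prm_plus_update(regrets, utilities, strategy):
    """Truncated regret update; the returned instantaneous regret can be
    re-used as the next round's forecast (an illustrative choice)."""
    instant = utilities - utilities @ strategy
    return np.maximum(regrets + instant, 0.0), instant
```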

2. Connections to Online Mirror Descent and Optimistic Learning

RM⁺ is equivalent to an instance of online mirror descent (OMD) with a Euclidean potential and simplex projections, as shown in (Liu et al., 2021, Farina et al., 2020). The update,

$$z^{t} = \left[\, z^{t-1} + \langle \ell^{t}, x^{t} \rangle \mathbf{1} - \ell^{t} \,\right]_{+}, \qquad x^{t} = \frac{[z^{t-1}]_{+}}{\left\| [z^{t-1}]_{+} \right\|_{1}},$$

matches OMD on the nonnegative orthant, with the step size determined adaptively by the norm of the positive regret vector, thus implicitly tuning the algorithm’s reactivity to observed payoffs. This link underpins the $O(\sqrt{T})$ regret guarantees and motivates smooth and predictive modifications, as in PRM⁺ and IREG-PRM⁺.
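Concretely, the $z$-update above is a Euclidean projection step with unit step size: the Euclidean projection of any vector onto the nonnegative orthant is its coordinate-wise positive part, so the truncation and the mirror-descent projection coincide:

$$z^{t} = \Pi_{\mathbb{R}^{|A|}_{\ge 0}}\!\left( z^{t-1} + \langle \ell^{t}, x^{t} \rangle \mathbf{1} - \ell^{t} \right) = \left[\, z^{t-1} + \langle \ell^{t}, x^{t} \rangle \mathbf{1} - \ell^{t} \,\right]_{+}.$$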

IREG-PRM⁺ formalizes this adaptive learning rate by ensuring $\|r^{(t)}\|_2$ is nondecreasing, which acts analogously to an adaptive step size in optimistic gradient methods. This invariance is central to matching the $O(1/T)$ average-iterate rate characteristic of optimal first-order methods (mirror-prox, extragradient) (Zhang et al., 6 Oct 2025).
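One way to make the invariant concrete is to rescale the regret vector whenever a truncated update would shrink its $\ell_2$-norm. The sketch below only illustrates the stated invariant; the exact rule and analysis in IREG-PRM⁺ may differ (Zhang et al., 6 Oct 2025).

```python
# Illustrative-only enforcement of the nondecreasing l2-norm invariant after
# a truncated (RM+-style) regret update. This realizes the invariant stated
# above but is NOT claimed to be the exact IREG-PRM+ update rule.
import numpy as np

def norm_nondecreasing_update(regrets, instant_regret):
    prev_norm = np.linalg.norm(regrets)
    updated = np.maximum(regrets + instant_regret, 0.0)
    new_norm = np.linalg.norm(updated)
    if 0.0 < new_norm < prev_norm:
        updated *= prev_norm / new_norm  # rescale so the norm never shrinks
    return updated
```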

3. Stability, Smoothing, and Convergence Pathologies

Vanilla RM⁺ and PRM⁺ can suffer from instability, including oscillating last iterates and slow convergence in pathological zero-sum games with unique Nash equilibria. RM⁺’s regret operator lacks Lipschitzness and monotonicity, precluding application of standard variational inequality theories for last-iterate convergence (Cai et al., 2023, Farina et al., 2023). Numerical and theoretical evidence shows that:

  • RM⁺ and PRM⁺ can stagnate at an $O(1/\sqrt{T})$ last-iterate duality gap even in games with small action sets (e.g., $3 \times 3$ matrix games).
  • Alternating PRM⁺ sometimes empirically exhibits last-iterate convergence, but this behavior lacks theoretical justification.

To address these deficiencies, smoothing schemes such as extragradient RM⁺ (ExRM⁺) and smooth predictive RM⁺ (SPRM⁺) introduce predictive or extragradient steps with explicit step sizes and domain projections, recovering asymptotic last-iterate convergence. Restarting approaches and projection-domain modifications (“chopping off” the orthant near the origin) can guarantee stability and drive individual and social regret to $O(T^{1/4})$ and $O(1)$, respectively (Cai et al., 2023, Farina et al., 2023).
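To make the extragradient idea concrete, the sketch below applies a standard extragradient step to the RM⁺ regret operator in a two-player zero-sum matrix game: a half step using the operator at the current point, then a full step using the operator evaluated at the half-step point. The step size and the plain nonnegative-orthant projection are illustrative simplifications of the cited ExRM⁺ construction, not its exact constants or domain.

```python
# Hedged sketch of an extragradient-style step applied to the RM+ regret
# operator in a two-player zero-sum matrix game.
import numpy as np

def normalize(z):
    s = z.sum()
    return z / s if s > 0 else np.full_like(z, 1.0 / len(z))

def regret_operator(zx, zy, A):
    """Negative instantaneous regrets of both players at the induced strategies."""
    x, y = normalize(zx), normalize(zy)
    ux, uy = A @ y, -A.T @ x
    return (ux @ x) * np.ones_like(zx) - ux, (uy @ y) * np.ones_like(zy) - uy

def exrm_plus_step(zx, zy, A, eta=0.1):
    fx, fy = regret_operator(zx, zy, A)
    hx = np.maximum(zx - eta * fx, 0.0)  # half step with the current operator
    hy = np.maximum(zy - eta * fy, 0.0)
    gx, gy = regret_operator(hx, hy, A)
    return np.maximum(zx - eta * gx, 0.0), np.maximum(zy - eta * gy, 0.0)  # full step
```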

4. Extensions: Predictive, Extra-Gradient, and Scale-Invariant Methods

Table: Summary of Major RM⁺ Variants and Their Regret/Convergence Properties

| Algorithm | Key Innovation | Regret / Nash Gap |
| --- | --- | --- |
| RM⁺ | Positive truncation | $O(1/\sqrt{T})$ |
| PRM⁺ | Optimistic (predicted) update | $O\!\left(\sqrt{\sum_t \lVert \ell^t - m^t \rVert_2^2}\right)$ |
| ExRM⁺ / SPRM⁺ | Extragradient / smooth projection | Linear last-iterate with restarts; $O(1/\sqrt{T})$ best-iterate |
| IREG-PRM⁺ | Norm nondecrease, scale-invariance | $O(1/T)$ average, $O(1/\sqrt{T})$ last-iterate |

IREG-PRM⁺ (Zhang et al., 6 Oct 2025) guarantees:

  • $\|\tilde{r}^{(t)}\|_2$ is nondecreasing at each iteration, acting as a step-size controller.
  • RVU-type regret bounds decouple the learning rate from hyperparameter tuning, facilitating an $O(1/T)$ average Nash gap without sacrificing parameter-freeness or computational simplicity.
  • Empirical performance matching or exceeding predictive CFR⁺ (PCFR) on games such as Kuhn poker, Leduc hold’em, Battleship, and Goofspiel.

Smoothing-based approaches (ExRM⁺, SPRM⁺) require only modest algorithmic modification—extragradient steps, prediction, or restarting—yet yield provable last-iterate convergence and practical stability, as formalized in (Cai et al., 2023).

5. Applications: Large-Scale Self-Play and Role-Balanced Training

RM⁺ is prominent in large-scale self-play training for generalist game AI, especially in multi-role environments. In such contexts, a generalized model trained via naive uniform self-play often displays uneven proficiency across roles. Adapting RM⁺ to operate on the $N \times N$ matrix of role pairs (for $N$ roles), as in “Balancing the AI Strength of Roles in Self-Play Training with Regret Matching⁺” (Wang, 23 Jan 2024), enables automatic allocation of training focus:

  • Positive "regret" matrices for each role-pair drive adaptive changes in the data-sampling distribution.
  • An exponentially-smoothed win-rate matrix with mixing floor ensures persistent coverage of all pairs and prevents mode collapse.
  • Empirical results (e.g., in a 13-character fighting game) show a ≈43% reduction in win-rate variance across roles when using RM⁺, yielding a more uniformly strong policy compared to baseline self-play.
  • Complexity per update is $O(N^2)$; grouping or subsampling keeps this practical for large $N$.

Modifications such as the smoothing factor (γ) and the uniform mixing floor (η) enhance robustness and balance while retaining sublinear regret guarantees for the sampler.
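A hedged sketch of such a sampler is shown below: it maintains the exponentially smoothed win-rate matrix, accumulates positive per-pair regrets, and mixes the resulting regret-matching distribution with a uniform floor. The specific regret signal (shortfall of the smoothed win rate below a balanced 0.5), the default γ and η values, and the class interface are illustrative assumptions, not the exact scheme of (Wang, 23 Jan 2024).

```python
# Hedged sketch of an RM+-style sampler over role pairs for self-play data selection.
import numpy as np

class RolePairSampler:
    def __init__(self, n_roles, gamma=0.9, eta=0.05):
        self.win_rate = np.full((n_roles, n_roles), 0.5)  # smoothed win-rate matrix
        self.regret = np.zeros((n_roles, n_roles))        # positive per-pair "regret"
        self.gamma, self.eta = gamma, eta

    def record(self, i, j, win):
        """Record the outcome (win in [0, 1]) of a game between roles i and j."""
        self.win_rate[i, j] = self.gamma * self.win_rate[i, j] + (1 - self.gamma) * win
        # Treat the shortfall below a balanced 50% win rate as this pair's regret.
        self.regret[i, j] = max(self.regret[i, j] + (0.5 - self.win_rate[i, j]), 0.0)

    def sampling_distribution(self):
        """Regret-matching distribution over role pairs, mixed with a uniform floor."""
        pos = np.maximum(self.regret, 0.0)
        total = pos.sum()
        uniform = np.full_like(pos, 1.0 / pos.size)
        base = pos / total if total > 0 else uniform
        return (1 - self.eta) * base + self.eta * uniform
```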

6. Theoretical Properties and Limitations

RM⁺, PRM⁺, and IREG-PRM⁺ have been extensively analyzed:

  • Standard RM⁺ (and CFR⁺) provably achieve worst-case $O(\sqrt{T})$ regret and an $O(1/\sqrt{T})$ Nash gap (Farina et al., 2020, Liu et al., 2021).
  • Predictive (optimistic) RM⁺ attains near-constant regret and accelerated equilibrium convergence when loss sequences are predictable (Farina et al., 2020).
  • IREG-PRM⁺ achieves the theoretically optimal $O(1/T)$ average and $O(1/\sqrt{T})$ best-iterate duality gap, closing the gap to classic extragradient algorithms (Zhang et al., 6 Oct 2025).
  • Smoothing and restarting are necessary to achieve linear last-iterate convergence in general; without such mechanisms, raw RM⁺ dynamics may stagnate or oscillate (Cai et al., 2023, Farina et al., 2023).
  • Stability fixes (restart, orthant-chopping) guarantee improved convergence for both individual and social regret, as well as robustness in extensive-form games (Farina et al., 2023).

7. Empirical Performance and Practical Guidance

Empirical benchmarks demonstrate that:

  • Predictive RM⁺ (PCFR) outperforms all prior non-predictive CFR variants, often by orders of magnitude in Nash gap on non-poker zero-sum benchmarks (Farina et al., 2020).
  • IREG-PRM⁺ is a drop-in replacement with no new hyperparameters and matches or improves upon PRM⁺, DCFR, and adaptive OGD variants across matrix and extensive-form games (Zhang et al., 6 Oct 2025).
  • Smoothing and restarts do not harm performance and may accelerate convergence in practice (Cai et al., 2023).
  • For symmetric, multi-role environments, regret-matching-based role-balancing achieves uniform competence while avoiding catastrophic neglect of under-trained roles (Wang, 23 Jan 2024).
  • Parameter-freeness and scale-invariance facilitate deployment in practical, large-scale scenarios with minimal tuning.

A plausible implication is that the underlying principle—adaptive, scale-invariant control of the dual variable norm—may inform future algorithmic advances for fast, robust equilibrium computation in high-dimensional, structured games.
