Regret Matching⁺: Equilibrium and Convergence
- Regret Matching⁺ is a parameter-free online learning algorithm that uses only positive cumulative regrets to determine action selection in normal-form and extensive-form games.
- Extensions like PRM⁺ and IREG-PRM⁺ incorporate predictive forecasts and scale-invariant techniques, accelerating convergence and achieving tighter regret bounds.
- Smoothing, extragradient steps, and restarting strategies improve stability, making RM⁺ effective for robust equilibrium computation in large-scale, multi-role game scenarios.
Regret Matching⁺ (RM⁺) is a parameter-free, online learning algorithm for normal-form and extensive-form games that maintains and updates nonnegative cumulative regrets, using only their positive parts to determine the action selection distribution. RM⁺ and its predictive extensions form the core of state-of-the-art game-solving methods, particularly within counterfactual regret minimization (CFR) frameworks. Modern variants, such as predictive RM⁺ (PRM⁺) and scale-invariant extra-gradient versions (IREG-PRM⁺), address long-standing gaps between theoretical guarantees and practical convergence behavior for equilibrium computation in large-scale zero-sum and multi-role games.
1. Algorithmic Foundations of Regret Matching⁺
Regret Matching⁺ replaces the traditional regret-matching probability computation with a positive-truncation operator. At each round $t$, for a player with action set $A$ and cumulative regret vector $R^t \in \mathbb{R}^{|A|}$, the RM⁺ next-step distribution is computed as $x^{t+1} = [R^t]^+ / \lVert [R^t]^+ \rVert_1$, where $[z]^+ = \max(z, 0)$ componentwise and the uniform distribution is played when $[R^t]^+ = 0$.
Cumulative regrets are updated via $R^{t+1} = \big[R^t + u^t - \langle x^t, u^t\rangle \mathbf{1}\big]^+$, where $u^t$ is the action utility vector observed at round $t$.
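A minimal NumPy sketch of these two updates (the function and variable names are illustrative, not taken from the cited papers):

```python
import numpy as np

def rm_plus_strategy(regrets):
    """Map a cumulative regret vector to a strategy via positive truncation."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No action has positive regret: fall back to the uniform distribution.
    return np.full_like(positive, 1.0 / len(positive))

def rm_plus_update(regrets, utility, strategy):
    """RM+ regret update: add instantaneous regrets, then truncate at zero."""
    instantaneous = utility - strategy @ utility  # u^t(a) - <x^t, u^t> for each action a
    return np.maximum(regrets + instantaneous, 0.0)
```

The truncation in `rm_plus_update` is the same componentwise clip that Section 2 interprets as a Euclidean projection onto the nonnegative orthant.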
Predictive RM⁺ (PRM⁺) introduces a forecast $m^{t+1}$ of the next instantaneous regret and acts on $[R^t + m^{t+1}]^+$ rather than $[R^t]^+$, accelerating convergence when system feedback is predictable (Farina et al., 2020). IREG-PRM⁺ further enforces scale-invariance by maintaining a nondecreasing $\ell_2$-norm of the regret vector, yielding optimal convergence properties, as formalized via RVU-type bounds (Zhang et al., 6 Oct 2025).
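Reusing `rm_plus_strategy` from the sketch above, the predictive decision rule admits an equally small sketch (a common heuristic, not necessarily the exact rule of any single cited paper, is to use the previously observed instantaneous regret as the forecast):

```python
def prm_plus_strategy(regrets, prediction):
    """PRM+: act on the positive part of (cumulative regrets + predicted next instantaneous regret)."""
    return rm_plus_strategy(regrets + prediction)
```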
2. Connections to Online Mirror Descent and Optimistic Learning
RM⁺ is equivalent to an instance of online mirror descent (OMD) with a Euclidean potential on the nonnegative orthant, followed by normalization onto the simplex, as shown in (Liu et al., 2021, Farina et al., 2020). The update
$$R^{t+1} = \Pi_{\mathbb{R}^{|A|}_{\ge 0}}\!\big(R^t + u^t - \langle x^t, u^t\rangle \mathbf{1}\big)$$
matches OMD on the nonnegative orthant, since the componentwise positive truncation is exactly the Euclidean projection onto $\mathbb{R}^{|A|}_{\ge 0}$; the step-size is determined adaptively by the norm of the positive regret vector, implicitly tuning the algorithm’s reactivity to observed payoffs. This link underpins the regret guarantees and motivates smooth and predictive modifications, as in PRM⁺ and IREG-PRM⁺.
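A quick numeric check of this orthant-projection view (illustrative only; `nnls` here just solves the projection as a tiny nonnegative least-squares problem):

```python
import numpy as np
from scipy.optimize import nnls

z = np.array([0.7, -1.3, 0.2])

# Euclidean projection of z onto the nonnegative orthant, posed as
# argmin_{x >= 0} ||I x - z||_2.
proj, _ = nnls(np.eye(len(z)), z)

# ...coincides with the componentwise clip RM+ applies to its regret vector.
assert np.allclose(proj, np.maximum(z, 0.0))
```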
IREG-PRM⁺ formalizes this adaptive learning rate by ensuring that $\lVert R^t \rVert_2$ is nondecreasing, which acts analogously to an adaptive step-size in optimistic gradient methods. This invariance is central to matching the $O(1/T)$ average-iterate rate characteristic of optimal first-order methods (mirror-prox, extragradient) (Zhang et al., 6 Oct 2025).
3. Stability, Smoothing, and Convergence Pathologies
Vanilla RM⁺ and PRM⁺ can suffer from instability, including oscillating last iterates and slow convergence in pathological zero-sum games with unique Nash equilibria. RM⁺’s regret operator lacks Lipschitzness and monotonicity, precluding application of standard variational inequality theories for last-iterate convergence (Cai et al., 2023, Farina et al., 2023). Numerical and theoretical evidence shows that:
- RM⁺ and PRM⁺ can stagnate at a constant last-iterate duality gap even in games with small action sets (e.g., small matrix games).
- Alternating PRM⁺ sometimes empirically exhibits last-iterate convergence, but this behavior lacks theoretical justification.
To address these deficiencies, smoothing schemes such as extragradient RM⁺ (ExRM⁺) and smooth predictive RM⁺ (SPRM⁺) introduce predictive or extragradient steps with explicit step sizes and domain projections, recovering asymptotic last-iterate convergence. Restarting approaches and projection-domain modifications (“chopping off” the orthant near the origin) guarantee stability and drive individual and social regret to $O(T^{1/4})$ and $O(1)$, respectively (Cai et al., 2023, Farina et al., 2023).
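A schematic sketch of the restarting idea, built on the RM⁺ primitives above; the halving trigger and the use of current (rather than averaged) iterates are simplifications for illustration and do not reproduce the exact restart conditions of (Cai et al., 2023) or (Farina et al., 2023):

```python
def duality_gap(A, x, y):
    """Saddle-point gap of (x, y) in the matrix game max_x min_y x^T A y."""
    return float(np.max(A @ y) - np.min(A.T @ x))

def rm_plus_with_restarts(A, iterations):
    """RM+ self-play with an illustrative restart rule: reset regrets when the gap halves."""
    n, m = A.shape
    rx, ry = np.zeros(n), np.zeros(m)
    threshold = float("inf")
    for _ in range(iterations):
        x, y = rm_plus_strategy(rx), rm_plus_strategy(ry)
        rx = rm_plus_update(rx, A @ y, x)         # row player maximizes x^T A y
        ry = rm_plus_update(ry, -(A.T @ x), y)    # column player minimizes it
        gap = duality_gap(A, x, y)
        if gap <= threshold / 2:                  # illustrative halving trigger
            rx, ry, threshold = np.zeros(n), np.zeros(m), gap
    return x, y
```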
4. Extensions: Predictive, Extra-Gradient, and Scale-Invariant Methods
Table: Summary of Major RM⁺ Variants and Their Regret/Convergence Properties
| Algorithm | Key Innovation | Regret / Nash Gap |
|---|---|---|
| RM⁺ | Positive truncation | $O(\sqrt{T})$ regret; $O(1/\sqrt{T})$ average-iterate Nash gap |
| PRM⁺ | Optimistic (predicted) update | $O(\sqrt{T})$ worst-case regret; near-constant under predictable losses |
| ExRM⁺/SPRM⁺ | Extragradient/smooth projection | Linear last-iterate with restart; $O(1/\sqrt{T})$ best-iterate |
| IREG-PRM⁺ | Norm nondecrease, scale-invariance | $O(1/T)$ average; $O(1/\sqrt{T})$ best-iterate |
IREG-PRM⁺ (Zhang et al., 6 Oct 2025) guarantees:
- $\lVert R^t \rVert_2$ nondecreasing at each iteration, acting as a step-size controller.
- RVU-type regret bounds that decouple the effective learning rate from hyperparameter tuning, facilitating an $O(1/T)$ average Nash gap without sacrificing parameter-freeness or computational simplicity.
- Empirical performance matching or exceeding predictive CFR⁺ (PCFR⁺) on games such as Kuhn poker, Leduc hold’em, Battleship, and Goofspiel.
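The IREG-PRM⁺ update rule itself is not reproduced here; the hedged diagnostic below only tracks the quantity being controlled, using the vanilla PRM⁺ primitives sketched earlier, for which the norm sequence is in general not monotone:

```python
def regret_norm_trace(A, iterations):
    """Track ||R^t||_2 for the row player under vanilla PRM+ self-play."""
    n, m = A.shape
    rx, ry = np.zeros(n), np.zeros(m)
    px, py = np.zeros(n), np.zeros(m)      # previous instantaneous regrets, used as forecasts
    norms = []
    for _ in range(iterations):
        x, y = prm_plus_strategy(rx, px), prm_plus_strategy(ry, py)
        ux, uy = A @ y, -(A.T @ x)
        px, py = ux - x @ ux, uy - y @ uy  # store instantaneous regrets as the next forecasts
        rx, ry = rm_plus_update(rx, ux, x), rm_plus_update(ry, uy, y)
        norms.append(float(np.linalg.norm(rx)))
    return norms  # vanilla PRM+ gives no monotonicity here; IREG-PRM+ enforces it by construction
```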
Smoothing-based approaches (ExRM⁺, SPRM⁺) require only modest algorithmic modification—extragradient steps, prediction, or restarting—yet yield provable last-iterate convergence and practical stability, as formalized in (Cai et al., 2023).
5. Applications: Large-Scale Self-Play and Role-Balanced Training
RM⁺ is prominent in large-scale self-play training for generalist game AI, especially in multi-role environments. In such contexts, a generalized model trained via naive uniform self-play often displays uneven proficiency across roles. Adapting RM⁺ to operate on the $N \times N$ matrix of role-pairs (for $N$ roles), as in “Balancing the AI Strength of Roles in Self-Play Training with Regret Matching⁺” (Wang, 23 Jan 2024), enables automatic allocation of training focus:
- Positive "regret" matrices for each role-pair drive adaptive changes in the data-sampling distribution.
- An exponentially-smoothed win-rate matrix with mixing floor ensures persistent coverage of all pairs and prevents mode collapse.
- Empirical results (e.g., in a 13-character fighting game) show a ≈43% reduction in win-rate variance across roles when using RM⁺, yielding a more uniformly strong policy compared to baseline self-play.
- Complexity per update is $O(N^2)$; with grouping or subsampling it remains practical for large $N$.
Modifications such as a smoothing factor (γ) and per-pair mixing (η) enhance robustness and balance while retaining sublinear regret guarantees for the sampler; a schematic sketch of such a sampler follows.
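A hedged sketch of how such a sampler might be wired up; the 50%-win-rate regret signal, array shapes, and parameter defaults are assumptions for illustration rather than the exact scheme of (Wang, 23 Jan 2024):

```python
import numpy as np

def update_pair_sampler(regret_matrix, win_rate, observed_wins, gamma=0.9, eta=0.05):
    """One bookkeeping step for an RM+-style sampler over the N x N role-pair matrix.

    regret_matrix : nonnegative (N, N) cumulative "regret" per role-pair
    win_rate      : exponentially smoothed (N, N) win-rate estimates
    observed_wins : (N, N) win rates measured on the latest batch of games
    """
    # Exponential smoothing of the win-rate matrix (gamma is the smoothing factor).
    win_rate = gamma * win_rate + (1.0 - gamma) * observed_wins

    # Use the shortfall from a balanced 50% win rate as the per-pair "regret" signal
    # (an illustrative choice, not necessarily the paper's exact definition), and
    # keep only its positive part, as in RM+.
    regret_matrix = np.maximum(regret_matrix + (0.5 - win_rate), 0.0)

    # RM+-style sampling distribution over role-pairs, mixed with a uniform floor eta
    # so every pair keeps being sampled (prevents mode collapse).
    total = regret_matrix.sum()
    if total > 0:
        probs = regret_matrix / total
    else:
        probs = np.full_like(regret_matrix, 1.0 / regret_matrix.size)
    probs = (1.0 - eta) * probs + eta / probs.size
    return regret_matrix, win_rate, probs
```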
6. Theoretical Properties and Limitations
RM⁺, PRM⁺, and IREG-PRM⁺ have been extensively analyzed:
- Standard RM⁺ (and CFR⁺) provably achieve $O(\sqrt{T})$ worst-case regret and an $O(1/\sqrt{T})$ Nash gap (Farina et al., 2020, Liu et al., 2021); see the sketch after this list.
- Predictive (optimistic) RM⁺ attains near-constant regret and accelerated equilibrium convergence when loss sequences are predictable (Farina et al., 2020).
- IREG-PRM⁺ achieves the theoretically optimal $O(1/T)$ average and $O(1/\sqrt{T})$ best-iterate duality gap, closing the gap to classic extragradient algorithms (Zhang et al., 6 Oct 2025).
- Smoothing and restarting are necessary to achieve linear last-iterate convergence in general; without such mechanisms, raw RM⁺ dynamics may stagnate or oscillate (Cai et al., 2023, Farina et al., 2023).
- Stability fixes (restart, orthant-chopping) guarantee improved convergence for both individual and social regret, as well as robustness in extensive-form games (Farina et al., 2023).
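A small end-to-end check of the first bullet above, reusing `rm_plus_strategy`, `rm_plus_update`, and `duality_gap` from the earlier sketches (the random 10×10 matrix game is illustrative):

```python
def average_iterate_gap(A, iterations):
    """Nash gap of the uniform-average strategies after T rounds of RM+ self-play."""
    n, m = A.shape
    rx, ry = np.zeros(n), np.zeros(m)
    avg_x, avg_y = np.zeros(n), np.zeros(m)
    for _ in range(iterations):
        x, y = rm_plus_strategy(rx), rm_plus_strategy(ry)
        avg_x, avg_y = avg_x + x, avg_y + y
        rx = rm_plus_update(rx, A @ y, x)
        ry = rm_plus_update(ry, -(A.T @ x), y)
    return duality_gap(A, avg_x / iterations, avg_y / iterations)

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
# The gap of the average iterates should shrink as T grows (at worst like 1/sqrt(T)).
print([round(average_iterate_gap(A, T), 4) for T in (100, 1000, 10000)])
```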
7. Empirical Performance and Practical Guidance
Empirical benchmarks demonstrate that:
- Predictive CFR⁺ (PCFR⁺), built on PRM⁺, outperforms prior non-predictive CFR variants, often by orders of magnitude in Nash gap on non-poker zero-sum benchmarks (Farina et al., 2020).
- IREG-PRM⁺ is a drop-in replacement with no new hyperparameters and matches or improves upon PRM⁺, DCFR, and adaptive OGD variants across matrix and extensive-form games (Zhang et al., 6 Oct 2025).
- Smoothing and restarts do not harm performance and may accelerate convergence in practice (Cai et al., 2023).
- For symmetric, multi-role environments, regret-matching-based role-balancing achieves uniform competence while avoiding catastrophic neglect of under-trained roles (Wang, 23 Jan 2024).
- Parameter-freeness and scale-invariance facilitate deployment in practical, large-scale scenarios with minimal tuning.
A plausible implication is that the underlying principle—adaptive, scale-invariant control of the dual variable norm—may inform future algorithmic advances for fast, robust equilibrium computation in high-dimensional, structured games.