Regret Matching⁺ (RM⁺): Scale-Invariant Online Learning
- Regret Matching⁺ (RM⁺) is a scale-invariant, parameter-free online learning algorithm that updates strategies by truncating negative regrets to zero.
- It achieves sublinear external regret (O(√T)) and effective convergence in zero-sum extensive-form games via counterfactual regret minimization.
- Enhanced variants like PRM⁺ and IREG-PRM⁺ integrate predictive and stabilization techniques to bridge theoretical optimality with robust practical performance.
Regret Matching⁺ (RM⁺) is a scale-invariant, parameter-free online learning algorithm that underlies the most effective approaches to large-scale zero-sum game solving and is a cornerstone of state-of-the-art counterfactual regret minimization (CFR) methods. RM⁺ extends classic regret matching by truncating negative cumulative regrets to zero at each step, enabling aggressive yet provably no-regret adaptation that fits naturally within the mirror descent and Blackwell approachability frameworks. Modern extensions, most notably Predictive RM⁺ (PRM⁺) and the scale-invariant IREG-PRM⁺, connect RM⁺ to recent advances in first-order optimization, achieving optimal average-iterate convergence rates and closing the gap between practical game-solving performance and convex-optimization theory (Zhang et al., 6 Oct 2025, Farina et al., 2020, Farina et al., 2023, Cai et al., 2023, Anagnostides et al., 20 Oct 2025).
1. Core Algorithmic Structure
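RM⁺ operates over the probability simplex $\Delta^n$ for $n$ actions. At iteration $t$, the algorithm updates as follows (Zhang et al., 6 Oct 2025, Farina et al., 2020, Xu et al., 2024, Farina et al., 2023):
- Step 1 (Counterfactual Regret): Given the current mixed strategy $x^t \in \Delta^n$ and observed utility vector $u^t \in \mathbb{R}^n$ (for a loss vector $\ell^t$, set $u^t = -\ell^t$), compute the instantaneous counterfactual regret increment:
$$r^t = u^t - \langle u^t, x^t \rangle \mathbf{1}.$$
- Step 2 (Cumulative Positive Regret, Truncation): Cumulative regrets are updated using coordinatewise truncation:
$$R^{t+1} = [R^t + r^t]^+,$$
where $R^1 = 0$ and $[\cdot]^+$ denotes the coordinate-wise maximum with zero.
- Step 3 (Strategy Update): The next strategy is proportional to positive cumulative regrets:
$$x^{t+1} = \frac{R^{t+1}}{\|R^{t+1}\|_1},$$
with $x^{t+1}$ chosen arbitrarily (for example, uniform) when $R^{t+1} = 0$.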
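This ensures that only actions with positive cumulative regret receive probability mass. The algorithm is parameter-free (there is no step size to tune) and scale-invariant: multiplying every utility vector by the same positive constant rescales the cumulative regrets by that constant, and the normalization in Step 3 leaves the strategies unchanged.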
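The three steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `rm_plus_strategy` and `rm_plus_update` are hypothetical, and falling back to the uniform strategy when all cumulative regrets are zero is a convention assumed here.

```python
from typing import List, Sequence


def rm_plus_strategy(cum_regret: Sequence[float]) -> List[float]:
    """Step 3: normalize the non-negative cumulative regrets into a strategy."""
    total = sum(cum_regret)
    n = len(cum_regret)
    if total <= 0.0:
        # No action has positive regret yet: play uniform (assumed convention).
        return [1.0 / n] * n
    return [r / total for r in cum_regret]


def rm_plus_update(cum_regret: Sequence[float],
                   strategy: Sequence[float],
                   utility: Sequence[float]) -> List[float]:
    """Steps 1-2: add the instantaneous regret, then clip negatives to zero."""
    expected = sum(x * u for x, u in zip(strategy, utility))
    return [max(0.0, r + (u - expected)) for r, u in zip(cum_regret, utility)]


# Usage: three actions, one observed utility vector.
R = [0.0, 0.0, 0.0]
x = rm_plus_strategy(R)                      # uniform, since R is all zeros
R = rm_plus_update(R, x, [1.0, 0.0, -1.0])   # R becomes [1.0, 0.0, 0.0]
x_next = rm_plus_strategy(R)
print(x_next)                                # [1.0, 0.0, 0.0]
```

Scaling the utility vector by any positive constant (say `[10.0, 0.0, -10.0]`) changes `R` but yields the same `x_next`, which is the scale invariance described above.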
2. Theoretical Properties and Convergence
RM⁺ achieves sublinear external regret, and its integration with CFR yields scalable approximate Nash equilibria in zero-sum extensive-form games. Key properties (Zhang et al., 6 Oct 2025, Farina et al., 2020, Xu et al., 2024, Brown et al., 2018, Farina et al., 2023, Cai et al., 2023, Anagnostides et al., 20 Oct 2025):
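- External Regret Bound: For any fixed comparator $\hat{x} \in \Delta^n$, RM⁺ guarantees
$$\sum_{t=1}^{T} \langle u^t, \hat{x} - x^t \rangle = O(\sqrt{T}),$$
where $u^t$ is the utility vector observed at iteration $t$ and $x^t$ is the strategy played.
- Ergodic/average-iterate convergence: When used within CFR, the exploitability of the average strategy converges as $O(1/\sqrt{T})$ for standard RM⁺ (Farina et al., 2020, Zhang et al., 6 Oct 2025).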
- Optimality Gaps in Optimization: In constrained optimization over the simplex, alternating RM⁺ reaches $\epsilon$-KKT points in $O(1/\epsilon^4)$ iterations, with possible improvement to $O(1/\epsilon)$ when the regret is uniformly bounded (Anagnostides et al., 20 Oct 2025).
- Exponential lower bound for vanilla RM: Without truncation, ordinary RM can require an exponential number of steps to reach approximate equilibrium in potential games, while RM⁺ avoids stalling via regret clipping (negative regrets are immediately reset to zero) (Anagnostides et al., 20 Oct 2025).
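As a concrete illustration of the average-iterate guarantee, the following minimal Python sketch runs RM⁺ in self-play on a small zero-sum matrix game and measures the duality gap (exploitability) of the uniformly averaged strategies. The 2×2 payoff matrix, the helper names (`normalize`, `clip_update`, `self_play_gap`), and the choice of simultaneous updates with uniform averaging are assumptions made for this example, not details taken from the cited papers.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def normalize(regret: Sequence[float]) -> List[float]:
    """Strategy proportional to non-negative regrets (uniform if all are zero)."""
    total = sum(regret)
    if total <= 0.0:
        return [1.0 / len(regret)] * len(regret)
    return [r / total for r in regret]


def clip_update(regret: Sequence[float], strategy: Sequence[float],
                utility: Sequence[float]) -> List[float]:
    """RM+ regret update: accumulate instantaneous regret, clip at zero."""
    expected = sum(x * u for x, u in zip(strategy, utility))
    return [max(0.0, r + u - expected) for r, u in zip(regret, utility)]


def self_play_gap(A: Matrix, T: int) -> float:
    """Run RM+ for both players of the zero-sum game x^T A y (row player
    maximizes) and return the duality gap of the uniform average strategies."""
    n, m = len(A), len(A[0])
    Rx, Ry = [0.0] * n, [0.0] * m
    sum_x, sum_y = [0.0] * n, [0.0] * m
    for _ in range(T):
        x, y = normalize(Rx), normalize(Ry)
        ux = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]    # A y
        uy = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]   # -A^T x
        Rx, Ry = clip_update(Rx, x, ux), clip_update(Ry, y, uy)
        sum_x = [s + v for s, v in zip(sum_x, x)]
        sum_y = [s + v for s, v in zip(sum_y, y)]
    avg_x = [s / T for s in sum_x]
    avg_y = [s / T for s in sum_y]
    # Best-response value against each average strategy.
    best_row = max(sum(A[i][j] * avg_y[j] for j in range(m)) for i in range(n))
    best_col = min(sum(A[i][j] * avg_x[i] for i in range(n)) for j in range(m))
    return best_row - best_col


game = [[2.0, -1.0], [-1.0, 1.0]]   # illustrative payoff matrix (assumption)
gap = self_play_gap(game, 10_000)
print(f"duality gap after 10000 iterations: {gap:.4f}")
```

For this matrix each instantaneous regret vector has $\ell_2$-norm at most $3\sqrt{2}$, so the standard RM⁺ analysis bounds the gap by roughly $8.5/\sqrt{T}$, which is below 0.1 at $T = 10{,}000$. Observed gaps are typically much smaller than this worst-case bound.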
3. Connections with Mirror Descent and Blackwell Approachability
RM⁺ is formally equivalent to Online Mirror Descent (OMD) with a quadratic regularizer run on the non-negative orthant, with the simplex strategy recovered by normalization (Farina et al., 2020, Xu et al., 2024). Writing $R^t$ for the cumulative regret vector and $r^t$ for the instantaneous regret, the update
$$R^{t+1} = \arg\min_{R \in \mathbb{R}^n_{\ge 0}} \left\{ -\langle r^t, R \rangle + \tfrac{1}{2} \|R - R^t\|_2^2 \right\} = [R^t + r^t]^+,$$
with appropriate normalization, recovers the RM⁺ mechanism. The truncated regret update coincides with the Bregman projection under the squared $\ell_2$ norm. This enables a direct connection to Blackwell approachability: in Blackwell approachability games, selecting forced halfspaces via the current non-negative regret vector and projecting yields the same iterates as OMD with quadratic regularization (Farina et al., 2020).
Furthermore, in the predictive setting, RM⁺ underlies optimistic OMD in the non-negative orthant, and PRM⁺ is the instance where the gradient predictions are one-step lagged (Xu et al., 2024).
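The equivalence between the quadratic-regularizer OMD step on the non-negative orthant and coordinate-wise truncation can be checked by completing the square. The short derivation below is a sketch under the assumption of unit step size, with $R^t \ge 0$ denoting the cumulative regret vector and $r^t$ the instantaneous regret (notation introduced here for illustration):

$$
\begin{aligned}
-\langle r^t, R \rangle + \tfrac{1}{2}\|R - R^t\|_2^2
&= \tfrac{1}{2}\|R - (R^t + r^t)\|_2^2 - \langle r^t, R^t \rangle - \tfrac{1}{2}\|r^t\|_2^2, \\
\arg\min_{R \ge 0} \tfrac{1}{2}\|R - (R^t + r^t)\|_2^2
&= \Big( \arg\min_{R_i \ge 0} \big(R_i - (R^t_i + r^t_i)\big)^2 \Big)_{i=1}^{n}
= \big( \max\{0,\, R^t_i + r^t_i\} \big)_{i=1}^{n}.
\end{aligned}
$$

The last two terms in the first line do not depend on $R$, and the remaining objective is separable across coordinates, so the minimizer is exactly the truncated vector $[R^t + r^t]^+$.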
4. Extensions: Predictive RM⁺, IREG-PRM⁺, and Stability
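- Predictive RM⁺ (PRM⁺): This variant incorporates a one-step-ahead prediction $m^{t+1}$ of the upcoming regret $r^{t+1}$ (a common choice is the previous regret, $m^{t+1} = r^t$), forming a “predicted” regret vector from the cumulative regret $R^{t+1}$ that determines the next strategy:
$$\tilde{R}^{t+1} = [R^{t+1} + m^{t+1}]^+, \qquad x^{t+1} = \frac{\tilde{R}^{t+1}}{\|\tilde{R}^{t+1}\|_1}.$$
The observed utility is then used to update the true regret vector (Zhang et al., 6 Oct 2025, Xu et al., 2024). PRM⁺ empirically accelerates convergence, especially under smoothly evolving loss sequences, but can converge as slowly as $\Omega(T^{-1/2})$ in the worst case.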
- Scale-Invariant Predictive RM⁺ (IREG-PRM⁺): This recent variant guarantees that the $\ell_2$-norm of the regret vector is non-decreasing by shifting the prediction-plus-old regret vector by a scalar at each step so that this invariant is maintained (Zhang et al., 6 Oct 2025). This sharpens the learning dynamics and enables optimal $O(T^{-1})$ average-iterate and $O(T^{-1/2})$ best-iterate convergence in zero-sum games, matching the best-known rates for mirror-prox and optimistic mirror descent, without introducing hyperparameters.
- Stabilization and Convergence Guarantees: While RM⁺ and PRM⁺ achieve worst-case $O(\sqrt{T})$ regret, their updates can be unstable (large jumps in iterates), potentially destabilizing convergence when deployed in multi-agent game dynamics (Farina et al., 2023, Cai et al., 2023). Techniques such as restarts and “chopping” the orthant (restricting the regret vector’s $\ell_1$-norm to exceed a threshold) mitigate this, yielding $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games (Farina et al., 2023).
- Smoothing for Last-Iterate Convergence: Smoothing via extragradient techniques (ExRM⁺, Smooth Predictive RM⁺) provably achieves asymptotic last-iterate convergence and $O(1/\sqrt{T})$ best-iterate convergence of the equilibrium gap in zero-sum games, with restarted variants attaining linear last-iterate convergence under certain conditions (Cai et al., 2023).
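A predictive update of the PRM⁺ kind can be sketched as follows. This is an illustrative sketch, not the authors' code: the class name `PredictiveRMPlus` and its methods are hypothetical, the prediction is supplied as a guess of the next utility vector, and converting that guess into a predicted regret using the previously played strategy is a convention assumed here.

```python
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


class PredictiveRMPlus:
    """Sketch of a predictive regret-matching-plus learner (hypothetical API)."""

    def __init__(self, n: int) -> None:
        self.R: List[float] = [0.0] * n        # true cumulative (clipped) regret
        self.x: List[float] = [1.0 / n] * n    # most recently played strategy

    def next_strategy(self, prediction: Sequence[float]) -> List[float]:
        """Form the predicted regret vector and normalize it into a strategy."""
        ev = _dot(prediction, self.x)
        theta = [max(0.0, r + m - ev) for r, m in zip(self.R, prediction)]
        total = sum(theta)
        n = len(theta)
        # Uniform fallback when no coordinate is positive (assumed convention).
        self.x = [t / total for t in theta] if total > 0.0 else [1.0 / n] * n
        return self.x

    def observe(self, utility: Sequence[float]) -> None:
        """Update the true regret with the observed utility, clipping at zero."""
        ev = _dot(utility, self.x)
        self.R = [max(0.0, r + u - ev) for r, u in zip(self.R, utility)]


# Usage: predict the next utility with the last observed one (one-step lag).
learner = PredictiveRMPlus(2)
x1 = learner.next_strategy([0.0, 0.0])   # no information yet: uniform
learner.observe([1.0, 0.0])              # true regret becomes [0.5, 0.0]
x2 = learner.next_strategy([1.0, 0.0])   # prediction shifts mass to action 0
print(x1, x2)
```

Note that `next_strategy` leaves the true regret `R` untouched; only `observe` modifies it, so a poor prediction affects a single iterate rather than the accumulated state.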
5. Role in Counterfactual Regret Minimization (CFR) and Benchmark Applications
Within the CFR framework for imperfect-information games, RM⁺ is used as the local regret minimizer at every information set. Each decision point independently updates a local regret vector, fed by counterfactual loss signals that reflect the global state and downstream values (Farina et al., 2020, Xu et al., 2024, Brown et al., 2018). RM⁺-style local minimizers, including PRM⁺ and weighted or discounted versions, underlie CFR variants such as Predictive CFR⁺ (PCFR⁺) and Discounted CFR (DCFR) and, more recently, the parameter-free, scale-invariant IREG-PRM⁺ (Zhang et al., 6 Oct 2025). These have been reported empirically to speed up exploitability convergence by multiple orders of magnitude on diverse benchmarks (e.g., Leduc Poker, Goofspiel, Battleship, Kuhn Poker, Liar’s Dice).
In generalized self-play training for games with multiple roles, RM⁺ has been used to reweight the sampling distribution over role pairs, leading to more uniformly balanced AI strength across roles and reduced win-rate variance (Wang, 2024).
6. Comparative Analysis and Limitations
The introduction of regret clipping in RM⁺ yields crucial contrasts with ordinary RM: unlike RM, which can stall for exponentially many iterations in certain potential and identical-interest games, RM⁺ ensures that the $\ell_2$-norm of the cumulative regret vector never decreases and always makes progress towards approximate KKT solutions (Anagnostides et al., 20 Oct 2025). Nonetheless, vanilla RM⁺ and PRM⁺ still exhibit suboptimal behavior in adversarial or oscillatory settings, motivating both stabilization via extragradient methods and scale-invariant modifications such as IREG-PRM⁺ to obtain best-possible convergence rates and robust last-iterate behavior (Zhang et al., 6 Oct 2025, Farina et al., 2023, Cai et al., 2023).
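The monotonicity of the regret norm under plain RM⁺ can be verified directly for the Euclidean norm. The following derivation is a sketch using notation assumed here: $R^t \ge 0$ is the cumulative regret, $x^t = R^t / \|R^t\|_1$ the strategy, $r^t = u^t - \langle u^t, x^t \rangle \mathbf{1}$ the instantaneous regret, and $y = R^t + r^t$ the pre-truncation vector, so that $R^{t+1} = [y]^+$.

$$
\begin{aligned}
\langle R^t, r^t \rangle &= \langle R^t, u^t \rangle - \langle u^t, x^t \rangle \|R^t\|_1 = 0
\quad\Longrightarrow\quad \|y\|_2^2 = \|R^t\|_2^2 + \|r^t\|_2^2, \\
y_i < 0 &\;\Longrightarrow\; 0 < -y_i = -r^t_i - R^t_i \le |r^t_i|
\quad\Longrightarrow\quad y_i^2 \le (r^t_i)^2, \\
\|R^{t+1}\|_2^2 &= \|y\|_2^2 - \sum_{i:\, y_i < 0} y_i^2
\;\ge\; \|R^t\|_2^2 + \|r^t\|_2^2 - \sum_{i:\, y_i < 0} (r^t_i)^2
\;\ge\; \|R^t\|_2^2 .
\end{aligned}
$$

The orthogonality in the first line relies on the strategy being exactly proportional to the regret vector (it also holds trivially when $R^t = 0$), which is why the argument does not carry over unchanged to predictive variants that play a strategy built from a shifted vector.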
Empirical and theoretical analyses indicate that nonuniform averaging and predictive updates further enhance performance, but only under smoothness or regularity in loss predictions (Zhang et al., 6 Oct 2025, Farina et al., 2020, Xu et al., 2024). Weighted/discounted variants of CFR built on aggressive regret discounting or weighted averaging succeed in dominated-action-heavy or deep tree regimes (Brown et al., 2018, Xu et al., 2024).
7. Summary Table: RM⁺ and Variants (Selected Properties)
| Algorithm | Regret / Convergence Rate | Averaging | Last-Iterate | Param-free/Scale-inv | Notable Feature |
|---|---|---|---|---|---|
| RM⁺ | $O(\sqrt{T})$ regret | Linear | No guarantee | Yes | Clips negative regrets |
| PRM⁺ | $O(\sqrt{T})$ regret* | Linear/quadratic | No guarantee | Yes | Predictive, “optimistic” update |
| IREG-PRM⁺ | $O(T^{-1})$ average-iterate | Linear/quadratic | $O(T^{-1/2})$ (best-iterate) | Yes | Non-decreasing $\ell_2$ regret norm |
| ExRM⁺/Smooth PRM⁺ | $O(1/\sqrt{T})$ best-iterate | Any | Asymptotic; linear with restarts | Yes | Extragradient/smoothing stabilization |
| RM (no “plus”) | $O(\sqrt{T})$ regret | Any | No guarantee | Yes | Can stall for exponentially many iterations |
(*) Worst-case bound; faster average-iterate convergence is observed under smooth, predictable loss sequences (Zhang et al., 6 Oct 2025, Xu et al., 2024).
References
- "Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games" (Zhang et al., 6 Oct 2025)
- "Faster Game Solving via Predictive Blackwell Approachability: Connecting Regret Matching and Mirror Descent" (Farina et al., 2020)
- "Regret Matching+: (In)Stability and Fast Convergence in Games" (Farina et al., 2023)
- "Last-Iterate Convergence Properties of Regret-Matching Algorithms in Games" (Cai et al., 2023)
- "Convergence of Regret Matching in Potential Games and Constrained Optimization" (Anagnostides et al., 20 Oct 2025)
- "Minimizing Weighted Counterfactual Regret with Optimistic Online Mirror Descent" (Xu et al., 2024)
- "Solving Imperfect-Information Games via Discounted Regret Minimization" (Brown et al., 2018)
- "Balancing the AI Strength of Roles in Self-Play Training with Regret Matching+" (Wang, 2024)