Minimax Regret Learning: Foundations & Methods

Updated 13 March 2026

Minimax Regret Learning is a framework that minimizes the maximum excess loss compared to the oracle solution across diverse uncertain environments.
It spans applications in supervised learning, reinforcement learning, bandits, and online decision-making by framing the problem as a two-player zero-sum game.
Algorithmic realizations include adversarial training and dual methods, which provide uniform regret control and robustness to distribution shifts.

Minimax Regret Learning (MMR) is a framework in statistical decision theory, machine learning, and reinforcement learning for achieving distributionally robust, near-uniform out-of-sample performance under adversarial or unknown variability, whether that variability is over environment parameters, sub-populations, covariate shifts, or nonstationary feedback. MMR seeks a decision rule, policy, or estimator that minimizes the worst-case (maximum) regret across a prescribed uncertainty set, where regret is the gap between the learner's performance and that of the best-in-hindsight solution for each realization of the environment or data-generating process.

1. Formalization of Minimax Regret Learning

The core principle of minimax regret is to select a predictor, policy, or decision rule $f$ (or $\pi$) that achieves

$\min_{f}\max_{e\in\mathcal{E}}\, \left[ R(f; e) - \min_{f'} R(f'; e) \right],$

where $e$ ranges over possible adversarial choices or environmental parameters, $R(f; e)$ is the risk, loss, or negative reward of $f$ in environment $e$, and the inner minimum is the Bayes-optimal or oracle risk for $e$. This formulation, which appears in supervised learning, online learning, statistical estimation, bandits, and reinforcement learning, quantifies how much performance is lost by not knowing the true environment in advance.

In supervised or regression settings, $e$ may represent distributions over data, groups, or sub-populations (Mo et al., 2024, Agarwal et al., 2022, Rakhlin et al., 2013).
In reinforcement learning (RL), $e$ can parametrize the MDP, such as level parameters, reward functions, or unobserved configurations (Beukman et al., 2024, Bongole et al., 2024, Boone et al., 2024, Xu et al., 2021).
In finite-armed bandits, decision theory, or the analysis of randomized controlled trials (RCTs), $e$ may be the unknown vector of mean outcomes across arms (Joo, 4 Sep 2025).
In online learning, $e$ is the adversarially chosen sequence of losses or outcomes (Orabona et al., 2015, Eldowa et al., 2023, 0903.5328, Tsuchiya et al., 2024).
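For a finite set of candidates and environments, the min-max expression above reduces to arithmetic on a risk matrix. The following is a minimal sketch under that assumption; the function name `minimax_regret_choice` and the example risk values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def minimax_regret_choice(risk: Sequence[Sequence[float]]) -> tuple[int, float]:
    """Pick the candidate with the smallest worst-case regret.

    risk[i][e] is the risk R(f_i; e) of candidate i in environment e.
    Returns (index of the chosen candidate, its worst-case regret).
    """
    n_env = len(risk[0])
    # Oracle risk per environment: min over candidates f' of R(f'; e).
    oracle = [min(row[e] for row in risk) for e in range(n_env)]
    # Worst-case regret of each candidate across all environments.
    worst = [max(row[e] - oracle[e] for e in range(n_env)) for row in risk]
    best = min(range(len(risk)), key=worst.__getitem__)
    return best, worst[best]


# Hypothetical risks: rows are candidates, columns are environments.
risk = [
    [1.0, 9.0],  # f_0: tuned to environment 0
    [8.0, 2.0],  # f_1: tuned to environment 1
    [3.0, 4.0],  # f_2: a compromise
]
best, value = minimax_regret_choice(risk)
print(best, value)  # 2 2.0
```

Here the oracle risks are 1.0 and 2.0, so the two specialised candidates each have worst-case regret 7.0, while the compromise candidate has regret 2.0 in both environments and is selected.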

The minimax regret acts as the value of a zero-sum two-player game (learner versus Nature), and minimax duality (von Neumann, Wald, Blackwell–Girshick) ensures existence of a saddle-point equilibrium in broad settings.

Table: Core MMR Formulations Across Domains

Domain	Regret Definition	Reference Example
Supervised Learning	$L(f; e) - \min_{f'} L(f'; e)$	(Mo et al., 2024, Agarwal et al., 2022)
Multi-arm RCTs	$\max_k \theta_k - \sum_j\phi_j(X)\theta_j$ (best-arm selection)	(Joo, 4 Sep 2025)
Bandits/Online	$\sum_{t=1}^T \ell_t(a_t) - \min_a \sum_{t=1}^T \ell_t(a)$	(Orabona et al., 2015, Eldowa et al., 2023)
RL (MDP)	$V^*_e - V^\pi_e$	(Bongole et al., 2024, Beukman et al., 2024)
Group Robust Learning	$\max_g \left[ L_g(\theta) - \min_\beta L_g(\beta) \right]$	(Mo et al., 2024)
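The Bandits/Online row of the table can be made concrete with a few lines of arithmetic. This is an illustrative sketch only: the helper `external_regret` and the loss sequence are hypothetical, and the losses are assumed to be fully observed after the fact.

```python
def external_regret(losses, actions):
    """Cumulative loss of the played actions minus that of the best fixed action.

    losses[t][a] is the loss of action a at round t; actions[t] is the action played.
    """
    incurred = sum(losses[t][a] for t, a in enumerate(actions))
    n_actions = len(losses[0])
    best_fixed = min(sum(row[a] for row in losses) for a in range(n_actions))
    return incurred - best_fixed


# Hypothetical 4-round, 2-action loss sequence.
losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
actions = [1, 0, 0, 1]
regret = external_regret(losses, actions)
print(regret)  # 2.0
```

The learner incurs a total loss of 3.0, while always playing action 0 would have cost 1.0, giving a regret of 2.0.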

2. Structural and Theoretical Properties

Minimax regret learning enjoys several robustness properties not matched by value- or risk-based approaches.

Uniform Regret Control: By design, MMR controls the excess loss over all environments $e$ simultaneously, independent of nuisance parameters such as group proportions, heteroscedasticity, or base-level risk (Mo et al., 2024, Agarwal et al., 2022). In contrast, distributionally robust optimization (DRO) and related methods can overemphasize high-noise or large-mass environments, leading to degraded out-of-sample robustness.
Monotonic Refinement and Bayesian Consistency: For domains characterized by structural irreducibility (e.g., partially-observable RL, irreversible environment classes), standard MMR policies can stagnate on a subset of maximal-regret states. Refined approaches (e.g., Bayesian-level-perfect MMR, BLP) monotonically improve regret on residual domains, yielding a policy that acts perfectly Bayesian off-equilibrium (Beukman et al., 2024).
Equivalence to Minimax Excess Risk (MER): For broad problem classes, information-theoretic and duality arguments equate the minimax regret problem to the minimax excess risk studied in supervised learning; this equivalence supports tight upper and lower bounds across bandit, contextual, and RL settings (Bongole et al., 2024, 0903.5328).

3. Algorithmic Realizations

Minimax regret algorithms are generally instantiations or refinements of the following generic framework: treat the choice of $f$ or $\pi$ as a minimizer, and Nature/environment-parameter as a maximizer, in a two-player zero-sum game.

Fictitious Play/Policy-Gradient Adversarial Training: For unsupervised environment design (UED) in RL, alternating policy improvement and adversarial environment generation converges to Nash minimax-regret equilibria (Beukman et al., 2024).
Optimization over Weighted Loss/Regret: In supervised group-robust learning (K-MMR), smooth majorization-minimization or dual QP methods optimize $\min_\theta \max_g$ (empirical group regret), exploiting convex-concave structure (Mo et al., 2024).
Least-Favorable Priors and Dual Methods: In best-arm selection and classical decision problems, the minimax-regret rule is the unique Bayes rule under a least-favorable prior; this dual characterization enables analytic rules and plug-in procedures (e.g., AMMR) (Joo, 4 Sep 2025).
Projected Mitigated Extended Value Iteration (PMEVI): In average-reward MDPs, efficient span-constrained planning with bias-projections yields minimax-optimal regret rates, significantly outperforming classical EVI-based algorithms (Boone et al., 2024).
Bandit and Online Regret Matching: In online learning, Tsallis/FTRL variants with carefully matched stability, penalty, and bias terms achieve the minimax $T^{2/3}$ regret in partial monitoring and feedback-graph bandits (Tsuchiya et al., 2024, Eldowa et al., 2023).
Double-Oracle for Parameter Uncertainty: For robust RL under model uncertainty, iterative expansion of restricted policy and parameter classes converges to minimax-regret strategies (Xu et al., 2021).
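The generic two-player template at the start of this section can be sketched on a finite regret matrix. The code below is not any of the cited algorithms: it is a textbook construction in which the learner runs exponential weights (Hedge) over decisions and Nature plays a best response each round, so that the learner's averaged strategy approximates a mixed minimax-regret strategy. The function name, the round count, and the assumption that regrets lie in $[0, 1]$ are illustrative choices.

```python
import math


def solve_mixed_minimax_regret(regret, rounds=2000):
    """Approximate the learner's mixed minimax-regret strategy.

    regret[i][e] in [0, 1] is the regret of decision i in environment e.
    Returns (averaged mixed strategy, its worst-case expected regret).
    """
    n_dec, n_env = len(regret), len(regret[0])
    eta = math.sqrt(8.0 * math.log(n_dec) / rounds)  # standard Hedge step size
    cum_loss = [0.0] * n_dec
    avg_p = [0.0] * n_dec
    for _ in range(rounds):
        # Learner: exponential weights over decisions (shifted for stability).
        shift = min(cum_loss)
        weights = [math.exp(-eta * (c - shift)) for c in cum_loss]
        total = sum(weights)
        p = [w / total for w in weights]
        # Nature: best response, the environment maximising expected regret.
        expected = [
            sum(p[i] * regret[i][e] for i in range(n_dec)) for e in range(n_env)
        ]
        e_star = max(range(n_env), key=expected.__getitem__)
        for i in range(n_dec):
            cum_loss[i] += regret[i][e_star]
            avg_p[i] += p[i] / rounds
    worst = max(
        sum(avg_p[i] * regret[i][e] for i in range(n_dec)) for e in range(n_env)
    )
    return avg_p, worst


# Hypothetical regret matrix: each decision is perfect in one environment only.
regret = [[0.0, 1.0], [1.0, 0.0]]
strategy, worst_case = solve_mixed_minimax_regret(regret)
```

On this matrix every deterministic decision has worst-case regret 1, whereas the averaged strategy is close to the uniform mixture, whose worst-case expected regret is 0.5. The Hedge regret bound implies the gap to the game value shrinks at rate $\sqrt{\ln d / (2T)}$ for $d$ decisions and $T$ rounds.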

4. Statistical and Regret-Rate Guarantees

The minimax regret rate—how regret grows with sample size or rounds—varies by problem structure and information conditions:

Online Expert Advice/Full Information: Upper and lower bounds match, giving a minimax regret of $\Theta(\sqrt{n \ln d})$ for $n$ rounds and $d$ experts (Orabona et al., 2015).
Bandits, Graph Bandits, Partial Monitoring: Minimax regret rates interpolate between $\Theta(\sqrt{K T})$ (bandits, $K$ arms) and $\Theta(T^{2/3})$ (partial monitoring, weakly observable feedback graphs) (Eldowa et al., 2023, Tsuchiya et al., 2024). Adaptive learning-rate schemes are required for optimality in hard regimes.
Supervised Regression: For classes whose $\varepsilon$-entropy grows polynomially with exponent $p$, regret rates are $n^{-2/(2+p)}$ for $p<2$ (equal to minimax risk), and $n^{-1/p}$ for $p>2$ (regret slower than minimax risk) (Rakhlin et al., 2013).
Group-Robust Learning: K-MMR and P-MMR estimators achieve uniform or fast local convergence rates (e.g., $O(\sqrt{(\mathrm{VC} + \log K)/n})$ for VC-type classes), and enjoy group-portability and heteroscedasticity invariance (Mo et al., 2024).
Finite-Horizon MDPs: Regret bounds of $\tilde{O}(\sqrt{H S A T})$ are minimax-optimal, with worst-case matching lower bounds (Azar et al., 2017).
Average-Reward MDPs: PMEVI methods attain $\tilde{O}(\sqrt{\operatorname{sp}(h^*) S A T})$ with no prior knowledge of the bias span (Boone et al., 2024).
Best-Arm Multi-arm RCTs: MMR-optimal rules are strictly better than empirical-success rules for $J \geq 3$ with heteroscedastic design, with deterministic, unique decision boundaries (Joo, 4 Sep 2025).
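To get a feel for the two supervised-regression regimes listed above, the exponents can be evaluated directly. This is plain arithmetic on the stated rates, ignoring constants; the helper name `regret_rate_exponent` is hypothetical, and the boundary case $p = 2$ is deliberately left out because the rates above are stated only for $p < 2$ and $p > 2$.

```python
def regret_rate_exponent(p: float) -> float:
    """Return r such that the stated regret rate is n**(-r), for p != 2."""
    if p <= 0:
        raise ValueError("entropy exponent p must be positive")
    if p < 2:
        return 2.0 / (2.0 + p)  # rate n^{-2/(2+p)}
    if p > 2:
        return 1.0 / p  # rate n^{-1/p}
    raise ValueError("p = 2 is a boundary case not covered by the stated rates")


n = 10_000
for p in (1.0, 4.0):
    r = regret_rate_exponent(p)
    print(f"p={p}: rate n^-{r:.3f} = {n ** -r:.4f} at n={n}")
```

With $n = 10{,}000$, a class with $p = 1$ gives a rate of about $0.002$, while $p = 4$ gives $0.1$. The two formulas agree at $p = 2$, where both exponents equal $1/2$.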

5. Robustness, Invariance, and Pathologies

MMR methods distinguish themselves by invariance to nuisance variation and correction of specific pathologies in standard robust learning:

Robustness to Heterogeneity: MMR is insensitive to differences in group/sample sizes, group-dependent noise variance, or mixture proportions (Mo et al., 2024, Agarwal et al., 2022).
Remedying DRO's Over-conservatism: By optimizing regret rather than absolute risk under each environment, MMR avoids overfitting to high-noise or uninformative distributions which can dominate DRO solutions (Agarwal et al., 2022).
Stagnation and Refinement in RL: In RL-driven UED, classical MMR strategies can stagnate by focusing only on states or environments with irreducible regret (where further learning is impossible). The BLP refinement splits the domain iteratively, freezing behavior on high-regret levels and further reducing regret, thereby breaking stagnation (Beukman et al., 2024).
Mitigating Goal Misgeneralization: In underspecified RL with proxy and true goals, minimax expected regret (MMER) policies provably avoid misgeneralization pathologies that afflict maximum-expected-value (MEV) approaches (Sadek et al., 3 Jul 2025).
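The contrast between worst-case risk and worst-case regret can be illustrated with a toy two-group problem. The setup is hypothetical and not taken from the cited papers: each group has a mean and a noise variance, a single constant prediction $\theta$ is evaluated by expected squared error, and the worst-case-risk objective stands in for a group-DRO-style criterion.

```python
def group_risk(theta, mu, noise_var):
    """Expected squared error of predicting theta for a group with mean mu."""
    return (theta - mu) ** 2 + noise_var


# (mean, noise variance) per group; the second group is much noisier.
groups = [(0.0, 0.0), (1.0, 4.0)]
grid = [i / 100 for i in range(101)]  # candidate predictions in [0, 1]


def worst_risk(theta):
    # Worst-case absolute risk across groups (group-DRO-style objective).
    return max(group_risk(theta, mu, v) for mu, v in groups)


def worst_regret(theta):
    # The oracle risk of each group is its noise variance, attained at theta = mu.
    return max(group_risk(theta, mu, v) - v for mu, v in groups)


theta_dro = min(grid, key=worst_risk)
theta_mmr = min(grid, key=worst_regret)
print(theta_dro, worst_regret(theta_dro))  # 1.0 1.0
print(theta_mmr, worst_regret(theta_mmr))  # 0.5 0.25
```

The worst-case-risk solution moves all the way to the noisy group's mean, because that group's irreducible noise dominates the maximum, and it suffers regret 1.0 on the clean group. The minimax-regret solution subtracts the irreducible noise first, settles at the midpoint, and has regret 0.25 on both groups.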

6. Practical Implementations and Empirical Results

Empirical studies across several domains report favorable and stable performance for minimax regret algorithms:

Reinforcement Learning Environments: In grid-world, minigrid (T-maze, blindfold), and lever-picking domains, ReMiDi and BLP-type approaches prevent stagnation and systematically reduce regret across all learnable levels (Beukman et al., 2024).
Best-Arm Selection: The AMMR plug-in rule for one-shot multi-arm RCTs is computationally tractable, variance-adaptive, and strictly dominates empirical success selection under heteroscedastic noise (Joo, 4 Sep 2025).
Learning with Subgroup Heterogeneity: K-MMR achieves uniformly low worst-case regret in synthetic scenarios and outperforms pooled ERM, GDRO, and MMV in transplantation datasets—delivering robust generalization under distribution shift, heteroscedasticity, and equivariance (Mo et al., 2024).
Robust RL under Parameter Uncertainty: Double-oracle MIRROR achieves minimax regret policies superior to maximin and single-shot robust baselines in green security domains under adversarially chosen behavioral models (Xu et al., 2021).
Online/Partial Monitoring: Optimal $T^{2/3}$ -rate FTRL with Tsallis entropy, using stability/penalty/bias matching, achieves best-of-both-worlds performance in adversarial and stochastic regimes, with simplified rate schedules (Tsuchiya et al., 2024).

7. Open Problems, Limitations, and Future Directions

While minimax regret guarantees represent a practical and information-theoretic gold standard for robustness, several open questions remain:

Estimator Construction and Computational Efficiency: For complex, high-dimensional domains (large parameter spaces, non-convex RL tasks), efficient realization of the minimax regret can require significant algorithmic engineering and convex relaxation (Boone et al., 2024, Xu et al., 2021).
Refinement in the Presence of Irreducible Regret: Handling and quantifying the structure of environments or data where some regret is unavoidable (due to partial observability, identical observation/optimal action sets) continues to motivate refined objectives (BLP, lexicographic MMER) (Beukman et al., 2024, Sadek et al., 3 Jul 2025).
Extensions beyond Regret: While regret-based criteria handle many adversarial scenarios, extensions to other risk measures (CVaR, tail risk, distributional robustness beyond mean regret) and domain-specific constraints remain active topics (Bongole et al., 2024).
Learning under Ambiguity and Noninformative Signals: The MMR framework quantifies the fundamental price of robustness in settings where Nature can, in the large-sample limit, force ex-ante incomplete learning by keeping information arbitrarily weak—resulting in sub-exponential convergence rates and under-inference bias (Che et al., 16 Feb 2026).
Unified Game-Theoretic Treatments: The robustness guarantees and the dual relationship between minimax regret and Nash or perfect-Bayesian equilibria invite systematic study for broader classes of games, especially in partially observable or non-identifiable environments (Beukman et al., 2024).

Minimax Regret Learning provides a mathematically rigorous, highly robust criterion for learning and acting under uncertainty, generalizing and strengthening the traditional focus on mean risk, worst-case risk, or maximin performance. It serves as an organizing principle for group-robust supervised learning, robust RL and online decision making, and distributionally robust statistical inference.