
Regret-Optimal Learning Algorithms

Updated 12 November 2025
  • Regret-optimal learning algorithms are methods that minimize the gap between their cumulative loss and that of the hindsight-optimal strategy, ensuring optimal performance in both worst-case and instance-specific regimes.
  • These approaches leverage structured problem characteristics in online convex optimization, bandit problems, and reinforcement learning to achieve tight theoretical bounds such as O(√T) and logarithmic regret.
  • They employ advanced techniques such as minimax duality, variance reduction, and strategic policy switching to efficiently manage uncertainty and improve adaptive decision-making in complex environments.

Regret-optimal learning algorithms are a class of methods in online learning, sequential decision-making, and reinforcement learning whose design and performance guarantees rest on minimizing the cumulative regret with respect to the best possible action sequence, policy, or decision rule, often under strong adversarial or statistical optimality criteria. "Regret" here is precisely defined as the difference in cumulative loss (or equivalently, suboptimality in accumulated reward) compared to a suitable reference—typically the hindsight-optimal strategy. A learning algorithm is said to be regret-optimal if it matches the minimax regret lower bound for the problem class, either in worst-case (adversarial) or problem-dependent (instance-specific) form. The past decade has produced a variety of regret-optimal algorithms in online convex optimization, bandits (including structured or nonparametric), and model-free/model-based reinforcement learning, often exploiting problem structure such as smoothness, unimodality, or known policy classes.

1. Minimax Regret and Theoretical Foundations

A central theoretical framework for regret-optimality is minimax duality in online convex optimization (0903.5328). The minimax regret over a sequence of T rounds is defined as

R^*_T = \inf_{\mathrm{Alg}}\; \sup_{\ell_1, \ldots, \ell_T} \left[\sum_{t=1}^T \ell_t(x_t) - \min_{x^*} \sum_{t=1}^T \ell_t(x^*)\right]

where each \ell_t is a convex loss function and the algorithm selects x_t, possibly based on the past loss history. By von Neumann's minimax theorem, this value can be re-expressed as the difference between the expected sum of per-round minimizers (with respect to the adversary's stochastic strategy) and the best empirical loss:

R_T^* = \sup_{p_{1:T}}\left[\sum_{t=1}^T \Phi(p_t) - \mathbb{E}_{\ell_{1:T}\sim p_{1:T}} \Phi\left(\frac{1}{T}\sum_t \delta_{\ell_t}\right)\right]

with \Phi(p) = \min_x \mathbb{E}_{\ell\sim p}\left[\ell(x)\right]. The gap is characterized as the Jensen gap (or Bregman divergence) of the problem's value functional, yielding tight upper and lower bounds for various classes:

  • For strongly convex losses, R^*_T = O(\log T),
  • For general convex classes, R^*_T = O(\sqrt{T}\,\mathrm{Rad}_T), where \mathrm{Rad}_T is the Rademacher complexity,
  • For expert advice (the simplex), R^*_T = O(\sqrt{T \log N}).

Importantly, these bounds hold independently of any particular algorithmic construction, yet algorithms such as mirror descent and follow-the-regularized-leader can be shown to match them.
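
To make the general convex O(\sqrt{T}) rate concrete, the following is a minimal sketch of projected online gradient descent over an l2 ball, one standard member of the mirror-descent family; the step-size schedule, the domain, and the toy linear losses are illustrative assumptions rather than details taken from the cited work.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def online_gradient_descent(loss_grads, dim, radius=1.0):
    """Projected online gradient descent with step size eta_t = radius / sqrt(t).

    loss_grads[t](x) must return a (sub)gradient of the round-t convex loss
    at x.  For G-Lipschitz losses over a ball of the given radius this
    schedule gives O(G * radius * sqrt(T)) regret.
    """
    x = np.zeros(dim)
    iterates = []
    for t, grad in enumerate(loss_grads, start=1):
        iterates.append(x.copy())
        x = project_l2_ball(x - (radius / np.sqrt(t)) * grad(x), radius)
    return iterates

# Toy usage: the adversary plays linear losses <g_t, x> with random directions.
rng = np.random.default_rng(0)
gs = [rng.standard_normal(5) for _ in range(1000)]
grads = [lambda x, g=g: g for g in gs]          # gradient of <g, x> is g
xs = online_gradient_descent(grads, dim=5)

cum_loss = sum(g @ x for g, x in zip(gs, xs))
best_fixed = -np.linalg.norm(sum(gs))           # best fixed point in the unit ball
print("empirical regret:", cum_loss - best_fixed)
```

With G-Lipschitz losses on a ball of radius R, this schedule guarantees regret of order G R \sqrt{T}, matching the general convex bound above up to constants.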

2. Regret-Optimal Algorithms in Bandit and Structured Decision Problems

In classical stochastic bandits, asymptotically optimal instance-dependent regret is achieved by algorithms such as KL-UCB and Thompson sampling (a minimal KL-UCB index sketch follows this list):

  • For K arms with unknown distributions, the Lai–Robbins bound shows that any uniformly good policy must satisfy

\liminf_{T \to \infty} \frac{E[t_k(T)]}{\log T} \geq \frac{1}{I(\theta_k, \theta^*)}

with I(\cdot,\cdot) the Kullback–Leibler divergence.

  • In Lipschitz bandits, regret-optimal methods such as OSLB (Magureanu et al., 2014) and CKL-UCB compute adaptive upper-confidence indices based on the local information structure, achieving

R(T) \sim C(\theta) \log T

where C(\theta) is a solution to a linear program reflecting the information constraints imposed by the Lipschitz condition.
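
As a concrete illustration of the index computations behind KL-UCB-style algorithms, here is a minimal sketch that computes the Bernoulli KL-UCB index by bisection and runs it on a toy bandit; the exploration budget log t + c log log t, the constant c, and the bisection tolerance are standard illustrative choices rather than the exact tuning of the cited papers.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=3.0, tol=1e-6):
    """Largest q >= mean with pulls * KL(mean, q) <= log(t) + c*log(log(t)),
    found by bisection; arms that were never pulled get the maximal index."""
    if pulls == 0:
        return 1.0
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Toy run on a 3-armed Bernoulli bandit.
random.seed(0)
probs = [0.3, 0.5, 0.7]
counts, sums = [0] * 3, [0.0] * 3
for t in range(1, 5001):
    means = [sums[k] / counts[k] if counts[k] else 0.0 for k in range(3)]
    arm = max(range(3), key=lambda k: kl_ucb_index(means[k], counts[k], t))
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
print("pull counts:", counts)   # suboptimal arms receive only O(log T) pulls
```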

For unimodal and graph-structured bandit settings, algorithms like OSUB (Combes et al., 2014) exploit graph topology, attaining

\limsup_{T \to \infty} \frac{R(T)}{\log T} = \sum_{(k, k^*) \in E} \frac{\mu^* - \mu_k}{I(\mu_k, \mu^*)}

independent of the total number of arms.

Under combinatorial restrictions, such as the "dying experts" problem (Shayestehmanesh et al., 2019), regret-optimal rates adapt to the instance structure (a basic aggregation sketch over a shrinking expert set follows this list):

  • Unknown death order: O(\sqrt{mT \ln K}),
  • Known death order: O(\sqrt{T(m + \ln K)}), using efficient polynomial-time strategies (HPU/HPK) exploiting the grouping of equivalent expert permutations.
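
The following sketch is not the HPU/HPK strategies of the cited paper, which compete against a stronger comparator defined by orderings of dying experts; it only illustrates the underlying aggregation primitive, exponential weights restricted to the experts still alive each round, with an arbitrary learning rate and a synthetic death schedule as assumptions.

```python
import numpy as np

def hedge_with_dying_experts(loss_matrix, alive_matrix, eta=0.1):
    """Exponential weights restricted to the experts still alive each round.

    loss_matrix[t, i]  : loss of expert i at round t (assumed in [0, 1]).
    alive_matrix[t, i] : True if expert i is still alive at round t.
    Returns the learner's expected loss at every round.
    """
    T, K = loss_matrix.shape
    cum_loss = np.zeros(K)
    learner_losses = np.zeros(T)
    for t in range(T):
        alive = alive_matrix[t].astype(bool)
        weights = np.exp(-eta * cum_loss[alive])
        probs = weights / weights.sum()
        learner_losses[t] = probs @ loss_matrix[t, alive]
        cum_loss += loss_matrix[t]          # keep tracking every expert's loss
    return learner_losses

# Toy usage: 4 experts, the best one dies halfway through the horizon.
rng = np.random.default_rng(1)
T, K = 200, 4
losses = rng.uniform(size=(T, K))
losses[:, 0] *= 0.2                          # expert 0 is clearly best...
alive = np.ones((T, K), dtype=bool)
alive[T // 2:, 0] = False                    # ...but dies at round T/2
print("average learner loss:", hedge_with_dying_experts(losses, alive).mean())
```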

3. Regret-Optimality in Reinforcement Learning and Control

Regret-optimal learning in Markov Decision Processes (MDPs) aligns average- or cumulative-reward regret with minimax lower bounds in both model-free and model-based paradigms.

  • For continuing-horizon MDPs, state-of-the-art algorithms based on optimism in the face of uncertainty (OFU) and careful evaluation of the bias function h^* (Zhang et al., 2019) deliver

R(T) = \widetilde{O}\big(\sqrt{S A H T}\big)

with S the number of states, A the number of actions, H a bound on the bias span, and T the time horizon. This matches the lower bound up to log factors and improves on previous methods by a key factor of \sqrt{S} through tighter control of bias differences rather than absolute values (a simplified optimistic Q-learning sketch follows this list).

  • For selecting the correct latent state representation among candidate models, OMS (Maillard et al., 2013) achieves O(\sqrt{T}) regret where prior work incurred suboptimal O(T^{2/3}) rates. This is facilitated by doubly-optimistic strategies in both model selection and policy computation.
  • In model-free settings for discounted MDPs, recent work (Ji et al., 2023) achieves the minimax \widetilde{O}(\sqrt{S A T}/(1-\gamma)^{3/2}) regret with a short burn-in period via variance reduction and slow-yet-adaptive policy switching.
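
To make the optimism-in-the-face-of-uncertainty template concrete, here is a minimal sketch of optimistic Q-learning with Hoeffding-style bonuses in a finite-horizon tabular MDP; the environment interface (env_reset, env_step), the bonus constant, the learning-rate schedule, and the toy environment are illustrative assumptions, and this simplified episodic variant is not the specific algorithm of any of the papers cited above.

```python
import random
import numpy as np

def optimistic_q_learning(env_reset, env_step, S, A, H, K, c=1.0, delta=0.1):
    """Tabular optimistic Q-learning with Hoeffding-style bonuses.

    env_reset() -> initial state index; env_step(s, a, h) -> (next_state, reward),
    with rewards assumed in [0, 1].  S, A: numbers of states and actions;
    H: episode length; K: number of episodes.
    """
    Q = np.full((H, S, A), float(H))         # optimistic initialization
    N = np.zeros((H, S, A), dtype=int)
    iota = np.log(S * A * H * K / delta)     # log confidence term
    for _ in range(K):
        s = env_reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))                     # greedy w.r.t. optimistic Q
            s_next, r = env_step(s, a, h)
            N[h, s, a] += 1
            n = N[h, s, a]
            alpha = (H + 1) / (H + n)                       # learning rate
            bonus = c * np.sqrt(H ** 3 * iota / n)          # optimism bonus
            v_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
            Q[h, s, a] += alpha * (r + v_next + bonus - Q[h, s, a])
            s = s_next
    return Q

# Hypothetical toy environment: a 2-state chain where action 1 in state 0
# reaches the rewarding absorbing state 1 with probability 0.9.
random.seed(0)
def env_reset():
    return 0
def env_step(s, a, h):
    if s == 0:
        return (1, 1.0) if (a == 1 and random.random() < 0.9) else (0, 0.0)
    return 1, 1.0

Q = optimistic_q_learning(env_reset, env_step, S=2, A=2, H=5, K=500)
print("Q at (h=0, s=0):", Q[0, 0])
```

The (H + 1)/(H + n) step size discounts stale targets so that optimism propagates correctly across the horizon; analyses of this family of algorithms rely on such schedules rather than plain 1/n averaging.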

For problems with additional structure:

  • Threshold-structured MDPs can be converted to bandit problems over a small policy class ("arms") (Prabuchandran et al., 2016), achieving logarithmic regret and a drastically reduced computational burden compared to generic UCRL/PSRL (a minimal sketch of this reduction follows the list).
  • In cooperative bandits, fully distributed algorithms with constant communication and optimal group and individual regret (DoE-bandit (Yang et al., 2023)) have been developed.
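
To illustrate the policy-class-as-arms reduction mentioned above, the sketch below runs a standard UCB1 index over a handful of candidate threshold policies; the rollout interface, the synthetic return model, and the use of UCB1 itself are assumptions for illustration, not the particular index or analysis of the cited paper.

```python
import math
import random

def ucb_over_policies(policies, rollout, T, seed=0):
    """Run UCB1 over a small set of candidate policies ("arms").

    rollout(policy, rng) must return the return of one episode executed
    with that policy, assumed to lie in [0, 1].
    """
    rng = random.Random(seed)
    K = len(policies)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, T + 1):
        if t <= K:                                   # play each policy once
            arm = t - 1
        else:
            arm = max(range(K), key=lambda k:
                      sums[k] / counts[k] + math.sqrt(2 * math.log(t) / counts[k]))
        counts[arm] += 1
        sums[arm] += rollout(policies[arm], rng)
    best = max(range(K), key=lambda k: sums[k] / counts[k])
    return policies[best], counts

# Hypothetical usage: arms are threshold policies "act when the state exceeds c".
thresholds = [0.2, 0.4, 0.6, 0.8]
def noisy_return(threshold, rng):
    # Synthetic episode return, peaked at threshold 0.6 and clipped to [0, 1].
    return max(0.0, min(1.0, 1.0 - abs(threshold - 0.6) + rng.gauss(0, 0.05)))

best, counts = ucb_over_policies(thresholds, noisy_return, T=2000)
print("selected threshold:", best, "pull counts:", counts)
```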

Regret-optimal control frameworks also extend to robust estimation and filtering by bounding pathlength regret—not just in bandits but in adaptive Kalman filtering for non-stationary time series (Goel et al., 2021).

4. Variance Reduction and Function Approximation in Regret-Optimal RL

Variance-weighted algorithms such as VOQL (Agarwal et al., 2022) and Q-EarlySettled-LowCost (Zhang et al., 5 Jun 2025) exploit completeness and eluder-dimension assumptions, or use batched policy updates and careful concentration arguments, to achieve finite-sample minimax regret in model-free RL and under function approximation:

  • For linear function classes with eluder dimension d, VOQL achieves O(d\sqrt{H T} + d^6 H^5) regret over T episodes, which matches known lower bounds.
  • Burn-in and switching/communication costs are also minimized in recent Q-learning approaches; see Q-EarlySettled-LowCost, which achieves linear-in-state-action burn-in cost and logarithmic policy switching (Zhang et al., 5 Jun 2025).

These methods exploit sophisticated bonus and confidence mechanisms, variance-aware regression, and careful control of policy switching events to succeed where classic approaches incur large constants or suboptimal exponents.
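
As a sketch of the variance-aware regression primitive such methods build on, the snippet below fits a weighted ridge regression with weights inversely proportional to estimated target variance and forms the corresponding elliptical optimism bonus; the function names, the variance floor, and the constants are illustrative assumptions, not the exact estimators of VOQL or Q-EarlySettled-LowCost.

```python
import numpy as np

def variance_weighted_ridge(X, y, var_estimates, lam=1.0, var_floor=1e-2):
    """Weighted ridge regression with per-sample weights 1 / max(variance, floor).

    Low-variance targets get larger weight, which is the basic mechanism
    behind variance-aware regression and the tighter confidence sets it allows.
    """
    w = 1.0 / np.maximum(var_estimates, var_floor)
    A = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
    b = X.T @ (w * y)
    return np.linalg.solve(A, b), A          # A doubles as the weighted design matrix

def elliptical_bonus(x, A, beta=1.0):
    """Optimism bonus beta * sqrt(x^T A^{-1} x) from the weighted design matrix."""
    return beta * np.sqrt(x @ np.linalg.solve(A, x))

# Toy usage with synthetic heteroscedastic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
theta_true = np.array([1.0, -2.0, 0.5, 0.0])
sigmas = rng.uniform(0.1, 2.0, size=500)
y = X @ theta_true + sigmas * rng.standard_normal(500)
theta_hat, A = variance_weighted_ridge(X, y, sigmas ** 2)
print("estimate:", np.round(theta_hat, 2))
print("bonus at a new feature:", elliptical_bonus(rng.standard_normal(4), A))
```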

5. Black-Box and Predict-Then-Optimize Regret-Optimal Methods

In black-box supervised predict-then-optimize paradigms (Tan et al., 12 Jun 2024), where the goal is to pick the decision maximizing a learned reward function (often under observational or bandit feedback constraints), standard mean-squared-error training can yield poor regret. A key innovation is the Empirical Soft Regret (ESR) loss:

L_{ESR,k}(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{|y_i - y_{n(i)}|}{1 + \exp\big(k \cdot \mathrm{sgn}(y_i - y_{n(i)}) \cdot (\hat{f}_\theta(x_i, w_i) - \hat{f}_\theta(x_{n(i)}, w_{n(i)}))\big)}

where n(i) is a nearest-neighbor index and k is a temperature parameter. For neural models trained under ESR, asymptotically optimal regret is achieved under mild regularity conditions.
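
Because the ESR loss is fully specified by the formula above, a direct numpy transcription is given below; it assumes the nearest-neighbor matching n(i) and the model predictions have already been computed, and the toy data at the end is purely illustrative.

```python
import numpy as np

def empirical_soft_regret(y, y_neighbor, pred, pred_neighbor, k=1.0):
    """Empirical Soft Regret (ESR) loss, transcribing the formula above.

    y, y_neighbor       : observed outcomes for each sample and its matched
                          nearest neighbor (taken under a different decision).
    pred, pred_neighbor : model scores f_theta(x_i, w_i) and f_theta(x_{n(i)}, w_{n(i)}).
    k                   : temperature; larger k sharpens the soft ranking penalty.
    """
    diff = y - y_neighbor
    margin = pred - pred_neighbor
    # Each term weights |y_i - y_{n(i)}| by a factor close to 1 exactly when
    # the model scores the worse decision above the better one.
    return np.mean(np.abs(diff) / (1.0 + np.exp(k * np.sign(diff) * margin)))

# Toy usage with random neighbors and predictions.
rng = np.random.default_rng(0)
y, y_nb = rng.uniform(size=100), rng.uniform(size=100)
pred, pred_nb = rng.standard_normal(100), rng.standard_normal(100)
print("ESR loss:", empirical_soft_regret(y, y_nb, pred, pred_nb, k=2.0))
```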

Empirically, ESR outperforms standard direct-method and causal-inference methods on challenging benchmarks in personalized healthcare and news recommendation, yielding substantially lower observed regret and higher realized policy value (Tan et al., 12 Jun 2024).

6. Practical Significance and Future Directions

Regret-optimality provides a unifying metric for sequential learning and decision-making under uncertainty, directly aligning algorithm design with statistical and control-theoretic lower bounds. The field has matured to provide methods that not only match benchmark rates in adversarial or stochastic models, but also exploit domain structure (bandit arm similarity, graph topology, MDP policy class, etc.) for improved sample and computational efficiency.

Open challenges include:

  • Combined statistical and algorithmic optimality (closing gaps in computational efficiency for large-scale RL and adversarial settings, as in (Liu et al., 2023)).
  • Extensions to general function-approximation and deep RL regimes.
  • Instance-aware or adaptive regret measures (e.g., pathlength, queue-length, or planning regret) that may further refine algorithm design and performance.
  • Robustness to non-stationary or adversarial environments, including federated and distributed protocols.

Continued progress in regret-optimal learning algorithms thus hinges on further exploitation of problem structure, tight integration with statistical complexity measures, and the development of scalable, robust, and adaptive computational methods.
