
Problem-Dependent Regret Measure

Updated 22 January 2026
  • Problem-dependent regret measures are defined as adaptive performance bounds that scale with instance-specific characteristics such as gaps, variances, and gradient variations.
  • These measures span fields like online convex optimization, bandits, and reinforcement learning, offering tighter and more informative guarantees than uniform worst-case bounds.
  • Methodologies including algorithmic adaptivity and dynamic belief revision enable systems to differentiate between easy and hard problem instances based on intrinsic complexity.

A problem-dependent regret measure is any notion of regret in sequential decision-making or online learning whose upper or lower bound is not uniform over all problem instances, but instead depends on specific parameters, structure, or observed data of the instance at hand. This category encompasses a wide range of formally distinct regret functionals and instance-dependent complexity measures developed across online convex optimization, bandits, reinforcement learning, decision theory, and control. Problem-dependent regret arises whenever the measure of learner performance adaptively scales with aspects such as gaps, variances, loss sequences, gradients, or ambiguity weights, thereby sharply distinguishing "easy" from "hard" problem instances.
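The baseline object that all of these measures refine is cumulative regret against a comparator. As a minimal, self-contained sketch (toy data and a hypothetical helper name, not code from any cited paper), the regret of an action sequence against the best fixed action in hindsight can be computed as:

```python
def cumulative_regret(losses, choices):
    """Cumulative regret of a learner against the best fixed action in hindsight.
    losses[t][a] = loss of action a at round t; choices[t] = learner's pick at t."""
    T = len(losses)
    learner_loss = sum(losses[t][choices[t]] for t in range(T))
    best_fixed = min(sum(row[a] for row in losses) for a in range(len(losses[0])))
    return learner_loss - best_fixed

# Two-action toy example: action 0 is better on every round.
losses = [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
regret = cumulative_regret(losses, choices=[1, 0, 1])  # learner often picks badly
```

On this toy instance the learner pays 1.8 more than the best fixed action; problem-dependent analysis asks how such a quantity scales with instance structure rather than with the horizon $T$ alone.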

1. Definitions and Formal Criteria

In classical minimax analysis, regret is typically bounded in terms of adversarial or worst-case instance parameters (e.g., time horizon $T$, number of arms $K$, dimension $d$, episode horizon $H$) with no reference to special characteristics of a given problem instance. In contrast, a regret measure is termed problem- or instance-dependent if it scales with observable or structural parameters that vary meaningfully between instances:

  • Gap-dependent regret: Depends on gaps between optimal and suboptimal actions or policies, e.g., $\sum_{i:\Delta_i>0} \Delta_i / \mathrm{KL}(\mu_i, \mu^*)$ in bandits, or policy/action gaps in MDPs (Tirinzoni et al., 2021, Fei et al., 2022).
  • Variance-dependent regret: Scales with empirical variances of the reward/noise, e.g., $\widetilde{O}(d\sqrt{\sum_k \sigma_k^2})$ for bandits/RL (Zhao et al., 2023).
  • Gradient-variation/variation-based regret: In online convex optimization, the regret is bounded by cumulative changes in the gradients or variations of the loss functions, e.g., $O(\sqrt{V_T})$ where $V_T = \sum_{t=2}^T \sup_x \|\nabla f_t(x) - \nabla f_{t-1}(x)\|^2$ (Zhao et al., 25 Nov 2025, Zhao et al., 2021).
  • Small-loss/adaptive regret: Scales with the cumulative loss of the best comparator, or of adaptive comparators over intervals, e.g., $O(\sqrt{L_r^s})$ where $L_r^s$ is the cumulative loss over an interval (Zhang et al., 2019).
  • Data-dependent regret: Explicitly depends on the realized loss sequence, often via the empirical range, variance, or structure of the observed data (Gokcesu et al., 2023, Genalti et al., 26 May 2025).
  • Ambiguity-weighted regret: Uses a weighted set of priors (e.g., MWER), where bounds vary depending on the agent's subjective confidence in model components (Halpern et al., 2013).
  • Instance-complexity-based regret: Scales with intrinsic measures such as instance-specific effective dimension, shell packings, or other instance-wise complexity terms (e.g., kernelized bandits with shell-packing sum $C(\Delta)$) (Shekhar et al., 2022).
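To make the gap-dependent case concrete, the sketch below (illustrative code with hypothetical helper names, not taken from the cited papers) evaluates the constant $\sum_{i:\Delta_i>0} \Delta_i/\mathrm{KL}(\mu_i,\mu^*)$ from the first bullet for two Bernoulli bandit instances; the large-gap instance yields a much smaller constant, i.e., it is genuinely "easier":

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def gap_dependent_constant(means):
    """Sum over suboptimal arms of gap / KL(mu_i, mu*): the instance-dependent
    constant multiplying log T in gap-dependent bandit bounds."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)

easy = gap_dependent_constant([0.9, 0.1])    # large gap -> small constant
hard = gap_dependent_constant([0.55, 0.45])  # small gap -> large constant
```

The same horizon and arm count give very different constants on the two instances, which is exactly the distinction a minimax bound cannot express.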

By incorporating such quantities, the problem-dependent regret measure captures the true difficulty of an instance, potentially providing tighter and more informative theoretical and algorithmic guarantees than minimax analysis.

2. Key Examples Across Frameworks

Decision Theory and MWER

  • Minimax Weighted Expected Regret (MWER):

$$\mathrm{MWER}_{M, P^+}(f) = \sup_{\mu \in P} \alpha_\mu \cdot \mathbb{E}_\mu\left[\max_{g \in M} u(g(s)) - u(f(s))\right]$$

Here, the weights $\alpha_\mu$ encode problem-dependent beliefs (e.g., based on expert knowledge, historical data, or likelihood updates), producing an adaptive regret criterion. Likelihood updating of the $\alpha_\mu$ ensures that, as data accrues, MWER converges to subjective expected utility under the true generating measure, adapting to each instance (Halpern et al., 2013).
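The MWER functional above is straightforward to evaluate on a finite example. The toy setup below (hypothetical states, acts, and weights, chosen only for illustration) computes the weighted expected regret of each act against a two-act menu:

```python
def mwer(act, menu, priors, utilities):
    """Minimax weighted expected regret of `act` against `menu`.
    priors: list of (alpha, dist) pairs, dist[s] = probability of state s.
    utilities[g][s] = utility of act g in state s."""
    states = range(len(utilities[act]))

    def expected_regret(dist):
        # E_mu[ max_{g in M} u(g(s)) - u(act(s)) ]
        return sum(dist[s] * (max(utilities[g][s] for g in menu) - utilities[act][s])
                   for s in states)

    return max(alpha * expected_regret(dist) for alpha, dist in priors)

utilities = {"f": [1.0, 0.0], "g": [0.0, 1.0]}
priors = [(1.0, [0.9, 0.1]),   # full confidence in a prior favouring state 0
          (0.5, [0.1, 0.9])]   # discounted prior favouring state 1
menu = ["f", "g"]
mwer_f = mwer("f", menu, priors, utilities)
mwer_g = mwer("g", menu, priors, utilities)
```

Because the second prior carries weight only 0.5, act f (which does well under the trusted prior) attains the smaller MWER; raising that weight toward 1 would shift the comparison, which is precisely the problem-dependence of the criterion.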

Online Learning and Bandits

  • Range- and variance-based regret: For adversarial bandits or experts, data-dependent regret can be written as $R_T \leq A \max_t \Delta_t + B\sqrt{\sum_{t=1}^T \Delta_t^2}$, where the $\Delta_t$ are realized per-round loss ranges, capturing the problem's heteroscedasticity or observed variability (Gokcesu et al., 2023).
  • Kernelized bandits, instance complexity: Both the lower bound and the algorithmic guarantee are determined by $C(\Delta) = \sum_k m_k/(2^{k+2}\Delta)$, where the $m_k$ are packing numbers of shells defined by the function's value structure around its maximizer. The regret thus depends strongly on the instance geometry and the function's local behavior (Shekhar et al., 2022).
  • Constrained MABs, decomposed regret: Regret bounds take the form $O\left(\frac{K}{\rho}\sqrt{\sum_t (\ell_t^\top(x^\diamond - x^*))^2} + \frac{1}{\rho}\sqrt{K\sum_t \ell_t^\top x^*}\right)$, where the first term is a data-dependent safety complexity and the second a bandit complexity, both problem-specific (Genalti et al., 26 May 2025).
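The range-based bound in the first bullet can be evaluated directly from a realized loss sequence. The sketch below (placeholder constants $A = B = 1$; illustrative only) contrasts a concentrated instance, where the loss range is nonzero in only 1% of rounds, with the worst case of unit range every round:

```python
import math

def range_dependent_bound(ranges, A=1.0, B=1.0):
    """Data-dependent bound A * max_t Delta_t + B * sqrt(sum_t Delta_t^2),
    where Delta_t is the realized per-round loss range."""
    return A * max(ranges) + B * math.sqrt(sum(d * d for d in ranges))

T = 10_000
concentrated = [0.0] * T
concentrated[::100] = [1.0] * (T // 100)   # only 1% of rounds have nonzero range
bound_easy = range_dependent_bound(concentrated)
bound_worst = range_dependent_bound([1.0] * T)
```

Here the data-dependent bound is 11 on the concentrated instance versus 101 in the worst case, even though both sequences share the same horizon $T$.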

Reinforcement Learning

  • Gap-based lower bounds for MDPs: Problem-dependent lower bounds for regret in MDPs are given by $\min_{\eta \ge 0} \sum_{s,a,h} \eta_h(s,a)\,\Delta_h(s,a)$ subject to information-theoretic and flow constraints determined by the MDP's structure, with the solution scaling with policy gaps and dynamics (Tirinzoni et al., 2021).
  • Variance/environmental norm in RL: High-probability regret bounds of the form $\tilde{O}(\sqrt{SAT} + \sigma_*)$, where $\sigma_*$ is the "environmental norm" (the maximum conditional variance of the one-step value), adaptively interpolate between worst-case instances and easy instances with flat or deterministic transitions (Zanette et al., 2019).
  • Variance-dependent regret in linear RL: Horizon-free, problem-specific regret scaling as $O\left(d\sqrt{\mathrm{Var}_K^*} + d^2\right)$, where $\mathrm{Var}_K^*$ is the cumulative variance under the optimal policy (Zhao et al., 2023).
  • Gap-dependent, risk-sensitive RL: "Cascaded gaps" generalize risk-neutral gap analysis to risk-sensitive settings, controlling regret as $O\left((e^{|\beta|H} - 1)^2/(\beta^2\,\Delta_{\min,\beta})\right)$, where $\Delta_{\min,\beta}$ encodes the cascaded problem-dependent gaps related to the nonlinearity and the risk parameter $\beta$ (Fei et al., 2022).
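The environmental-norm idea can be illustrated on a toy tabular MDP. The sketch below (a simplification that measures the variance of a next-state value $V(s')$ only, with hypothetical transition tables) shows the quantity vanishing under deterministic dynamics and growing with transition noise:

```python
def environmental_norm_sq(transitions, values):
    """max over (s, a) of Var_{s' ~ P(.|s,a)}[ V(s') ] on a toy tabular MDP.
    transitions[(s, a)] = list of (prob, next_state) pairs."""
    worst = 0.0
    for (s, a), dist in transitions.items():
        mean = sum(p * values[s2] for p, s2 in dist)
        var = sum(p * (values[s2] - mean) ** 2 for p, s2 in dist)
        worst = max(worst, var)
    return worst

V = [0.0, 1.0]
# Deterministic chain: every next-state is fixed, so the norm is zero.
det = {(0, 0): [(1.0, 1)], (1, 0): [(1.0, 1)]}
# Noisy variant: action 0 at state 0 is a fair coin between the two states.
noisy = {(0, 0): [(0.5, 0), (0.5, 1)], (1, 0): [(1.0, 1)]}
sigma_det = environmental_norm_sq(det, V)
sigma_noisy = environmental_norm_sq(noisy, V)
```

A bound of the form $\tilde{O}(\sqrt{SAT} + \sigma_*)$ then automatically sheds its variance term on the deterministic instance, with no change to the algorithm.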

Online Convex Optimization

  • Gradient-variation and small-loss regret: As noted in Section 1, OCO algorithms achieve $O(\sqrt{V_T})$ regret in terms of the gradient variation $V_T$, as well as bounds scaling with the cumulative loss of the best (interval-wise) comparator, adapting automatically to slowly varying or low-loss environments without prior knowledge of these quantities (Zhao et al., 25 Nov 2025, Zhao et al., 2021, Zhang et al., 2019).
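The gradient variation $V_T = \sum_{t=2}^T \sup_x \|\nabla f_t(x) - \nabla f_{t-1}(x)\|^2$ is easy to compute for simple loss families. The sketch below (scalar quadratic losses and a finite-grid approximation of the supremum; illustrative only) shows $V_T = 0$ for a static loss sequence and a small positive value for a slowly drifting one:

```python
def gradient_variation(coeffs, grid):
    """V_T = sum_{t>=2} sup_x |f_t'(x) - f_{t-1}'(x)|^2 for scalar losses
    f_t(x) = 0.5 * a_t * x^2 + b_t * x, with the sup approximated over `grid`."""
    def grad(a, b, x):
        return a * x + b

    total = 0.0
    for (a0, b0), (a1, b1) in zip(coeffs, coeffs[1:]):
        total += max((grad(a1, b1, x) - grad(a0, b0, x)) ** 2 for x in grid)
    return total

grid = [i / 10 - 1 for i in range(21)]  # feasible set [-1, 1], step 0.1
static = gradient_variation([(1.0, 0.5)] * 5, grid)            # unchanged losses
drifting = gradient_variation([(1.0, 0.5), (1.0, 0.6), (1.0, 0.7)], grid)
```

An $O(\sqrt{V_T})$ bound therefore yields constant regret on the static sequence, whereas a minimax $O(\sqrt{T})$ bound cannot distinguish the two.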

3. Methodologies for Achieving Problem-Dependent Regret

Several methodological paradigms are regularly employed to realize or analyze problem-dependent regret:

  • Algorithmic adaptivity: Design of algorithms (often via second-order, scale-free, or meta-learning techniques) that do not require prior knowledge of the problem-dependent quantity but automatically adapt to its realized value (e.g., gradient-var adaptive OCO, data-adaptive bandits, likelihood-updating of ambiguity weights) (Zhao et al., 25 Nov 2025, Zhang et al., 2019, Genalti et al., 26 May 2025, Halpern et al., 2013).
  • Information-theoretic constraints: Lower bounds derived via change-of-measure or KL-based arguments where the required "exploration budget" per instance arises from the instance's intrinsic statistical or structural distinguishability properties (Tirinzoni et al., 2021, Tranos et al., 2021).
  • Instance-specific complexity measures: Introduction of explicit complexity measures (e.g., the environmental norm $\sigma_*$, shell-packing numbers, cascaded gaps, path length, or comparator class complexity $W$) which capture the structural elements governing the learning challenge in a concrete instance (Zanette et al., 2019, Shekhar et al., 2022, Fei et al., 2022, Gokcesu et al., 2023).
  • Dynamic updating and belief revision: Bayesian, likelihood-based, or menu-expansion-weight updating schemes ensure that regret adapts as learning progresses (e.g., MWER's weight updating, dynamic ensemble OCO frameworks) (Halpern et al., 2013, Zhao et al., 2021).
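The algorithmic-adaptivity paradigm is exemplified by self-tuning step sizes of the AdaGrad family. The sketch below (illustrative constants; $\eta_t = D/\sqrt{\sum_{s \le t} \|g_s\|^2}$ is a standard scale-free schedule, not a construction from the cited papers) shows the learner adapting to the realized gradient scale with no prior tuning:

```python
import math

def adagrad_step_sizes(grad_norms_sq, D=1.0):
    """Self-tuning step sizes eta_t = D / sqrt(sum_{s<=t} ||g_s||^2).
    The learner never needs the gradient scale in advance: the schedule
    adapts to the realized (instance-dependent) gradient sequence."""
    running, etas = 0.0, []
    for g2 in grad_norms_sq:
        running += g2
        etas.append(D / math.sqrt(running) if running > 0 else D)
    return etas

# Small gradients -> large steps; once large gradients appear, the step
# size shrinks automatically.
etas = adagrad_step_sizes([0.01, 0.01, 1.0, 1.0])
```

This is the mechanism behind many of the data-dependent bounds above: the regret analysis inherits whatever scale the instance actually exhibits, rather than a worst-case constant fixed in advance.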

4. Axiomatic and Structural Foundations

Problem-dependent regret measures often require nuanced axiomatization to justify and delineate their properties:

  • MWER characterization: An explicit axiomatization (transitivity, monotonicity, ambiguity aversion, etc.) yields an equivalence between preference relations and minimax weighted expected regret, with further dynamic consistency axioms ensuring stability under information flow (Halpern et al., 2013).
  • Dynamic decision theory: Menu-selection rules parameterize which set of options enter the regret calculation, inducing problem-dependence (e.g., inclusion or exclusion of forgone opportunities), with dynamic consistency tied to the belief structure and updating mode (Halpern et al., 2015).
  • Regret kernels and stochastic dominance: For probabilistic lotteries, axiomatic requirements such as transitivity, stochastic dominance, and super-additivity uniquely characterize specific (problem-dependent) regret kernels that match observed choice behavior (Bardakhchyan et al., 2023).

5. Practical and Theoretical Consequences

Problem-dependent regret analysis delivers several theoretical and algorithmic benefits:

  • Sharper performance guarantees: Boundaries between easy and hard instances become explicit (e.g., constant regret in deterministic or concentrated loss settings; logarithmic rather than $\sqrt{T}$ scaling in small-loss regimes) (Gokcesu et al., 2023, Genalti et al., 26 May 2025, Zhang et al., 2019).
  • Learning and inference: In settings with ambiguity (e.g., MWER), adaptive revision of confidence in hypotheses allows regret to converge to the fundamental limit as the learner identifies the true environment (Halpern et al., 2013).
  • Policy design and selection: In dynamic or menu-dependent settings, the designer can select menus, update rules, and belief structures to recover or break time-consistency, adjust the extent of exploration, or mitigate undesirable behaviors such as procrastination (Halpern et al., 2015).
  • Instance-optimality: Lower and upper bounds, incorporating problem-specific structure, certify that no uniform algorithm can improve over the problem-dependent scaling (instance-optimality) (Tirinzoni et al., 2021, Shekhar et al., 2022).
  • Robustness and universality: Meta-ensembles and scale-free strategies ensure minimax optimality in worst-case regimes while automatically adapting to benign structure, thus obviating the need for adversarial tuning (Zhao et al., 25 Nov 2025).

6. Limitations and Open Questions

Despite their advantages, problem-dependent regret measures can present several subtleties:

  • Menu- and set-dependence: Menu-dependent regret (as in MWER and dynamic regret minimization) can violate classic decision-theoretic axioms (e.g., the independence of irrelevant alternatives), creating subtleties in both static and dynamic modeling (Halpern et al., 2013, Halpern et al., 2015).
  • Necessity of careful updating rules: Without appropriate likelihood updating or dynamic consistency axioms, time-inconsistent or even paradoxical preference reversals may occur (Halpern et al., 2013, Halpern et al., 2015).
  • Instance-complexity estimation: For certain measures (e.g., environmental norm in RL, shell packing in bandits), computation of the problem complexity may not be tractable without post hoc data or may demand domain knowledge (Zanette et al., 2019, Shekhar et al., 2022).
  • Incomplete generality: Not every sharp upper bound is known to be instance-optimal for all settings; in some domains matching upper/lower bounds or robustly adaptive methods remain open research directions (Zhao et al., 2023).

7. Comparative Summary Table

| Domain | Problem-Dependent Measure | Instance Parameter(s) | Reference |
|---|---|---|---|
| Decision theory | MWER (weighted regret) | Ambiguity weights $\alpha_\mu$ | (Halpern et al., 2013) |
| Bandits (general) | Data-dependent regret | Per-round observed ranges, variances | (Gokcesu et al., 2023) |
| Kernel bandits | Instance complexity $C(\Delta)$ | Shell packing/geometry of $f$ | (Shekhar et al., 2022) |
| Online convex optimization | Gradient-variation regret | $V_T$, path length, comparator loss | (Zhao et al., 25 Nov 2025) |
| RL (episodic tabular) | Environmental norm | $\sigma_\star^2 = \max_{s,a,t} \mathrm{Var}[R+V]$ | (Zanette et al., 2019) |
| RL (finite-horizon) | Gap-based LP | Policy/action gaps, KL under dynamics | (Tirinzoni et al., 2021) |
| RL (risk-sensitive) | Cascaded gap | Nonlinear Bellman gap, $\beta$, $H$ | (Fei et al., 2022) |
| Linear RL/bandits | Variance-dependent regret | Realized noise/variance sequence | (Zhao et al., 2023) |
| Constrained bandits | Decomposed (safety/bandit) | Loss gap of Slater point, loss sum | (Genalti et al., 26 May 2025) |

Problem-dependent regret thus provides a unified yet highly granular quantification of learning complexity, simultaneously enabling sharper quantitative analysis, improved instance-adaptive algorithms, and deeper structural insights into the foundations of sequential decision-making.
