Problem-Dependent Regret Bounds Overview

Updated 12 May 2026

The paper introduces refined regret bounds that leverage instance-specific parameters such as suboptimality gaps and variance to achieve logarithmic scaling in many bandit settings.
Methodologies employ martingale concentration, KL divergence, and occupancy measures to derive matching upper and lower bounds across stochastic, contextual, and reinforcement learning models.
These insights inform adaptive algorithm design by aligning exploration strategies with the statistical difficulty of the instance, ensuring near-optimal performance.

Problem-dependent regret bounds precisely characterize the performance of online learning algorithms by quantifying how regret scales with intrinsic features of the problem instance, such as suboptimality gaps, noise variances, model structure, or other data-dependent quantities, rather than solely with respect to worst-case measures like time horizon or number of actions. These bounds underpin algorithmic optimality in multi-armed bandits, contextual bandits, reinforcement learning, online convex optimization, and adversarial bandit frameworks by reflecting the statistical or structural “difficulty” of a given instance, often matching instance-specific lower bounds up to constant or logarithmic factors.

1. Classical Problem-dependent Regret in Stochastic Bandits

The prototypical setting for problem-dependent regret is the stochastic $N$-armed bandit. Here, each arm $i$ has an unknown mean reward $\mu_i$, and the suboptimality gap $\Delta_i = \mu^* - \mu_i$ (where $\mu^* = \max_i \mu_i$) governs the exploration–exploitation tradeoff.

The celebrated result of Agrawal & Goyal shows that Thompson Sampling with $\operatorname{Beta}(1,1)$ priors achieves the expected-regret bound $\mathbb{E}[R(T)]\le (1+\varepsilon)\sum_{i:\;\Delta_i>0}\frac{\ln T}{\Delta_i} + O\left(\frac{N}{\varepsilon^2}\right)$ for any $\varepsilon\in(0,1]$, with all constants explicit and finite-time optimality up to a $(1+\varepsilon)$ multiplicative factor (Agrawal et al., 2012). The proof employs martingale-based techniques and concentration bounds to control how often and for how long suboptimal arms are pulled, leading to logarithmic scaling in $T$ per arm plus an additive $O(N/\varepsilon^2)$ slack that does not grow with $T$.

This finite-time upper bound complements the asymptotic lower bound of Lai & Robbins and generalizes to broader bounded-reward or contextual settings via analogous arguments.
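The Thompson Sampling procedure discussed above can be sketched for Bernoulli arms as follows. This is a minimal illustration rather than the authors' implementation: the function names, the Bernoulli reward model, and the seed handling are assumptions made for the example, and `gap_leading_term` simply evaluates the $\sum_i \ln T/\Delta_i$ term of the bound without the $(1+\varepsilon)$ factor or the additive slack.

```python
import math
import random


def thompson_sampling_bernoulli(means, horizon, seed=0):
    """Run Beta(1,1) Thompson Sampling on Bernoulli arms (illustrative sketch).

    Returns the cumulative pseudo-regret and the number of pulls per arm.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    best_mean = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior; the Beta(1,1)
        # prior contributes the "+ 1" to both shape parameters.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        regret += best_mean - means[arm]  # pseudo-regret uses the true gap
    return regret, pulls


def gap_leading_term(means, horizon):
    """Evaluate sum over suboptimal arms of ln(T) / Delta_i."""
    best_mean = max(means)
    return sum(
        math.log(horizon) / (best_mean - m) for m in means if m < best_mean
    )
```

Calling `thompson_sampling_bernoulli([0.9, 0.5], 2000)` and comparing the returned regret with `gap_leading_term([0.9, 0.5], 2000)` gives a rough feel for the logarithmic scaling; a single run is noisy, so averaging over several seeds would be needed for a meaningful comparison.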

2. Variance- and Gap-dependent Regret: Modern Directions

Beyond classical gap-dependent bounds, recent research emphasizes variance-dependent regret, further refining problem-adaptivity:

Stochastic Bandits: For arms with variance $\sigma_i^2$, the optimal regret is $O\left(\sum_{i:\;\Delta_i>0}\left(\frac{\sigma_i^2}{\Delta_i}+1\right)\log T\right)$, with lower bounds matching the $\sigma_i^2/\Delta_i$ scaling up to constants (Ito et al., 2022). This reflects that lower-variance arms can be eliminated faster, reducing exploration cost.
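Linear Contextual Bandits: The minimax regret $\tilde{O}(d\sqrt{T})$ can be sharpened to $\tilde{O}\left(d\sqrt{\sum_{t=1}^{T}\sigma_t^2}+d\right)$ for heteroscedastic noise sequences with per-round variances $\sigma_t^2$, with tight corresponding lower bounds of order $d\sqrt{\sum_{t=1}^{T}\sigma_t^2}$ (up to logarithmic factors) for both prefixed and adaptively chosen variance cases (He et al., 15 Mar 2025, Zhao et al., 2023).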
Reinforcement Learning: In finite-horizon episodic MDPs with $S$ states, $A$ actions, horizon $H$, and $T$ total steps, regret can be bounded by the “environmental norm”, the variance of the next-state value functions under optimality: $\tilde{O}\left(\sqrt{\mathbb{Q}^* SAT}\right)$, where $\mathbb{Q}^*$ is the maximum one-step conditional variance over all $(s,a,h)$ under the true transition and reward, strictly improving the standard $\tilde{O}\left(\sqrt{HSAT}\right)$ scaling when $\mathbb{Q}^* \ll H$ (Zanette et al., 2019).
Adversarial Bandits: First- and second-order instance quantities such as the cumulative loss of the best arm $L_T^*$, or the quadratic variation $Q_T$, allow for refined (small-loss or low-variation) high-probability and expectation bounds, with tight lower bounds expressed in terms of $L_T^*$ and $Q_T$ (Gerchinovitz et al., 2016, Lee et al., 2020).
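To illustrate how an algorithm can exploit low variance in the stochastic bandit case, the sketch below computes a variance-aware upper confidence index in the style of UCB-V (Audibert, Munos & Szepesvári). This is a swapped-in, well-known technique used only for illustration; it is not the algorithm of Ito et al., and the constants, the `reward_range` parameter, and the function name are assumptions following one common parameterization.

```python
import math


def ucb_v_index(sample_mean, sample_var, n_pulls, t, reward_range=1.0):
    """Variance-aware index: mean + sqrt(2 V log t / n) + 3 b log t / n.

    sample_var is the empirical variance V of the arm's rewards and
    reward_range is the assumed bound b on the rewards.
    """
    if n_pulls == 0:
        return math.inf  # force at least one pull of every arm
    log_t = math.log(t)
    variance_term = math.sqrt(2.0 * sample_var * log_t / n_pulls)
    range_term = 3.0 * reward_range * log_t / n_pulls
    return sample_mean + variance_term + range_term
```

With equal means and pull counts, a low-variance arm receives a smaller exploration bonus than a high-variance one, so it stops being explored sooner. That is the mechanism behind regret that scales with variance over gap rather than with the inverse gap alone.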

3. Structured/Generalized Problem-dependent Bounds

In structured or generalized bandit models, such as structured finite-armed bandits or online convex optimization with curvature, the structure itself gives rise to sharper regret scaling:

Structured Bandits: In finite-armed bandits with dependencies among arm means via a shared latent parameter, the UCB-S algorithm attains a problem-dependent regret bound that, when the finite-regret condition fails, matches the lower bound up to constants (Lattimore et al., 2014).

Multi-agent Bandits: In decentralized bandits with a connected communication graph, the instance-dependent lower bound matches the corresponding centralized upper bounds (Xu et al., 2023).

Online Convex Optimization: For convex, exp-concave, or strongly convex sequences, UniGrad achieves regret that scales with the gradient variation $V_T$, yielding bounds of order $\sqrt{V_T}$ (convex) and $\log V_T$ (exp-concave, strongly convex), thus adapting to instance difficulty (Zhao et al., 25 Nov 2025).
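As a small illustration of the gradient-variation quantity, the sketch below evaluates it for linear losses $f_t(x) = \langle g_t, x\rangle$. It assumes the commonly used definition $V_T = \sum_{t=2}^{T} \sup_x \lVert \nabla f_t(x) - \nabla f_{t-1}(x) \rVert_2^2$, which for linear losses reduces to $\sum_{t=2}^{T} \lVert g_t - g_{t-1} \rVert_2^2$. The function name and the restriction to linear losses are assumptions made for the example.

```python
def gradient_variation(gradients):
    """Sum of squared Euclidean distances between consecutive gradients.

    gradients is a list of equal-length sequences g_1, ..., g_T, one per
    round, for linear losses f_t(x) = <g_t, x>.
    """
    total = 0.0
    for prev, cur in zip(gradients, gradients[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total
```

A stationary sequence such as `[[1.0, 0.0]] * 5` has zero variation, while a sequence that flips sign every round accumulates variation linearly in $T$; a gradient-variation bound is much smaller than the worst-case rate in the first case and offers no improvement in the second.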

4. Problem-dependent Regret in Reinforcement Learning

Problem-dependent analysis in RL provides a principled characterization of environment “difficulty” beyond worst-case parameters.

Finite-horizon MDPs: A fully problem-dependent lower bound (Tirinzoni et al., 2021) characterizes the regret via an optimization problem over exploration allocations subject to occupancy-measure constraints, which ensure the exploration measure is feasible under the environment dynamics. This LP yields exact asymptotic constants and reveals when lower bounds depend on minimal action-gaps, maximal policy-gaps, or transition bottlenecks.
Deterministic MDPs: In DMDPs, problems decompose into cycles with cycle-dependent regret lower bounds. For disjoint graphs, the bandit lower bound is matched, with each cycle contributing a term governed by an information number analogous to KL-divergence (Tranos et al., 2021). The additional navigation complexity of MDPs does not inflate the regret relative to bandits in the deterministic case.

Tail Regret (Non-asymptotic Distributional Bounds): Recent results provide tight control on the distribution of regret, with sub-Gaussian tails up to an instance-dependent threshold and sub-Weibull tails beyond it. The baseline is determined by the uniform action-value gap (Khodadadian et al., 23 Nov 2025), and the full tail probability bound separates into explicit, interpretable regimes.
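The occupancy measure that appears in the finite-horizon lower bound can be made concrete with a short forward recursion. The sketch below computes $\rho_h(s,a) = \Pr(s_h = s, a_h = a)$ for a deterministic policy in a tabular MDP; it only illustrates the object being constrained, not the lower-bound optimization of Tirinzoni et al. The data layout (`P[s][a][s2]`, `policy[h][s]`) and the function name are assumptions made for the example.

```python
def occupancy_measure(P, init, policy, horizon):
    """State-action occupancy of a deterministic policy, one layer per step.

    P[s][a][s2] : transition probability from s to s2 under action a
    init[s]     : initial state distribution
    policy[h][s]: action taken in state s at step h
    Returns occ with occ[h][s][a] = Pr(s_h = s, a_h = a).
    """
    n_states = len(init)
    n_actions = len(P[0])
    state_dist = list(init)
    occ = []
    for h in range(horizon):
        layer = [[0.0] * n_actions for _ in range(n_states)]
        next_dist = [0.0] * n_states
        for s in range(n_states):
            a = policy[h][s]
            layer[s][a] = state_dist[s]
            # Flow conservation: mass at (s, a) is pushed forward by P.
            for s2 in range(n_states):
                next_dist[s2] += state_dist[s] * P[s][a][s2]
        occ.append(layer)
        state_dist = next_dist
    return occ
```

Each layer sums to one, and the mass entering a state at step $h+1$ equals the mass pushed into it from step $h$. These flow constraints are what make an exploration allocation feasible under the dynamics.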

5. Lower and Upper Bound Methodology

The prevailing analytic approach for problem-dependent regret combines statistical lower bounds—typically variants of information-theoretic change-of-measure and Pinsker inequalities—with algorithmic upper bounds leveraging martingale concentration, KL-divergence controls, occupancy/visitation-based LPs, and fine-grained decomposition of the regret into instance-specific terms.
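For Bernoulli rewards, the information-theoretic side of this methodology can be made concrete. The sketch below computes the Bernoulli KL divergence and the constant in the Lai & Robbins asymptotic lower bound, $\liminf_{T\to\infty} \mathbb{E}[R(T)]/\ln T \ge \sum_{i:\,\Delta_i>0} \Delta_i/\mathrm{kl}(\mu_i,\mu^*)$. The function names and the clamping constant are assumptions made for the example.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp away from 0 and 1 to keep the logarithms finite
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lai_robbins_constant(means):
    """Sum over suboptimal arms of Delta_i / kl(mu_i, mu_star)."""
    best_mean = max(means)
    return sum(
        (best_mean - m) / bernoulli_kl(m, best_mean)
        for m in means
        if m < best_mean
    )
```

Pinsker's inequality gives $\mathrm{kl}(p,q) \ge 2(p-q)^2$, so each term satisfies $\Delta_i/\mathrm{kl}(\mu_i,\mu^*) \le 1/(2\Delta_i)$. This is why gap-only bounds of the form $\ln T/\Delta_i$ are a relaxation of the KL-based constant.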

In stochastic and structured bandits, the gap- and variance-aware arguments explicitly reveal how regret must scale with the gaps $\Delta_i$ and the variances $\sigma_i^2$, with martingale-based analyses delivering matching upper bounds in Thompson Sampling and UCB-type algorithms (Agrawal et al., 2012, Ito et al., 2022).

For contextual and linear bandits, variance-peeling and multi-layer construction, combined with Freedman-type concentration for martingales, yield tight matching upper and lower bounds in terms of the cumulative noise variance $\sum_{t=1}^{T}\sigma_t^2$ (He et al., 15 Mar 2025, Zhao et al., 2023). In online convex optimization, meta-algorithms over curvature discretization guarantee regret scaling with gradient variation (Zhao et al., 25 Nov 2025).

RL admits problem-dependent asymptotics via occupancy-constrained linear programs, with the complexity determined by the feasible allocation of visits to state-actions or cycles and associated KL-information rates (Tirinzoni et al., 2021, Tranos et al., 2021).

6. Data-dependent and Adversarial Extensions

Problem-dependence extends to data-dependent and adversarial settings:

Adversarial MAB: High-probability and expected regret must scale with the loss of the best arm $L_T^*$, or with the quadratic variation $Q_T$, even when these quantities are much smaller than the horizon $T$ (Gerchinovitz et al., 2016, Lee et al., 2020). No improvement is possible merely due to the existence of an always-optimal arm or small per-round loss range.
Constrained Bandits: In settings with hard stochastic constraints, regret admits a decomposition into safety-complexity and learning-complexity, both data-dependent and necessary. The bound is stated in terms of the Slater gap, a strictly feasible solution, and the optimal solution (Genalti et al., 26 May 2025).
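The two data-dependent quantities in the adversarial MAB item can be computed directly from a loss matrix. The sketch below assumes the usual definitions: the best-arm loss is the smallest cumulative loss of any single arm, and the quadratic variation is $\sum_{t} \lVert \ell_t - \bar{\ell} \rVert_2^2$ with $\bar{\ell}$ the average loss vector (the Hazan & Kale convention). Both the definitions and the function names are assumptions made for the example.

```python
def best_arm_loss(losses):
    """Smallest cumulative loss of any fixed arm; losses[t][i] in [0, 1]."""
    n_arms = len(losses[0])
    return min(sum(row[i] for row in losses) for i in range(n_arms))


def quadratic_variation(losses):
    """Sum over rounds of the squared distance to the mean loss vector."""
    horizon = len(losses)
    n_arms = len(losses[0])
    mean = [sum(row[i] for row in losses) / horizon for i in range(n_arms)]
    return sum(
        (row[i] - mean[i]) ** 2 for row in losses for i in range(n_arms)
    )
```

A loss sequence that never changes, such as `[[0.0, 1.0]] * 3`, has zero best-arm loss and zero quadratic variation, whereas one that alternates between `[0.0, 1.0]` and `[1.0, 0.0]` has both growing linearly with the number of rounds.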

7. Significance and Open Directions

Problem-dependent regret analysis is central to algorithmic optimality, practical efficiency, and instance-tuned performance guarantees:

Establishes sharp separations between easy and hard instances and clarifies the effect of gaps, variance, structural information, curvature, and communication patterns.
Informs parameter tuning (e.g., exploration parameters, variance penalties), guides exploration/exploitation design, and enables anytime/data-adaptive algorithms.
Unifies diverse settings—classical bandits, contextual, RL, adversarial, OCO—under a statistical lens via information and complexity measures.

Open questions remain on better control or elimination of additive slack in finite-time upper bounds, adaptivity to additional forms of instance difficulty (e.g., heavy tails, non-stationarity), computational/statistical tradeoffs in structured/generalized settings, and the extension of sharp problem-dependent tail bounds to broader classes of online decision processes.