Problem-Dependent Regret Bounds Overview

Updated 12 May 2026
  • Refined regret bounds leverage instance-specific parameters such as suboptimality gaps and variance to achieve logarithmic scaling in many bandit settings.
  • Methodologies employ martingale concentration, KL divergence, and occupancy measures to derive matching upper and lower bounds across stochastic, contextual, and reinforcement learning models.
  • These insights inform adaptive algorithm design by aligning exploration strategies with the statistical difficulty of the instance, ensuring near-optimal performance.

Problem-dependent regret bounds precisely characterize the performance of online learning algorithms by quantifying how regret scales with intrinsic features of the problem instance, such as suboptimality gaps, noise variances, model structure, or other data-dependent quantities, rather than solely with respect to worst-case measures like time horizon or number of actions. These bounds underpin algorithmic optimality in multi-armed bandits, contextual bandits, reinforcement learning, online convex optimization, and adversarial bandit frameworks by reflecting the statistical or structural “difficulty” of a given instance, often matching instance-specific lower bounds up to constant or logarithmic factors.

1. Classical Problem-dependent Regret in Stochastic Bandits

The prototypical setting for problem-dependent regret is the stochastic $N$-armed bandit. Here, each arm $i$ has an unknown mean reward $\mu_i$, and the suboptimality gap $\Delta_i = \mu^* - \mu_i$ (where $\mu^* = \max_i \mu_i$) governs the exploration–exploitation tradeoff.
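To make the role of the gaps concrete, the standard regret decomposition can be written as below. The symbols $I_t$ (the arm pulled in round $t$) and $N_i(T)$ (the number of pulls of arm $i$ up to round $T$) are notation assumed here for illustration and do not appear in the original text.

```latex
R(T) \;=\; T\mu^* - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{I_t}\Big]
     \;=\; \sum_{i:\,\Delta_i>0} \Delta_i \, \mathbb{E}\big[N_i(T)\big]
```

Under this identity, a gap-dependent bound amounts to showing that each suboptimal arm is pulled only logarithmically often in $T$.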

The celebrated result of Agrawal & Goyal shows that Thompson Sampling with $\operatorname{Beta}(1,1)$ priors achieves the regret bound $R(T)\le (1+\varepsilon)\sum_{i:\;\Delta_i>0}\frac{\ln T}{\Delta_i} + O\left(\frac{N}{\varepsilon^2}\right)$ for any $\varepsilon\in(0,1]$, with all constants explicit and finite-time optimality up to a $(1+\varepsilon)$ multiplicative factor (Agrawal et al., 2012). The proof employs martingale-based techniques and concentration bounds to control the frequency and duration of pulling suboptimal arms, leading to logarithmic time scaling per arm and an additive $O(N/\varepsilon^2)$ uniform slack.

This finite-time guarantee complements the asymptotic lower bound of Lai & Robbins and generalizes to broader bounded-reward or contextual settings via analogous arguments.
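The following is a minimal simulation sketch of Thompson Sampling with Beta(1,1) priors on a Bernoulli bandit, together with a helper that evaluates the leading gap-dependent term of the bound. The function names, the choice of Bernoulli rewards, and the use of pseudo-regret are assumptions made for illustration; the sketch is not taken from the cited paper.

```python
import math
import random


def thompson_sampling_regret(means, horizon, seed=0):
    """Simulate Bernoulli Thompson Sampling with Beta(1,1) priors.

    Returns the cumulative pseudo-regret sum_t (mu* - mu_{I_t}).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms  # number of observed 1-rewards per arm
    failures = [0] * n_arms   # number of observed 0-rewards per arm
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior; Beta(1,1) is the prior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        regret += best - means[arm]
    return regret


def gap_dependent_leading_term(means, horizon):
    """Leading term sum_{i: Delta_i > 0} ln(T) / Delta_i of the bound."""
    best = max(means)
    return sum(math.log(horizon) / (best - m) for m in means if m < best)
```

Calling `thompson_sampling_regret([0.9, 0.5, 0.4], 2000)` and comparing it with `gap_dependent_leading_term([0.9, 0.5, 0.4], 2000)` gives a rough feel for how the simulated regret relates to the logarithmic term; a single run is noisy, so averaging over seeds would be needed for any serious comparison.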

2. Variance- and Gap-dependent Regret: Modern Directions

Beyond classical gap-dependent bounds, recent research emphasizes variance-dependent regret, further refining problem-adaptivity:

  • Stochastic Bandits: For arms with variance ii0, the optimal regret is ii1, with lower bounds matching the ii2 scaling up to constants (Ito et al., 2022). This reflects that lower-variance arms can be eliminated faster, reducing exploration cost.
  • Linear Contextual Bandits: The minimax regret ii3 can be sharpened to ii4 for heteroscedastic noise sequences, with tight corresponding lower bounds ii5 for both prefixed and adaptively chosen variance cases (He et al., 15 Mar 2025, Zhao et al., 2023).
  • Reinforcement Learning: In finite-horizon episodic MDPs, regret can be bounded by the “environmental norm”—the variance of the next-state value functions under optimality: ii6 where ii7 is the maximum one-step conditional variance over all ii8 under the true transition and reward, strictly improving standard ii9 scaling when μi\mu_i0 (Zanette et al., 2019).
  • Adversarial Bandits: First- and second-order instance quantities such as the cumulative loss of the best arm μi\mu_i1, or the quadratic variation μi\mu_i2, allow for refined (small-loss or low-variation) high-probability and expectation bounds, with tight lower bounds μi\mu_i3, μi\mu_i4 (Gerchinovitz et al., 2016, Lee et al., 2020).
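One way to see why low-variance arms are cheaper to explore is to compare a variance-aware exploration bonus with a variance-agnostic one. The sketch below uses an empirical-Bernstein style bonus in the spirit of UCB-V as a stand-in; it is not the algorithm of the papers cited above, and the constants (2 and 3), the `reward_range` parameter, and the function names are illustrative assumptions.

```python
import math


def bernstein_bonus(variance, pulls, t, reward_range=1.0):
    """Empirical-Bernstein style bonus: shrinks with the arm's variance."""
    if pulls <= 0:
        return math.inf  # an unpulled arm is maximally uncertain
    log_t = math.log(max(t, 2))
    return (math.sqrt(2.0 * variance * log_t / pulls)
            + 3.0 * reward_range * log_t / pulls)


def hoeffding_bonus(pulls, t, reward_range=1.0):
    """Hoeffding-style bonus: depends only on the reward range."""
    if pulls <= 0:
        return math.inf
    return reward_range * math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
```

For an arm with small empirical variance and many pulls, the Bernstein-style width is dominated by the fast-decaying `log_t / pulls` term, so the arm can be ruled out after fewer pulls than a range-based bonus would require.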

3. Structured/Generalized Problem-dependent Bounds

In structured or generalized bandit models, such as structured finite-armed bandits or online convex optimization with curvature, the structure itself gives rise to sharper regret scaling:

  • Structured Bandits: In finite-armed bandits with dependencies among arm means via a shared latent parameter, the UCB-S algorithm achieves

μi\mu_i5

when the finite-regret condition fails, matching the lower bound up to constants (Lattimore et al., 2014).

  • Multi-agent Bandits: In decentralized bandits with a connected communication graph, the instance-dependent lower bound is

μi\mu_i6

matching corresponding centralized upper bounds (Xu et al., 2023).

  • Online Convex Optimization: For convex, exp-concave, or strongly convex sequences, UniGrad achieves regret that scales with the gradient variation μi\mu_i7, yielding μi\mu_i8 (convex), μi\mu_i9 (exp-concave, strongly convex), thus adapting to instance difficulty (Zhao et al., 25 Nov 2025).

4. Problem-dependent Regret in Reinforcement Learning

Problem-dependent analysis in RL provides a principled characterization of environment “difficulty” beyond worst-case parameters.

  • Finite-horizon MDPs: A fully problem-dependent lower bound (Tirinzoni et al., 2021) characterizes the regret via an occupancy-measure-constrained optimization: Δi=μμi\Delta_i = \mu^* - \mu_i0 where Δi=μμi\Delta_i = \mu^* - \mu_i1, and the occupancy constraints ensure the exploration measure is feasible under environment dynamics. This LP yields exact asymptotic constants and reveals when lower bounds depend on minimal action-gaps, maximal policy-gaps, or transition bottlenecks.
  • Deterministic MDPs: In DMDPs, problems decompose into cycles with cycle-dependent regret lower bounds. For disjoint graphs, the bandit lower bound is matched: for each cycle Δi=μμi\Delta_i = \mu^* - \mu_i2

Δi=μμi\Delta_i = \mu^* - \mu_i3

where Δi=μμi\Delta_i = \mu^* - \mu_i4 is an information number analogous to KL-divergence (Tranos et al., 2021). The additional navigation complexity of MDPs does not inflate the regret relative to bandits in the deterministic case.

  • Tail Regret (Non-asymptotic Distributional Bounds): Recent results provide tight control on the distribution of regret—sub-Gaussian tails up to an instance-dependent threshold, sub-Weibull beyond—with the baseline determined by the global Δi=μμi\Delta_i = \mu^* - \mu_i5-gap and scaling as

Δi=μμi\Delta_i = \mu^* - \mu_i6

where Δi=μμi\Delta_i = \mu^* - \mu_i7 is the uniform action-value gap (Khodadadian et al., 23 Nov 2025). The full tail probability bound reads

Δi=μμi\Delta_i = \mu^* - \mu_i8

with explicit regimes and interpretability.

5. Lower and Upper Bound Methodology

The prevailing analytic approach for problem-dependent regret combines statistical lower bounds—typically variants of information-theoretic change-of-measure and Pinsker inequalities—with algorithmic upper bounds leveraging martingale concentration, KL-divergence controls, occupancy/visitation-based LPs, and fine-grained decomposition of the regret into instance-specific terms.
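As a concrete instance of the change-of-measure style of lower bound, the sketch below computes the Lai and Robbins asymptotic constant for Bernoulli arms, $\sum_{i:\Delta_i>0} \Delta_i / \mathrm{kl}(\mu_i, \mu^*)$, which multiplies $\ln T$ in the lower bound. Restricting to Bernoulli rewards, the clamping constant `eps`, and the function names are assumptions made for illustration.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means):
    """Sum over suboptimal arms of Delta_i / kl(mu_i, mu*)."""
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)
```

Pinsker's inequality gives $\mathrm{kl}(p, q) \ge 2(p-q)^2$, so this constant is at most $\sum_{i:\Delta_i>0} 1/(2\Delta_i)$, which is how KL-based bounds connect to the simpler gap-based form.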

In stochastic and structured bandits, the gap- and variance-aware arguments explicitly reveal how regret must scale with the gaps $\Delta_i$ and the arm variances $\sigma_i^2$, with martingale-based analyses delivering matching upper bounds in Thompson Sampling and UCB-type algorithms (Agrawal et al., 2012, Ito et al., 2022).

For contextual and linear bandits, variance-peeling and multi-layer construction, combined with Freedman-type concentration for martingales, yield tight matching upper and lower bounds in terms of μ=maxiμi\mu^* = \max_i \mu_i1 (He et al., 15 Mar 2025, Zhao et al., 2023). In online convex optimization, meta-algorithms over curvature discretization guarantee regret scaling with gradient variation (Zhao et al., 25 Nov 2025).

RL admits problem-dependent asymptotics via occupancy-constrained linear programs, with the complexity determined by the feasible allocation of visits to state-actions or cycles and associated KL-information rates (Tirinzoni et al., 2021, Tranos et al., 2021).

6. Data-dependent and Adversarial Extensions

Problem-dependence extends to data-dependent and adversarial settings:

  • Adversarial MAB: High-probability and expected regret must scale as μ=maxiμi\mu^* = \max_i \mu_i2 where μ=maxiμi\mu^* = \max_i \mu_i3 is the loss of the best arm, or μ=maxiμi\mu^* = \max_i \mu_i4 for quadratic variation μ=maxiμi\mu^* = \max_i \mu_i5, even when these quantities are much less than μ=maxiμi\mu^* = \max_i \mu_i6 (Gerchinovitz et al., 2016, Lee et al., 2020). No improvement is possible merely due to the existence of an always-optimal arm or small per-round loss range.
  • Constrained Bandits: In settings with hard stochastic constraints, regret admits a decomposition into safety-complexity and learning-complexity, both data-dependent and necessary: μ=maxiμi\mu^* = \max_i \mu_i7 where μ=maxiμi\mu^* = \max_i \mu_i8 is the Slater gap, μ=maxiμi\mu^* = \max_i \mu_i9 strictly feasible, and Beta(1,1)\operatorname{Beta}(1,1)0 optimal (Genalti et al., 26 May 2025).
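The two data-dependent quantities named in the adversarial bullet above can be computed directly from a loss table. The sketch below assumes `losses[t][i]` holds the loss of arm `i` in round `t`, and it assumes the common definition of quadratic variation as the summed squared deviation of each round's loss vector from the time-averaged loss vector; both the data layout and that definition are assumptions, since the original formulas are not available here.

```python
def best_arm_cumulative_loss(losses):
    """Cumulative loss of the best fixed arm: min_i sum_t losses[t][i]."""
    n_arms = len(losses[0])
    return min(sum(row[i] for row in losses) for i in range(n_arms))


def quadratic_variation(losses):
    """Sum over rounds of the squared Euclidean distance between the
    round's loss vector and the time-averaged loss vector."""
    n_rounds = len(losses)
    n_arms = len(losses[0])
    mean = [sum(row[i] for row in losses) / n_rounds for i in range(n_arms)]
    return sum((row[i] - mean[i]) ** 2
               for row in losses for i in range(n_arms))
```

A loss sequence that never changes has zero quadratic variation, and a sequence in which one arm always incurs zero loss has zero best-arm loss; these are the regimes where small-loss and low-variation bounds are most informative.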

7. Significance and Open Directions

Problem-dependent regret analysis is central to algorithmic optimality, practical efficiency, and instance-tuned performance guarantees:

  • Establishes sharp separations between easy and hard instances, clarifies the effect of gaps, variance, structural information, curvature, and communication patterns.
  • Informs parameter tuning (e.g., Beta(1,1)\operatorname{Beta}(1,1)1, variance penalties), guides exploration/exploitation design, and enables anytime/data-adaptive algorithms.
  • Unifies diverse settings—classical bandits, contextual, RL, adversarial, OCO—under a statistical lens via information and complexity measures.

Open questions remain on better control or elimination of additive slack in finite-time upper bounds, adaptivity to additional forms of instance difficulty (e.g., heavy tails, non-stationarity), computational/statistical tradeoffs in structured/generalized settings, and the extension of sharp problem-dependent tail bounds to broader classes of online decision processes.
