Implicitly Normalized Forecaster (INF)

Updated 7 November 2025
  • The Implicitly Normalized Forecaster (INF) is an online learning algorithm for multi-armed bandits that uses a non-exponential potential function to implicitly normalize action probabilities and achieve minimax-optimal regret.
  • INF-clip, a variant of INF, clips heavy-tailed rewards instead of truncating them, preserving all data points for improved robustness and data efficiency.
  • The framework extends to non-linear and gradient-free bandit optimization, offering practical advantages in adversarial scenarios and challenging small-gap conditions between arms.

The Implicitly Normalized Forecaster (INF) is a strategy for online learning and multi-armed bandit (MAB) settings that achieves minimax optimal regret rates under a variety of feedback and statistical assumptions. INF and its variants, in particular the clipped version ("INF-clip"), address both adversarial and stochastic environments, including scenarios with heavy-tailed reward distributions. Recent developments have generalized INF beyond bounded and sub-Gaussian reward models, extending its applicability to robustly handle extreme outcomes without discarding relevant data.

1. Definition and Foundational Concepts

The Implicitly Normalized Forecaster (INF) operates as an online algorithm to select actions in MAB or combinatorial online optimization tasks. It maintains an implicit normalization of action probabilities by utilizing a potential function other than the classical exponential, enabling tighter control over regret bounds compared to Exp-weighted strategies. For classic adversarial MAB with bounded rewards, INF achieves minimax regret up to constant factors.

The INF-clip algorithm extends the foundational INF to cases with heavy-tailed reward distributions. Unlike previous approaches (HTINF, AdaTINF) that use truncation (discarding extreme rewards), INF-clip employs clipping, preserving all observations but capping their magnitude to reduce tail sensitivity.

2. Algorithmic Framework and Mathematical Formulation

INF algorithms are instantiated within the Online Mirror Descent (OMD) paradigm, leveraging a regularizer derived from the potential function. For INF-clip in heavy-tailed MAB:

  1. Heavy-Tailed Assumption:
    • Rewards $X_i$ for each arm $i$ satisfy $\E[|X_i|^{1+\alpha}] \leq M^{1+\alpha}$, with $\alpha \in (0, 1]$.
  2. Clipped Importance-Weighted Estimator:

$\hat{g}_{t,i} = \begin{cases} \frac{\mathrm{clip}(g_{t,i}, \lambda)}{x_{t,i}} & \text{if } i = A_t \\ 0 & \text{otherwise} \end{cases}$

  • Here, $g_{t,i}$ is the observed reward for arm $i$ at round $t$, $x_{t,i}$ is its selection probability, and $\lambda$ is the clipping threshold.
  3. OMD Update with Tsallis Entropy Regularization:

$x_{t+1} = \arg\min_{x \in \triangle_n} \left[ \mu x^\top \hat{g}_t + B_{\psi_q}(x, x_t) \right]$

  • $B_{\psi_q}$ denotes the Bregman divergence induced by the Tsallis entropy $\psi_q(x) = \frac{1}{1-q}\left(1 - \sum_{i=1}^n x_i^q\right)$.
  4. Regret Bound (Linear Case):

$\frac{1}{T} \E[\mathcal{R}_T(u)] = O\left( M n^{\frac{\alpha}{1+\alpha}} T^{-\frac{\alpha}{1+\alpha}} \right)$

  • This matches established lower bounds for linear heavy-tailed bandits.
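
As a concrete illustration of steps 2 and 3, the sketch below shows one possible Python implementation of the clipped importance-weighted estimator and the Tsallis-entropy OMD step, assuming $q = 1/2$ and a bisection search for the normalizing multiplier. The function names (`clip`, `omd_tsallis_step`, `inf_clip`), the sign convention of negating rewards into losses before the minimization, and the choice of step size and clipping level are illustrative assumptions, not the authors' reference code.

```python
import numpy as np

def clip(value, lam):
    """Symmetric clipping: cap |value| at lam while keeping its sign."""
    return np.sign(value) * min(abs(value), lam)

def tsallis_grad(x, q):
    """Gradient of the Tsallis regularizer psi_q(x) = (1 - sum_i x_i^q) / (1 - q)."""
    return -(q / (1.0 - q)) * x ** (q - 1.0)

def omd_tsallis_step(x_prev, g_hat, mu, q=0.5, iters=80):
    """One OMD step: argmin over the simplex of  mu*<x, g_hat> + B_{psi_q}(x, x_prev).

    The optimality condition reads grad psi(x_next) = grad psi(x_prev) - mu*g_hat + z,
    where the scalar z enforces sum(x_next) = 1 (the "implicit normalization");
    z is located by bisection.
    """
    theta = tsallis_grad(x_prev, q) - mu * g_hat

    def x_of(z):
        # Invert grad psi componentwise; valid as long as theta + z < 0.
        return ((1.0 - q) / q * -(theta + z)) ** (-1.0 / (1.0 - q))

    z_hi = np.min(-theta) - 1e-12          # sum(x_of(z)) blows up as z -> min(-theta)
    z_lo = z_hi - 1.0
    while x_of(z_lo).sum() > 1.0:          # widen the bracket until the sum drops below 1
        z_lo = z_hi - 2.0 * (z_hi - z_lo)
    for _ in range(iters):                 # bisection on the normalizer z
        z_mid = 0.5 * (z_lo + z_hi)
        if x_of(z_mid).sum() > 1.0:
            z_hi = z_mid
        else:
            z_lo = z_mid
    x = x_of(0.5 * (z_lo + z_hi))
    return x / x.sum()                     # absorb residual numerical error

def inf_clip(sample_reward, n_arms, T, mu, lam, q=0.5, seed=None):
    """INF-clip loop: draw an arm, clip its reward, build the importance-weighted
    estimator, and take one OMD step per round."""
    rng = np.random.default_rng(seed)
    x = np.full(n_arms, 1.0 / n_arms)
    for _ in range(T):
        arm = rng.choice(n_arms, p=x)
        reward = sample_reward(arm)
        g_hat = np.zeros(n_arms)
        # The OMD step minimizes, so the clipped reward is negated into a loss
        # (a sign convention chosen for this sketch).
        g_hat[arm] = -clip(reward, lam) / x[arm]
        x = omd_tsallis_step(x, g_hat, mu, q=q)
    return x
```

The bisection over the scalar multiplier is what makes the normalization "implicit": the simplex constraint is enforced through a dual variable rather than by an explicit softmax-style division, which is exactly where INF departs from exponential-weights methods.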

3. Theoretical Guarantees and Optimality

INF-clip achieves the minimax optimal regret rate for stochastic MAB with heavy-tailed rewards without relying on restrictive reward bounds or sub-Gaussianity. The proven rate is $\frac{1}{T}\E \left[\mathcal{R}_T (u) \right] = O\left( M n^{\frac{\alpha}{1+\alpha}} T^{-\frac{\alpha}{1+\alpha}} \right)$ with $n$ arms and tail index $\alpha$. Convergence is established under moment conditions ($\E[|X_i|^{1+\alpha}] \leq M^{1+\alpha}$) rather than classical boundedness. This theoretical result applies in the linear setting and, with generalizations, to non-linear bandit problems under convexity and Lipschitz or smoothness assumptions.

4. Extension to Non-Linear and Gradient-Free Bandit Optimization

INF-clip adapts to non-linear stochastic MAB, where the feedback/losses depend on more general convex or Lipschitz functions, by employing gradient-free estimators. Specifically, a zeroth-order oracle setting uses one-point bandit feedback with random directions and clipped estimators: $\frac{1}{T} \E[\mathcal{R}_T] \leq 4M\tau + \Delta \frac{\sqrt{n}}{\tau} \mathcal{D}_\psi + O\left(\frac{1}{T^{\frac{\alpha}{1+\alpha}}}\right)$. Here, $M$ is a Lipschitz constant, $\tau$ a smoothing parameter, $\Delta$ bounds adversarial noise, and $\mathcal{D}_\psi$ is a geometric factor from the regularization.
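
The one-point estimator itself admits a compact sketch. The snippet below is an illustration under stated assumptions (uniform random direction on the sphere, clipping applied to the single observed function value) and is not claimed to be the paper's exact construction:

```python
import numpy as np

def one_point_clipped_grad(f, x, tau, lam, rng=None):
    """One-point zeroth-order gradient surrogate with clipping (a sketch).

    Queries the (possibly heavy-tailed, noisy) objective once at a randomly
    perturbed point and returns (n / tau) * clip(f(x + tau * e), lam) * e,
    where e is uniform on the unit sphere and lam caps the magnitude of the
    single observation before it becomes a gradient direction.
    """
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    e = rng.normal(size=n)
    e /= np.linalg.norm(e)                    # uniform direction on the sphere
    value = f(x + tau * e)                    # single bandit-feedback query
    clipped = np.sign(value) * min(abs(value), lam)
    return (n / tau) * clipped * e
```

The smoothing radius $\tau$ trades off bias (the $4M\tau$ term) against the variance of the surrogate, mirroring the structure of the bound above.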

5. Comparison with Competing Algorithms

| Algorithm | Outlier Handling | Regret Rate (Heavy-Tailed) | Data Efficiency |
|---|---|---|---|
| HTINF / AdaTINF | Truncation | Suboptimal when means are close, due to wasted data | Lower (discards data) |
| Robust UCB | Truncation | Comparable in easy regimes, worse in hard regimes | Lower (discards data) |
| Best-of-Both-Worlds | Truncation / Exp-weights | Good overall, but suffers in hard, outlier-dominated regimes | Lower (discards data) |
| INF-clip | Clipping | Minimax optimal, especially with close means | Higher (keeps all data) |

INF-clip outperforms previous robust algorithms (HTINF, AdaTINF, Robust UCB, APE) and best-of-both-worlds approaches, notably in regimes with small gaps between arms and rewards driven by rare, extreme outliers. Truncation-based methods irretrievably discard high-magnitude, informative observations, reducing learning efficacy when distinguishing arms relies on such samples.
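
The operational difference between the two outlier-handling strategies is minor in code but consequential statistically. In the illustration below (with truncation modeled as zeroing out-of-range observations, one common convention), truncation erases the single extreme sample entirely, while clipping retains its sign and a capped magnitude:

```python
import numpy as np

lam = 10.0
samples = np.array([0.3, -0.8, 42.0, 1.1])   # one rare but informative outlier

truncated = np.where(np.abs(samples) <= lam, samples, 0.0)
clipped = np.sign(samples) * np.minimum(np.abs(samples), lam)

print(truncated)  # [ 0.3 -0.8  0.   1.1]  -> the 42.0 observation is lost
print(clipped)    # [ 0.3 -0.8 10.   1.1]  -> its sign and (capped) size survive
```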

6. Assumptions, Limitations, and Applicability

INF-clip relies on moment bounds rather than boundedness:

  • For rewards: $\E[|X_i|^{1+\alpha}] < \infty$.
  • For nonlinear losses: convexity or Lipschitzness, and bounded adversarial noise ($\Delta$).

These mild assumptions generalize applicability to practical stochastic environments with heavy-tailed reward models, avoiding constraints of sub-Gaussianity. For nonlinear bandits, explicit constants in regret bounds depend upon problem smoothness, geometry, and tail parameters.
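
As a simple numerical illustration of what the moment condition admits (an example constructed here, not taken from the source), a Pareto distribution with tail index 1.5 has infinite variance and is far from sub-Gaussian, yet its $(1+\alpha)$-th absolute moment is finite for any $\alpha < 0.5$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.4
samples = rng.pareto(1.5, size=1_000_000) + 1.0    # Pareto law with tail index 1.5

print(np.mean(np.abs(samples) ** (1 + alpha)))     # finite empirical (1+alpha)-moment
print(np.var(samples))                             # unstable across seeds: true variance is infinite
```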

The algorithm's superiority is most pronounced as $\alpha \rightarrow 0$ (very heavy tails). For larger $\alpha$, the relative advantage over truncation methods is reduced.

7. Empirical Results and Observed Behavior

Empirical evaluation (see Figs. 1–4 in (Dorn et al., 2023)) demonstrates that INF-clip achieves lower regret and faster identification of the optimal arm compared to robust UCB-style and best-of-both-worlds algorithms, especially in hard regimes. When differentiation between arms depends on rare outlier rewards, INF-clip's retention of all data allows more efficient learning. As a plausible implication, clipping rather than truncation should be preferred when the arms' mean rewards are nearly identical and the reward distribution is heavy-tailed.


INF and INF-clip collectively form a framework for designing online learning strategies that robustly handle adversarial and stochastic scenarios, including those with heavy-tailed and non-linear reward structures. The main innovation of INF-clip—clipping instead of truncation—improves data efficiency and minimax optimality under minimal statistical assumptions. These results establish INF-clip as an optimal and practically relevant approach for learning in heavy-tailed bandit environments, with empirical and theoretical superiority over pre-existing methods. For further details, including precise algorithmic steps and proofs, see (Dorn et al., 2023).

References (1)
