Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

Published 9 May 2026 in cs.LG, cs.AI, cs.IT, math.ST, and stat.ML | (2605.09214v1)

Abstract: \emph{Kullback-Leibler} (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of \emph{reverse} or \emph{forward} KL. Recent studies have demonstrated $ε^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $ε^{-2}$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in $\tilde{O}(ε^{-2})$ slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline CBs. We give the first $\tilde{O}(ε^{-1})$ upper bounds in tabular and general function approximation settings, both under notions of \emph{single-policy concentrability}. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, manifesting the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, similarly to the reverse-KL regularization.


Authors (4)

Summary

The paper establishes that forward-KL-regularized offline contextual bandits admit fast $\tilde{O}(1/\epsilon)$ rates under single-policy concentrability.
It introduces a pessimism-based algorithm with upper bounds in both tabular and function approximation settings, together with rate-matching lower bounds in the tabular setting.
A convex-duality analysis replaces the mean-value-theorem (Taylor-type) arguments used in prior work.

Fast Rates for Forward-KL Regularized Offline Contextual Bandits under Single-Policy Concentrability

Introduction and Contextualization

The paper "Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability" (2605.09214) addresses a foundational question in the statistical analysis of KL-regularized reinforcement learning (RL) and contextual bandits (CBs): What are the fundamental data-efficiency limits for offline policy learning when using forward Kullback-Leibler (KL) divergence regularization, and what coverage conditions are sufficient to achieve optimal sample complexity?

Reverse-KL regularization has been thoroughly analyzed, with fast rates (sample complexity $\tilde{O}(1/\epsilon)$ for suboptimality gap $\epsilon$) established under relatively weak assumptions such as single-policy concentrability. In contrast, forward-KL regularization, despite its increasing prevalence in RL, RLHF, and LLM finetuning pipelines, has previously been shown to yield only slower $\tilde{O}(1/\epsilon^2)$ rates, or fast rates only under restrictive (e.g., all-policy) concentrability.

This paper explicitly addresses two open problems:

Achievability of fast $(\tilde{O}(1/\epsilon))$ rates for offline forward-KL regularized CBs: It demonstrates, for the first time, that such rates are possible both in tabular and general function approximation settings.
Sufficiency of single-policy concentrability: It proves that single-policy coverage is indeed adequate for fast rates, even for forward-KL regularization, which is a weaker and more practical assumption for real-world data collection.

Forward-KL Regularization and Statistical Setting

Consider offline contextual bandits where only an i.i.d. dataset from some behavior policy is available. Goal: Find a policy maximizing

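$J(\pi) = \mathbb{E}_{s \sim \rho}\big[\mathbb{E}_{a \sim \pi(\cdot | s)}[r(s,a)] - \eta^{-1} \mathrm{KL}(\mu(\cdot|s) \Vert \pi(\cdot|s))\big],$

where the reference policy $\mu$ is fixed and appears in the first argument of the KL term (the forward direction; reverse-KL regularization would instead use $\mathrm{KL}(\pi \Vert \mu)$), and $\eta > 0$ scales the penalty, so a larger $\eta$ means weaker regularization. The agent's access to $\mu$ may be limited, and data is only collected through $\mu$.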
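To make the objective concrete, here is a minimal single-context sketch in plain Python. It takes forward KL to mean $\mathrm{KL}(\mu \Vert \pi)$ with the reference policy in the first argument, and it assumes $\mu$ puts positive mass on every action. The function names (`forward_kl`, `objective`, `solve_forward_kl`) are illustrative and not taken from the paper. The maximizer comes from the first-order condition $r(a) + \mu(a)/(\eta\,\pi(a)) = \lambda$, with the multiplier $\lambda$ found by bisection.

```python
import math

def forward_kl(mu, pi):
    """KL(mu || pi) over a finite action set; the reference mu comes first.
    Assumes pi(a) > 0 wherever mu(a) > 0."""
    return sum(m * math.log(m / p) for m, p in zip(mu, pi) if m > 0)

def objective(pi, r, mu, eta):
    """Forward-KL-regularized value of policy pi in a single context."""
    return sum(p * x for p, x in zip(pi, r)) - forward_kl(mu, pi) / eta

def solve_forward_kl(r, mu, eta, iters=200):
    """Maximize <pi, r> - KL(mu || pi) / eta over the probability simplex.

    Stationarity gives pi(a) = mu(a) / (eta * (lam - r(a))) for a multiplier
    lam > max(r); bisection picks lam so that pi sums to one.
    Assumes mu(a) > 0 for every action (an assumption of this sketch).
    """
    r_max = max(r)
    mu_best = mu[r.index(r_max)]
    lo = r_max + mu_best / eta   # total mass is >= 1 at this lam
    hi = r_max + 1.0 / eta       # total mass is <= 1 at this lam
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        mass = sum(m / (eta * (lam - x)) for m, x in zip(mu, r))
        if mass > 1.0:
            lo = lam             # lam too small: policy has too much mass
        else:
            hi = lam
    pi = [m / (eta * (hi - x)) for m, x in zip(mu, r)]
    total = sum(pi)
    return [p / total for p in pi]   # absorb the residual rounding error

# Hypothetical three-action example.
r = [1.0, 0.5, 0.0]
mu = [0.2, 0.3, 0.5]
pi_opt = solve_forward_kl(r, mu, eta=2.0)
```

Unlike the reverse-KL case, the maximizer is not a softmax reweighting of $\mu$; it has the rational form above, which is one reason the two regularizers call for different analyses.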

The challenge: distributional shift between $\mu$ and the optimal policy $\pi^*$ leads to estimation error. Coverage assumptions formalize how this gap can be controlled: concentrability quantifies how well $\mu$ covers the policies it is compared against. There are two main notions:

All-policy concentrability: $\mu$ must cover every policy in the comparison class uniformly.
Single-policy concentrability: $\mu$ only needs to cover the optimal policy $\pi^*$.
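The gap between the two notions can be illustrated with the textbook density-ratio forms of concentrability in a tabular problem. These forms are an assumption made for illustration; the paper's exact definitions may differ, especially under function approximation. The sketch requires $\mu(a|s) > 0$ everywhere, and the function names are hypothetical.

```python
def single_policy_coeff(pi_star, mu):
    """max over (s, a) of pi_star(a|s) / mu(a|s): coverage of one target policy."""
    return max(p / m for ps, ms in zip(pi_star, mu) for p, m in zip(ps, ms))

def all_policy_coeff(mu):
    """sup over all policies of the same ratio. A deterministic policy that
    plays the least-covered action attains it, giving 1 / min mu(a|s)."""
    return 1.0 / min(m for ms in mu for m in ms)

# Two contexts, two actions; rows are contexts. Hypothetical numbers.
mu = [[0.9, 0.1], [0.5, 0.5]]
pi_star = [[1.0, 0.0], [0.5, 0.5]]

c_single = single_policy_coeff(pi_star, mu)  # about 1.11
c_all = all_policy_coeff(mu)                 # 10
```

Here the data covers the target policy well even though one action is rarely logged, so the single-policy coefficient stays small while the all-policy coefficient is driven by the rare action.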

Single-policy is strictly weaker and more typical in offline RL, but had not been shown sufficient for forward-KL in previous work.

Main Theoretical Contributions

1. Algorithms with Fast Sample Complexity Bounds

The authors propose a streamlined "pessimism-based" CB algorithm using forward-KL regularization, providing procedures for both tabular and function-approximation regimes.

Tabular Setting

Sample complexity: $\tilde{O}(1/\epsilon)$ in the target suboptimality $\epsilon$.
Only the single-policy concentrability coefficient appears in the bound, marking a strict improvement over prior work with all-policy requirements.
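The pessimism principle in the tabular case can be sketched as a lower-confidence-bound reward table built from logged data. This is a generic Hoeffding-style construction under the assumption that rewards lie in $[0, 1]$, not the paper's specific estimator or bonus; `pessimistic_rewards` and its arguments are hypothetical names.

```python
import math

def pessimistic_rewards(data, n_contexts, n_actions, delta=0.05):
    """Lower-confidence-bound reward table from logged (s, a, reward) triples.

    Rewards are assumed to lie in [0, 1]. The bonus is a Hoeffding width with a
    union bound over all (s, a) pairs; unseen pairs get the lowest value, 0.
    """
    counts = [[0] * n_actions for _ in range(n_contexts)]
    sums = [[0.0] * n_actions for _ in range(n_contexts)]
    for s, a, reward in data:
        counts[s][a] += 1
        sums[s][a] += reward
    log_term = math.log(2 * n_contexts * n_actions / delta)
    lcb = [[0.0] * n_actions for _ in range(n_contexts)]
    for s in range(n_contexts):
        for a in range(n_actions):
            n = counts[s][a]
            if n > 0:
                bonus = math.sqrt(log_term / (2 * n))
                # Subtract the bonus, then clip at the smallest possible reward.
                lcb[s][a] = max(0.0, sums[s][a] / n - bonus)
    return lcb, counts

# Hypothetical log: one well-covered action and one seen only once.
data = [(0, 0, 1.0)] * 1000 + [(0, 1, 0.5)]
lcb, counts = pessimistic_rewards(data, n_contexts=1, n_actions=2)
```

A learner would then maximize the regularized objective with the pessimistic table in place of the unknown rewards, so that poorly covered actions are never credited with more reward than the data supports.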

Function Approximation

Sample complexity: $\tilde{O}(1/\epsilon)$ in the target suboptimality $\epsilon$.
Incorporates a concentrability coefficient that captures function-class-specific coverage, together with a log covering number that reflects the complexity of the function class.
Significantly, the rate is still $\tilde{O}(1/\epsilon)$, as opposed to existing $\tilde{O}(1/\epsilon^2)$ results (e.g., "KL-Regularized RLHF with Multiple Reference Models" [aminian2025kl]).

2. Minimax Lower Bounds and Optimality

The paper establishes matching lower bounds in the tabular setting:

For tabular CBs, any estimator requires $\Omega(1/\epsilon)$ samples to guarantee $\epsilon$-optimality under forward-KL regularization, which demonstrates the tightness of the upper bounds.

Moreover, the authors detail a "phase transition": in the high-regularization regime, the $\tilde{O}(1/\epsilon)$ rate holds, but as regularization weakens ($\eta$ large), sample complexity recovers the classical $1/\epsilon^2$ dependency. This mirrors the transition observed for reverse-KL, indicating a fundamental phenomenon rather than an artifact of analysis.

3. Novel Analytical Tools

A primary technical advancement is a convex-analytical approach leveraging conjugate duality, bypassing traditional Taylor/mean-value expansion techniques that fail for forward-KL. In contrast to reverse-KL, where such tools suffice, the authors show that the lack of strong convexity for forward-KL (it is only strictly convex) prevents fast-rate, single-coverage analyses via those methods. The new approach directly exploits the pessimism principle combined with a novel suboptimality decomposition enabled by convex duality.

This decomposition sidesteps dependencies on the minimal support of the reference policy and clarifies why all-policy coverage is unnecessary for forward-KL, thus unifying the analysis for tabular and function approximation scenarios.
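One way to see why conjugate duality is a natural tool here is the following standard convex-analysis identity, written per context. It is offered as an illustration and is not necessarily the decomposition used in the paper. The notation ($\Omega$, $\Omega^*$, $\pi_r$, $\hat r$, $J_r$) is introduced only for this sketch, and $\mu$ is assumed to have full support so that the maximizer is unique.

```latex
% Assumed notation: \Omega(\pi) = \eta^{-1}\,\mathrm{KL}(\mu \,\Vert\, \pi), which is convex in \pi,
% and J_r(\pi) = \langle \pi, r \rangle - \Omega(\pi) is the regularized value under reward r.
\begin{align*}
\Omega^*(r) &:= \max_{\pi \in \Delta(\mathcal{A})} \big\{ \langle \pi, r \rangle - \Omega(\pi) \big\},
\qquad
\pi_r := \arg\max_{\pi \in \Delta(\mathcal{A})} \big\{ \langle \pi, r \rangle - \Omega(\pi) \big\} = \nabla \Omega^*(r), \\
J_r(\pi_r) - J_r(\pi_{\hat r})
&= \Omega^*(r) - \big( \langle \pi_{\hat r}, r \rangle - \Omega(\pi_{\hat r}) \big) \\
&= \Omega^*(r) - \Omega^*(\hat r) - \langle \pi_{\hat r},\, r - \hat r \rangle .
\end{align*}
% The last step uses \Omega(\pi_{\hat r}) = \langle \pi_{\hat r}, \hat r \rangle - \Omega^*(\hat r),
% i.e. equality in the Fenchel-Young inequality at the maximizer.
```

The right-hand side is the Bregman divergence of $\Omega^*$ between the true reward $r$ and the estimate $\hat r$. It is nonnegative by convexity and vanishes to first order in $r - \hat r$, which is the kind of structure that makes rates faster than $1/\epsilon^2$ plausible without a Taylor expansion of the policy itself.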

Strong Numerical and Theoretical Claims

First establishment of $\tilde{O}(1/\epsilon)$ upper bounds for forward-KL-regularized offline CBs under single-policy concentrability.
Lower bounds that match the upper bounds in terms of statistical rates.
The combination of sample complexity improvement and relaxed coverage conditions is uniquely shown here for forward-KL (previously known for reverse-KL only).
Rigorous demonstration that classic mean-value or strong convexity arguments are insufficient for the forward-KL case—necessitating the new analytical pipeline.

Implications and Future Directions

Practical Implications:

These results validate the use of forward-KL regularization with weak coverage assumptions in large-scale offline RL and RLHF settings (including modern LLMs). They justify the empirical resilience of forward-KL-based objectives when behavior policies only modestly cover the optimal policy, and suggest that data efficiency should match that of reverse-KL settings in these regimes.

Theoretical Implications:

The technical tools developed (convex conjugate-based suboptimality decomposition and self-bounding arguments) form an analytical backbone that may apply to a wider class of regularized objectives beyond the divergences analyzed so far (including settings without strong convexity). This could broaden fast-rate theory for structured offline learning problems.

Open Problems:

Proving the necessity (not just sufficiency) of single-policy concentrability for fast rates under forward-KL.
Extending the tight lower bounds for function approximation, where only upper bounds are established in this work.
Characterizing the sharp phase-transition threshold between fast and slow rates as a function of the regularization parameter $\eta$ for forward-KL regularization.

Conclusion

This work resolves key open questions in the statistical theory of offline RL with forward-KL regularization, demonstrating that single-policy concentrability suffices for fast $\tilde{O}(1/\epsilon)$ rates and providing tight minimax lower bounds. The technical innovations not only yield practical guidance for offline RLHF system designers but also pave the way for new analytical methods in regularized RL theory. Future research should focus on establishing sharp necessity results, improved lower bounds under general function approximation, and generalizations to broader regularization frameworks and online learning settings.