Statistical and Algorithmic Foundations of Reinforcement Learning (2507.14444v1)

Published 19 Jul 2025 in stat.ML, cs.AI, cs.LG, math.OC, math.ST, and stat.TH

Abstract: As a paradigm for sequential decision making in unknown environments, reinforcement learning (RL) has received a flurry of attention in recent years. However, the explosion of model complexity in emerging applications and the presence of nonconvexity exacerbate the challenge of achieving efficient RL in sample-starved situations, where data collection is expensive, time-consuming, or even high-stakes (e.g., in clinical trials, autonomous systems, and online advertising). How to understand and enhance the sample and computational efficacies of RL algorithms is thus of great interest. In this tutorial, we aim to introduce several important algorithmic and theoretical developments in RL, highlighting the connections between new ideas and classical topics. Employing Markov Decision Processes as the central mathematical model, we cover several distinctive RL scenarios (i.e., RL with a simulator, online RL, offline RL, robust RL, and RL with human feedback), and present several mainstream RL approaches (i.e., model-based approach, value-based approach, and policy optimization). Our discussions gravitate around the issues of sample complexity, computational efficiency, as well as algorithm-dependent and information-theoretic lower bounds from a non-asymptotic viewpoint.

Summary

  • The paper presents rigorous non-asymptotic analyses of RL’s statistical sample complexity and optimality bounds across various settings.
  • It compares model-based, model-free, and policy optimization methods, quantifying trade-offs in convergence rates and horizon dependencies.
  • Implications for online, offline, robust, and human feedback RL illustrate practical limits and avenues for scalable, uncertainty-aware algorithms.

Statistical and Algorithmic Foundations of Reinforcement Learning: A Technical Overview

This tutorial provides a comprehensive and rigorous treatment of the statistical and algorithmic underpinnings of reinforcement learning (RL), with a focus on sample complexity, computational efficiency, and information-theoretic lower bounds in both classical and modern RL settings. The discussion is organized around Markov Decision Processes (MDPs) and covers a spectrum of RL paradigms, including RL with simulators (generative models), online RL, offline RL, robust RL, and RL with human feedback. The analysis is non-asymptotic and emphasizes minimax optimality, algorithmic trade-offs, and the interplay between statistical and computational considerations.

Markov Decision Processes and Dynamic Programming

The tutorial begins by formalizing the MDP framework, detailing both discounted infinite-horizon and finite-horizon settings. The Bellman optimality operator and its contraction properties are established, providing the foundation for classical dynamic programming algorithms such as policy iteration and value iteration. The linear convergence of these algorithms is quantified, with iteration complexity scaling linearly in the effective horizon $1/(1-\gamma)$.
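As a concrete reference point, the following is a minimal sketch of tabular value iteration for a discounted MDP with a known model; the array shapes, stopping rule, and variable names are illustrative assumptions rather than the tutorial's pseudocode.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iters=100_000):
    """P: transition kernel of shape [S, A, S]; r: rewards of shape [S, A]."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iters):
        # Bellman optimality operator: Q(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]
        Q = r + gamma * (P @ V)            # shape [S, A]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)              # greedy policy w.r.t. the final Q
    return V, policy
```

Because the Bellman operator is a $\gamma$-contraction in the sup-norm, each sweep shrinks the error by a factor of $\gamma$, which is the source of the $1/(1-\gamma)$ iteration complexity quoted above.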

RL with a Generative Model: Sample Complexity and Algorithmic Optimality

Model-Based Approaches

Model-based RL algorithms estimate the transition kernel and reward function from samples, construct an empirical MDP, and solve for the optimal policy via dynamic programming. The tutorial presents both standard and perturbed variants, the latter facilitating sharper theoretical analysis. The main result is that model-based algorithms achieve minimax-optimal sample complexity (up to logarithmic factors) for the full range of target accuracies:

$$N = \widetilde{O}\left(\frac{SA}{(1-\gamma)^3 \varepsilon^2}\right)$$

per state-action pair suffices for $\varepsilon$-optimality, matching the information-theoretic lower bound. Notably, this holds for all $\varepsilon \in (0, 1/(1-\gamma)]$, eliminating burn-in costs.
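To make the plug-in recipe concrete, the sketch below draws N samples per state-action pair from a hypothetical generative-model callable `simulator(s, a)`, forms the empirical transition kernel, and hands it to a planner such as the value iteration routine sketched earlier; the interface and sampling loop are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def estimate_empirical_mdp(simulator, S, A, N):
    """Draw N next-state samples per (s, a) and return the empirical transition kernel."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):               # N i.i.d. draws from P(. | s, a)
                s_next = simulator(s, a)     # hypothetical generative-model call
                P_hat[s, a, s_next] += 1.0
    return P_hat / N                         # empirical transition probabilities

# Plug-in planning on the empirical MDP (rewards r assumed known here):
# P_hat = estimate_empirical_mdp(simulator, S, A, N)
# V_hat, pi_hat = value_iteration(P_hat, r, gamma)
```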

Model-Free Approaches

Q-learning, as a canonical model-free algorithm, is analyzed in the synchronous setting. The sharpest known non-asymptotic bounds are presented:

$$N = \widetilde{O}\left(\frac{SA}{(1-\gamma)^4 \varepsilon^2}\right)$$

is sufficient for $\ell_\infty$-accuracy. However, a matching algorithm-dependent lower bound demonstrates that this scaling is unimprovable for vanilla Q-learning, revealing a sub-optimality gap of $1/(1-\gamma)$ compared to model-based methods. Variance-reduced Q-learning variants can close this gap, but standard Q-learning remains sub-optimal in horizon dependence. In the special case $A=1$ (TD learning), minimax optimality is achieved.
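The synchronous update itself is only a few lines; the sketch below uses a rescaled linear step size, which is one standard choice and an assumption here rather than the exact schedule behind the stated bound.

```python
import numpy as np

def synchronous_q_learning(simulator, r, gamma, S, A, T):
    """Synchronous Q-learning: one fresh generative-model sample per (s, a) each iteration."""
    Q = np.zeros((S, A))
    for t in range(1, T + 1):
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)          # rescaled linear step size (assumed)
        Q_next = Q.copy()
        for s in range(S):
            for a in range(A):
                s_prime = simulator(s, a)              # fresh sample of the next state
                td_target = r[s, a] + gamma * Q[s_prime].max()
                Q_next[s, a] = (1.0 - eta) * Q[s, a] + eta * td_target
        Q = Q_next
    return Q
```

The extra $1/(1-\gamma)$ factor relative to the model-based bound reflects the variance of these stochastic temporal-difference targets, which variance-reduced variants are designed to suppress.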

Online RL: Regret Minimization and Optimism

In the online episodic setting, the agent interacts with the environment via trajectories, and the goal is to minimize cumulative regret. The minimax lower bound is

$$\Omega\left(\sqrt{SAH^3K}\right)$$

for $K$ episodes of horizon $H$. The UCBVI algorithm achieves this rate asymptotically but suffers from a large burn-in cost. The MVP (Monotonic Value Propagation) algorithm eliminates this burn-in via epoch-based updates and monotonic Bernstein-style bonuses, achieving minimax-optimal regret for all $K$:

$$\mathsf{Regret}(K) \lesssim \sqrt{SAH^3K} \cdot \mathrm{polylog}(SAHK)$$

This translates to a PAC sample complexity of $\widetilde{O}(SAH^3/\varepsilon^2)$ for $\varepsilon$-optimality, which is unimprovable even with a generative model.
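The optimism principle behind UCBVI and MVP can be illustrated with bonus-augmented backward induction. The Bernstein-style bonus constants below are schematic, and MVP's epoch-based doubling schedule is omitted, so this is a sketch of the idea rather than the algorithm that attains the stated regret.

```python
import numpy as np

def optimistic_backward_induction(P_hat, r, counts, H, delta=0.01):
    """P_hat: empirical kernel [S, A, S]; r: rewards [S, A]; counts: visit counts [S, A]."""
    S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))                   # terminal value V[H] = 0
    Q = np.zeros((H, S, A))
    log_term = np.log(S * A * H / delta)
    n = np.maximum(counts, 1)
    for h in reversed(range(H)):
        mean_next = P_hat @ V[h + 1]                                        # [S, A]
        var_next = np.maximum(P_hat @ (V[h + 1] ** 2) - mean_next ** 2, 0.0)
        bonus = np.sqrt(var_next * log_term / n) + H * log_term / n         # Bernstein-style bonus
        Q[h] = np.minimum(r + mean_next + bonus, H)                         # optimistic, clipped at H
        V[h] = Q[h].max(axis=1)
    policy = Q.argmax(axis=2)                  # greedy policy, shape [H, S]
    return policy, V
```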

Offline RL: Pessimism, Concentrability, and Optimality

Offline RL is addressed via the principle of pessimism in the face of uncertainty, operationalized through lower confidence bounds in value estimation. The key metric is the single-policy concentrability coefficient $C^\star$, quantifying the mismatch between the optimal policy's occupancy and the data distribution. The VI-LCB (Value Iteration with Lower Confidence Bounds) algorithm achieves

$$N = \widetilde{O}\left(\frac{C^\star S}{(1-\gamma)^3 \varepsilon^2}\right)$$

sample complexity for $\varepsilon$-optimality, matching the minimax lower bound. This result holds for the entire accuracy range and does not require uniform data coverage, only sufficient coverage along the optimal policy's trajectory.
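Pessimism is the mirror image of the optimism sketch above: a data-driven lower confidence bound is subtracted from every backup. The sketch below uses a Hoeffding-style penalty for readability; VI-LCB's guarantee relies on sharper Bernstein-style penalties, so treat the constants as placeholders.

```python
import numpy as np

def vi_lcb(P_hat, r, counts, gamma, num_iters=2000, delta=0.01):
    """Pessimistic value iteration on the empirical MDP built from offline data."""
    S, A, _ = P_hat.shape
    n = np.maximum(counts, 1)
    penalty = np.sqrt(np.log(S * A / delta) / n) / (1.0 - gamma)   # LCB penalty b(s, a)
    V = np.zeros(S)
    for _ in range(num_iters):
        # Lower-confidence-bound Bellman backup, clipped below at 0
        Q = np.maximum(r + gamma * (P_hat @ V) - penalty, 0.0)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V
```

The penalty is largest where the dataset is thin, so the learned policy is steered toward regions the behavior data actually covers, which is exactly why only coverage of the optimal policy's trajectory is needed.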

Policy Optimization: Gradient Methods and Regularization

Policy optimization is formalized as direct maximization of the value function over parameterized policies. The tutorial rigorously analyzes projected policy gradient, softmax policy gradient, and natural policy gradient (NPG) methods. Projected PG achieves global convergence with iteration complexity scaling as $O\big(SA/((1-\gamma)^6\varepsilon^2)\big)$, while softmax PG can suffer from exponential slowdowns in certain MDPs. NPG achieves a dimension-free sublinear rate of

$$O\left(\frac{1}{(1-\gamma)^2 T}\right)$$

for the sub-optimality gap, and with entropy regularization it enjoys linear (geometric) convergence, with further acceleration possible via adaptive learning rates or additional regularization.
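In the tabular softmax case, one entropy-regularized NPG iteration reduces to a multiplicative-weights update on the regularized Q-function. The sketch below assumes exact evaluation of that Q-function (passed in as `Q_tau`); the exponents reflect the commonly stated form of this update, and the step size and regularization strength are illustrative rather than the tuned choices from the analysis.

```python
import numpy as np

def npg_entropy_step(pi, Q_tau, eta, tau, gamma):
    """One entropy-regularized NPG update; pi and Q_tau both have shape [S, A]."""
    # Geometric averaging of the current policy with a softmax of the regularized Q-function
    log_pi_new = (1.0 - eta * tau / (1.0 - gamma)) * np.log(pi) + eta * Q_tau / (1.0 - gamma)
    log_pi_new -= log_pi_new.max(axis=1, keepdims=True)      # for numerical stability
    pi_new = np.exp(log_pi_new)
    return pi_new / pi_new.sum(axis=1, keepdims=True)        # renormalize per state
```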

Distributionally Robust RL

Distributionally robust MDPs (RMDPs) are introduced to address sim-to-real gaps and model uncertainty. The robust Bellman operator is defined, and strong duality is leveraged for efficient computation. For total variation (TV) uncertainty sets, the sample complexity for robust policy learning is

$$\widetilde{O}\left(\frac{SA}{(1-\gamma)^2 \max\{1-\gamma, \sigma\}\, \varepsilon^2}\right)$$

where $\sigma$ is the uncertainty radius. This can be strictly smaller than the standard MDP sample complexity for large $\sigma$, but for other divergences (e.g., $\chi^2$), the problem can be strictly harder.
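For the TV uncertainty set, the inner minimization in the robust Bellman backup can be solved in closed form per state-action pair. The sketch below computes it directly, by greedily shifting at most $\sigma$ probability mass from the highest-value next states onto the lowest-value state; this is a direct (primal) computation equivalent to, but simpler than, the dual program emphasized in the tutorial, and the shapes and uniform radius are illustrative assumptions.

```python
import numpy as np

def tv_worst_case_expectation(p0, V, sigma):
    """min_{p : TV(p, p0) <= sigma} <p, V>, where p0 is the nominal next-state distribution."""
    s_min = int(np.argmin(V))
    p = p0.copy()
    budget = sigma
    for s in np.argsort(V)[::-1]:            # drain mass from the highest-value states first
        if s == s_min or budget <= 0.0:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[s_min] += moved                    # place it on the lowest-value state
        budget -= moved
    return p @ V

def robust_bellman_backup(P0, r, V, gamma, sigma):
    """One robust value-iteration step under TV uncertainty of radius sigma."""
    S, A, _ = P0.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            Q[s, a] = r[s, a] + gamma * tv_worst_case_expectation(P0[s, a], V, sigma)
    return Q.max(axis=1)
```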

RL with Human Feedback (RLHF)

The RLHF pipeline is formalized as a two-stage process: reward modeling from human preference data (modeled via the Bradley-Terry model) and RL fine-tuning with KL regularization. The DPO (Direct Preference Optimization) formulation is derived, and the VPO (Value-incentivized Preference Optimization) framework is introduced to address reward uncertainty and calibration. VPO regularizes the reward MLE with the optimal value, yielding regret and sample complexity guarantees matching those of standard contextual bandits and offline RL, under linear function approximation and sufficient data coverage.
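The DPO derivation referenced here folds the KL-regularized RL problem into a single Bradley-Terry negative log-likelihood on the implicit reward $\beta \log(\pi_\theta/\pi_{\mathrm{ref}})$. A minimal sketch of that loss, assuming the per-response sequence log-probabilities under the current policy and the frozen reference model are already available as arrays:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Batch DPO loss; each input is an array of sequence log-probabilities, shape [batch]."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for the chosen and rejected responses
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Bradley-Terry negative log-likelihood that the chosen response is preferred:
    # -log sigmoid(margin) = log(1 + exp(-margin))
    return np.mean(np.logaddexp(0.0, -margin))
```

VPO, by contrast, keeps an explicit reward model but regularizes its maximum-likelihood estimate toward (online) or away from (offline) high optimal values, which plays the role of optimism or pessimism without constructing explicit confidence sets.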

Implications and Future Directions

The tutorial demonstrates that, under tabular assumptions and with access to appropriate sampling modalities, minimax-optimal sample and regret bounds are achievable for a wide range of RL settings. However, these results often rely on idealized assumptions (e.g., tabular representations, access to simulators, or sufficient data coverage). In large-scale or function-approximation regimes, practical algorithms such as DQN, PPO, and CQL are often preferred despite weaker theoretical guarantees.

Several open directions are highlighted:

  • Hybrid RL: Combining online and offline data sources for improved sample efficiency.
  • Federated RL: Leveraging distributed data collection and optimization.
  • Uncertainty Quantification: Developing valid confidence intervals and distributional guarantees for value functions.
  • Function Approximation: Extending minimax-optimality and sample efficiency results to general function classes, including neural networks.

Summary Table: Sample Complexity and Regret Bounds

| Setting | Algorithm/Class | Sample/Regret Bound | Optimality |
|---|---|---|---|
| Generative model RL | Model-based | $\widetilde{O}\big(SA/((1-\gamma)^3\varepsilon^2)\big)$ | Minimax-optimal |
| Generative model RL | Q-learning | $\widetilde{O}\big(SA/((1-\gamma)^4\varepsilon^2)\big)$ | Sub-optimal |
| Online RL (episodic) | MVP | $\widetilde{O}(SAH^3/\varepsilon^2)$ (PAC) | Minimax-optimal |
| Offline RL | VI-LCB | $\widetilde{O}\big(C^\star S/((1-\gamma)^3\varepsilon^2)\big)$ | Minimax-optimal |
| Policy optimization | NPG (entropy reg.) | $O\big(1/((1-\gamma)^2 T)\big)$ (sub-opt. gap) | Linear convergence |
| Robust RL (TV) | DRVI | $\widetilde{O}\big(SA/((1-\gamma)^2\max\{1-\gamma,\sigma\}\varepsilon^2)\big)$ | Minimax-optimal (TV) |
| RLHF (online/offline) | VPO | $\widetilde{O}(\sqrt{T})$ (regret), $\widetilde{O}(1/\sqrt{N})$ (offline) | Minimax-optimal |

Theoretical and Practical Implications

The results in this tutorial establish that, for tabular RL, the fundamental trade-offs between sample complexity, statistical accuracy, and computational efficiency are now well-understood, with tight upper and lower bounds in most settings. The principles of optimism (for exploration) and pessimism (for distribution shift) are central to algorithmic design. In practice, however, the gap between theory and large-scale RL remains significant, motivating ongoing research in scalable, robust, and uncertainty-aware RL algorithms.

Future developments in AI will likely focus on bridging this gap, extending minimax-optimality to function approximation, integrating hybrid and federated data modalities, and developing principled uncertainty quantification for RL in high-dimensional and non-tabular settings.
