Finite-Horizon Offline RL
- Finite-horizon offline RL is a paradigm that learns optimal sequential decisions from fixed-length episode data while carefully managing distribution shift.
- It employs both model-free and model-based methods with pessimistic value estimation to mitigate extrapolation errors and ensure policy reliability.
- Recent advances, including hybrid approaches and temporal abstraction, improve sample efficiency and scalability for practical offline RL applications.
Finite-horizon offline reinforcement learning (RL) studies the problem of optimizing sequential decision-making policies using a pre-collected dataset in Markov decision processes (MDPs) with a fixed, finite planning horizon. Unlike classical online RL, where the algorithm actively interacts with the environment, offline methods must address acute distribution shift and extrapolation error without the benefit of online exploration, and they must do so during episodes of fixed length, a feature that crucially impacts both sample complexity and algorithmic design.
1. Problem Setting and Key Challenges
The finite-horizon offline RL setting targets MDPs specified by a tuple $(\mathcal{S}, \mathcal{A}, \{P_h\}_{h=1}^{H}, \{r_h\}_{h=1}^{H}, H)$, where $H$ is the fixed episode length. The available dataset $\mathcal{D}$ consists solely of transitions collected by behavioral policies—often not covering all the state–action pairs relevant for optimal control. This lack of full exploration creates fundamental statistical and algorithmic obstacles:
- Distributional Shift: Policy optimization is limited to the support of $\mathcal{D}$; extrapolation beyond it is prone to value overestimation and poor generalization.
- Finite-Horizon Issues: In episodic tasks, compounding error across the $H$ steps is exacerbated by limited coverage in the dataset, and changing behaviors across the episode complicate statistical estimation.
- Policy Support Constraints: Many theoretical guarantees rely on some notion of “single-policy coverage,” requiring that state–action pairs visited by the optimal policy are sufficiently represented in $\mathcal{D}$.
Representative sample complexity lower bounds for this setting are instance-dependent: the unavoidable suboptimality gap for any offline RL algorithm is at least
$$\Omega\!\left(\sum_{h=1}^{H}\sum_{s,a} d^{\pi^\star}_h(s,a)\,\sqrt{\frac{\operatorname{Var}_h(s,a)}{n\, d^{\mu}_h(s,a)}}\right),$$
where $d^{\pi^\star}_h(s,a)$ is the marginal probability under the optimal policy, $d^{\mu}_h(s,a)$ that under the behavior policy, $\operatorname{Var}_h(s,a)$ the variance of the next-step optimal value, and $n$ is the number of trajectories in the dataset (Yin et al., 2021).
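To make the expression concrete, the toy computation below (purely illustrative: the marginals and next-step value variances are randomly generated rather than taken from any cited instance) evaluates the bound for a small problem and shows how it shrinks as the number of trajectories grows.

```python
# Illustrative only: evaluate the instance-dependent lower-bound expression for a toy
# problem with randomly generated marginals and variances.
import numpy as np

H, S, A, n = 3, 2, 2, 10_000                     # horizon, states, actions, trajectories
rng = np.random.default_rng(0)
d_opt = rng.dirichlet(np.ones(S * A), size=H)    # optimal-policy marginals d*_h(s, a)
d_beh = rng.dirichlet(np.ones(S * A), size=H)    # behavior-policy marginals d^mu_h(s, a)
var = rng.uniform(0.0, 1.0, size=(H, S * A))     # next-step value variance per (h, s, a)

# Sum over steps and state-action pairs of d* * sqrt(Var / (n * d^mu)).
gap = np.sum(d_opt * np.sqrt(var / (n * d_beh)))
print(f"instance-dependent suboptimality floor ~ {gap:.4f}")
```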
2. Statistical Foundations and Minimax Sample Complexity
Recent work provides nearly-tight minimax characterizations for tabular finite-horizon settings. For example, algorithms based on double variance reduction (OPDVR) or pessimistic value iteration achieve $\widetilde{O}\!\big(H^2/(d_m \epsilon^2)\big)$ episodes for $\epsilon$-optimality, with $d_m$ the minimal marginal coverage of the behavior policy (Yin et al., 2021). This represents an $H$-factor improvement over earlier results (e.g., those based on uniform convergence or standard fitted Q-iteration), which required $\widetilde{O}\!\big(H^3/(d_m \epsilon^2)\big)$ episodes. For model-based methods, pessimistic value iteration with Bernstein-style penalties yields a sample complexity of $\widetilde{O}\!\big(H^4 S C^{\star}_{\mathrm{clipped}}/\epsilon^2\big)$, where $C^{\star}_{\mathrm{clipped}}$ is a clipped single-policy concentrability coefficient and $S$ is the number of states (Li et al., 2022).
These studies show that “pessimism”—subtracting an uncertainty penalty in value updates—enables statistically efficient, conservative policy optimization even under limited support, and the finite-horizon structure allows separation of error per timestep. Matching lower bounds, expressed via quantities such as the instance-optimal sum above (Yin et al., 2021), confirm the near-optimality of these approaches.
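As a concrete illustration of pessimism in the finite-horizon setting, the sketch below implements tabular pessimistic value iteration with a simple Hoeffding-style penalty; the cited analyses use sharper Bernstein-style penalties, and the inputs (`P_hat`, `r_hat`, visit counts `N`) are assumed to have been estimated from the offline dataset.

```python
# Minimal sketch of tabular pessimistic value iteration for a finite-horizon MDP.
# Assumes empirical quantities estimated from the offline dataset; the Hoeffding
# penalty below is a simplification of the Bernstein penalties used in the papers.
import numpy as np

def pessimistic_vi(P_hat, r_hat, N, H, c=1.0):
    """P_hat: (H, S, A, S) empirical transitions; r_hat: (H, S, A) empirical rewards;
    N: (H, S, A) visit counts. Returns a deterministic policy of shape (H, S)."""
    S, A = r_hat.shape[1], r_hat.shape[2]
    V = np.zeros(S)                                          # V_{H+1} = 0
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        bonus = c * H * np.sqrt(1.0 / np.maximum(N[h], 1))   # uncertainty penalty
        Q = r_hat[h] + P_hat[h] @ V - bonus                  # pessimistic Q_h
        Q = np.clip(Q, 0.0, H - h)                           # keep values in valid range
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi
```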
3. Algorithmic Paradigms: Model-Free, Model-Based, and Hybrid Methods
Model-Free Approaches
Model-free offline RL algorithms perform Q-learning or policy optimization directly on the data, often with regularization to keep the learned policy close to the behavior policy. Examples include Conservative Q-Learning (CQL), Batch-Constrained Q-learning (BCQ), and TD3+BC. The principle of pessimism is instantiated via lower-confidence bounds or subtractive bonuses—such as the anti-exploration bonus using prediction errors from a conditional variational autoencoder (Rezaeifar et al., 2021). Model-free approaches are generally robust to high-dimensional noise and partial observability and require little task-specific tuning, but may be less sample-efficient when high-quality models are available (Swazinna et al., 2022).
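The anti-exploration idea can be sketched as a reward-level penalty computed before standard offline Q-learning; the original method scores novelty with a conditional variational autoencoder's prediction error, whereas the sketch below substitutes a k-nearest-neighbor distance over dataset state–action pairs as a stand-in novelty measure, so the names and constants here are illustrative.

```python
# Sketch of an anti-exploration-style penalty: (s, a) pairs that look novel relative
# to the dataset receive a bonus that is *subtracted* from the reward. A k-NN distance
# stands in for the CVAE prediction error used by Rezaeifar et al. (illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def anti_exploration_bonus(dataset_sa, query_sa, k=5):
    """dataset_sa, query_sa: arrays of concatenated (state, action) features."""
    nn = NearestNeighbors(n_neighbors=k).fit(dataset_sa)
    dist, _ = nn.kneighbors(query_sa)
    return dist.mean(axis=1)                 # larger = further from the data support

def penalized_td_target(r, gamma, q_next, bonus, alpha=1.0):
    """Pessimistic TD target: subtract the scaled novelty bonus from the reward."""
    return (r - alpha * bonus) + gamma * q_next
```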
Model-Based Methods
These algorithms learn a parametric model of the transition kernel from the fixed dataset and perform planning via simulated rollouts. Notably, pessimistic model-based value iteration and ensemble-based approaches with conservative estimation dominate the state of the art when reliable models can be learned (Li et al., 2022). Novel data augmentation and adversarial rollouts (e.g., MORAL) adaptively select transitions from an ensemble based on pessimistic or adversarial principles, mitigating over-optimism and error accumulation, and removing the need for explicit finite-horizon tuning (Cao et al., 26 Mar 2025).
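Ensemble-based conservatism of this kind can be sketched as follows: short synthetic rollouts are generated under an ensemble of learned dynamics models, and rewards are penalized by ensemble disagreement, with the minimum over members serving as a crude proxy for MORAL's adversarial transition selection. The function signatures below are placeholders, not an API from the cited work.

```python
# Sketch of a pessimistic ensemble rollout: rewards of synthetic transitions are
# penalized by ensemble disagreement (a rough stand-in for adversarial selection).
import numpy as np

def pessimistic_rollout(models, policy, s0, horizon, lam=1.0, rng=None):
    """models: list of callables (s, a) -> (next_state, reward); policy: s -> a."""
    rng = rng or np.random.default_rng()
    s, transitions = np.asarray(s0, dtype=float), []
    for _ in range(horizon):
        a = policy(s)
        preds = [m(s, a) for m in models]              # ensemble predictions
        next_states = np.stack([p[0] for p in preds])
        rewards = np.array([p[1] for p in preds])
        disagreement = next_states.std(axis=0).max()   # epistemic-uncertainty proxy
        r_pess = rewards.min() - lam * disagreement    # pessimistic synthetic reward
        s_next = next_states[rng.integers(len(models))]
        transitions.append((s, a, r_pess, s_next))
        s = s_next
    return transitions
```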
Hybrid (Model-Free/Model-Based Fusion) Approaches
Hybrid algorithms attempt to combine the low bias of model-free methods with the data efficiency of model-based rollouts. However, in noisy or partially observable domains with moderate episode lengths, such as industrial benchmarks, hybrid approaches often underperform, owing both to their inability to quantify epistemic uncertainty and to error accumulation over extended synthetic rollouts (Swazinna et al., 2022).
| Approach | Statistical Efficiency | Robustness to Noise | Tuning Requirement |
|---|---|---|---|
| Model-Free | Moderate–High | High | Low |
| Model-Based | Optimal if model is good | Moderate | Medium (model selection) |
| Hybrid | Task-dependent | Often poor | High (rollout length) |
4. Practical Regimes: Data Properties, Coverage, and Policy Regularization
Empirical studies confirm that the success of finite-horizon offline RL depends vitally on data properties (Monier et al., 2020). Practitioners must assess both:
- Trajectory Quality: The presence of high-return (expert) traces, which favor methods like behavioral cloning.
- State–Action Coverage: Sufficient diversity to enable “stitching” together superior policies from suboptimal data (e.g., variants of CRR, CQL).
Methods that navigate the quality/coverage trade-off, such as regularizing the learned policy toward the data distribution via explicit penalties or predictive bonuses (anti-exploration (Rezaeifar et al., 2021), pseudometric-based lookups (Dadashi et al., 2021)), robustly prevent the algorithm from straying outside the support of $\mathcal{D}$. In finite-horizon settings, stitching and conservative behavior are particularly critical to realize performance gains over simple imitation-based (BC) baselines.
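A common instantiation of such regularization is a TD3+BC-style actor objective that maximizes the learned Q-value while penalizing deviation from dataset actions; the torch sketch below illustrates the idea, with `actor` and `critic` assumed to be ordinary modules rather than components of any specific codebase.

```python
# Sketch of a behavior-regularized actor loss in the spirit of TD3+BC: maximize Q
# while staying close to the dataset actions (module names are illustrative).
import torch

def bc_regularized_actor_loss(actor, critic, states, dataset_actions, alpha=2.5):
    pi_actions = actor(states)
    q = critic(states, pi_actions)
    lam = alpha / q.abs().mean().detach()          # scale-invariant trade-off weight
    bc_term = ((pi_actions - dataset_actions) ** 2).mean()
    return -(lam * q).mean() + bc_term             # keep the policy near the data support
```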
5. Temporal Abstraction and Hierarchical Skill Learning
Temporal abstraction—breaking tasks into a succession of skills or options—can reduce the effective planning horizon and enhance offline policy learning, especially in long-horizon or multi-stage finite tasks (Halevy et al., 2023, Qiao et al., 26 Mar 2025). Frameworks such as the Offline Skill Graph (OSG) construct a directed graph over learned low-level options (skills extracted via offline RL) and plan over this graph to solve complex tasks. When skills are represented in discrete, interpretable spaces (as in Discrete Diffusion Skill, DDS (Qiao et al., 26 Mar 2025)), hierarchical RL enables coarse-to-fine decision making: a high-level policy selects among discrete skills, each realized via a powerful low-level diffusion decoder. This division reduces complexity and error accumulation for each episode, and empirical results confirm significant gains—e.g., at least 12% improvement on AntMaze-v2—attributed to this structure.
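The coarse-to-fine structure can be sketched as a two-level controller: a high-level policy selects a discrete skill index every few steps, and a low-level decoder (a diffusion model in DDS; here an abstract callable) maps the skill and current state to primitive actions. The environment interface and component names below are assumptions for illustration.

```python
# Sketch of hierarchical execution with discrete skills: the high-level policy picks
# a skill every `skill_len` steps; the low-level decoder turns (skill, state) into a
# primitive action. All components are abstract placeholders.
def run_hierarchical_episode(env, high_policy, skill_decoder, H, skill_len=10):
    s, total_return, skill = env.reset(), 0.0, None
    for t in range(H):
        if t % skill_len == 0:            # re-select the skill on a coarse timescale
            skill = high_policy(s)        # discrete skill index
        a = skill_decoder(skill, s)       # low-level action decoding
        s, r, done = env.step(a)
        total_return += r
        if done:
            break
    return total_return
```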
6. Dealing with Distribution Shift and Multi-Source Heterogeneity
Several works address cross-domain and multi-agent data aggregation. For example, HetPEVI (Shi et al., 2023) extends pessimistic value iteration to settings where datasets are collected from perturbed MDPs. The error is controlled by source and sample uncertainties: a per-source sampling-error term, plus a term that penalizes state–action pairs covered by only a few sources. Theoretical results establish that collective coverage across sources is sufficient for near-optimality, which is further underscored in federated RL settings (Woo et al., 8 Feb 2024), where linear speedup in sample complexity per agent is achievable as long as the union of all local datasets covers the support of the optimal policy.
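One plausible way to combine the two uncertainty terms is a single per-step pessimism bonus, as in the sketch below; the constants and exact functional form are assumptions for illustration and do not reproduce HetPEVI's notation.

```python
# Sketch of a combined pessimism bonus for multi-source offline data: one term shrinks
# with per-source sample counts, the other penalizes (s, a) pairs covered by few
# sources. Constants and exact form are illustrative.
import numpy as np

def multi_source_bonus(counts, H, c_sample=1.0, c_source=1.0):
    """counts: (K, S, A) visit counts from K heterogeneous sources at a single step h."""
    per_source = c_sample * H / np.sqrt(np.maximum(counts, 1))   # sample uncertainty
    sample_term = per_source.mean(axis=0)                        # average over sources
    n_covering = (counts > 0).sum(axis=0)                        # sources covering (s, a)
    source_term = c_source * H * np.sqrt(1.0 / np.maximum(n_covering, 1))
    return sample_term + source_term                             # shape (S, A)
```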
7. Function Approximation and Scalability
Scaling finite-horizon offline RL to high-dimensional, continuous spaces requires robust function approximation with explicit control of estimation error. A prominent line of work analyzes the sample complexity of fitted Q-iteration with deep ReLU networks under broad regularity (Besov dynamic closure) and correlated regression targets, obtaining near-optimal rates in terms of the effective horizon $H$, the state–action dimension $d$, the smoothness parameter $\alpha$, and the degree of distribution shift (Nguyen-Tang et al., 2021). Advanced analysis frameworks leverage LP formulations with error-bound induced constraint relaxations, achieving $O(\epsilon^{-2})$ sample complexity under general function approximation and only partial data coverage (Ozdaglar et al., 2022). Such frameworks employ explicit occupancy-measure (density-ratio) variables and dual constraints to ensure that induced policies are valid and robust, even under strong distribution shift or poor coverage.
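For intuition, the exact tabular occupancy-measure linear program underlying such formulations can be stated as follows; the cited framework works with density-ratio reparameterizations and relaxed, error-bound induced constraints rather than this idealized program.

$$
\max_{d \ge 0} \; \sum_{h=1}^{H}\sum_{s,a} d_h(s,a)\, r_h(s,a)
\quad \text{s.t.} \quad
\sum_{a} d_1(s,a) = \rho(s) \;\; \forall s, \qquad
\sum_{a} d_{h+1}(s',a) = \sum_{s,a} P_h(s' \mid s,a)\, d_h(s,a) \;\; \forall s',\, h,
$$

where $\rho$ is the initial-state distribution and the induced policy is $\pi_h(a \mid s) \propto d_h(s,a)$.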
Finite-horizon offline RL is thus characterized by statistical lower bounds driven by data coverage and variance; a spectrum of algorithmic strategies leveraging pessimism, regularization, and temporal abstraction; principled approaches for federated or multi-source data; and rigorous analysis for practical scaling. Across these dimensions, recent theoretical and empirical results demonstrate both the challenges and the efficacy of conservative, data-aware learning under finite episode constraints.