Sequential Reinforcement Learning
- Sequential reinforcement learning is a framework that extends MDP/POMDP models to tackle tasks presented in a fixed sequence with unique dynamics and reward structures.
- It leverages techniques such as reward relabeling, selective replay, and importance sampling to efficiently transfer and reuse experiences across sequential tasks.
- Hierarchical and blockwise architectures enable robust handling of partially observable and decentralized environments, improving sample efficiency and performance.
Sequential reinforcement learning (RL) concerns algorithms and theoretical frameworks explicitly designed to handle sequentially structured environments, data streams, or task presentations, where temporality or data availability fundamentally departs from joint or simultaneous access assumptions. This encompasses, among other areas, lifelong multi-task RL with sequential task arrival, RL with explicit modeling of rich information structures and dependencies (often partial observability, limited-memory, or decentralized settings), hierarchical RL for sequential compositionality of primitives, and RL-driven optimization for sequential experimental design. In practical applications and theory alike, the key challenge is to enable agents to efficiently leverage temporally ordered access to environment dynamics and data—frequently with strong constraints on memory, transfer, or re-use—while retaining sample efficiency, performance, and robustness across tasks that must be learned (or solved) in a fixed or unknown sequence.
1. Formal Problem Definitions and Sequential Setting
Sequential RL extends the standard Markov decision process (MDP) or partially observable MDP (POMDP) framework to scenarios where the agent's environment and learning protocol are temporally ordered beyond stepwise decision-making. The prototypical lifelong or sequential multi-task RL problem, as formalized in "Lifelong Robotic Reinforcement Learning by Retaining Experiences" (Xie et al., 2021), involves a sequence of tasks $\mathcal{T}_1, \ldots, \mathcal{T}_K$, with each task $\mathcal{T}_i$ specified as a (possibly continuous) MDP over a shared state and action space but with its own transition density $p_i(s' \mid s, a)$ and reward $r_i(s, a)$. Critically, access to environment interactions for each task is staged: at iteration $i$, the agent interacts only with $\mathcal{T}_i$ and cannot revisit earlier tasks in the environment, imposing strict causality and prohibiting round-robin or joint data collection.
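The staged protocol can be made concrete with a minimal sketch. The `Task`/`collect` interfaces below are illustrative placeholders (not the cited paper's code), and the scalar toy dynamics exist only to make the example runnable: tasks arrive in a fixed order, the agent interacts with the current task only, and earlier environments are never re-entered.

```python
# Minimal sketch of strictly sequential, per-task data collection.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float, float]  # (s, a, r, s') for a scalar toy MDP

@dataclass
class Task:
    dynamics: Callable[[float, float], float]  # task-specific transition p_i
    reward: Callable[[float, float], float]    # task-specific reward r_i

def collect(task: Task, n_steps: int) -> List[Transition]:
    """One pass of interaction with the *current* task only."""
    s, data = 0.0, []
    for _ in range(n_steps):
        a = 0.1                                 # stand-in for the current policy's action
        s_next = task.dynamics(s, a)
        data.append((s, a, task.reward(s, a), s_next))
        s = s_next
    return data

# Buffers from earlier tasks are retained, but their environments are never revisited.
tasks = [Task(lambda s, a: s + a, lambda s, a: -abs(s)),
         Task(lambda s, a: 0.9 * s + a, lambda s, a: -(s - 1.0) ** 2)]
retained = [collect(t, n_steps=100) for t in tasks]
print([len(buf) for buf in retained])  # [100, 100]
```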
In more general sequential decision frameworks such as POST/POSG (Altabaa et al., 1 Mar 2024), the environment is explicitly modeled via a directed acyclic graph (DAG) over system variables $\{x_h\}$ and action variables $\{a_h\}$, with variable-specific information sets $\mathbb{I}_h$ indicating which past variables can influence, or be observed by, the decision at step $h$. This generalizes the MDP family to encompass arbitrary causal dependencies, observability patterns, and memory structures, including mean-field, limited-memory, or decentralized agent protocols.
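As a small illustration of this bookkeeping (not code from the cited paper), each decision step can be associated with the set of past variables it observes; the "effective rank" used in Section 3 is then the size of the largest such set after reduction to a minimal separator. The example structure below is hypothetical and skips the separator computation itself.

```python
# Representing variable-specific information sets and their maximum size.
from typing import Dict, FrozenSet

# Hypothetical information structure for a 3-step, memoryless (limited-memory) POMDP:
# the action at step h observes only the most recent observation.
information_sets: Dict[int, FrozenSet[str]] = {
    1: frozenset({"o1"}),
    2: frozenset({"o2"}),
    3: frozenset({"o3"}),
}

def effective_rank(info_sets: Dict[int, FrozenSet[str]]) -> int:
    """max_h |I_h|: small values indicate a statistically tractable structure."""
    return max(len(s) for s in info_sets.values())

print(effective_rank(information_sets))  # -> 1 for this memoryless example
```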
In these settings, sequentiality is not merely an implementation detail but is reflected in the fundamental constraints of information, data availability, or task ordering, often with implications for tractability, sample complexity, and algorithmic structure.
2. Core Methodologies: Retention, Transfer, and Sequential Policy Optimization
Algorithms for sequential RL are characterized by mechanisms for memory, retention, and transfer across temporally staggered tasks or blocks while maintaining efficiency and robustness.
Retaining Past Experience with Reward Relabeling: When facing a new task $\mathcal{T}_i$ in a known sequential ordering, it is possible (provided the reward is known in closed form) to relabel all past transitions from tasks $\mathcal{T}_1, \ldots, \mathcal{T}_{i-1}$ using the current task's reward function $r_i$. This constructs a replay buffer of historical interactions recast as if they were sampled under the current reward, even if the system dynamics differ. Off-policy RL algorithms such as Soft Actor-Critic (SAC) can then be pre-trained on this relabeled buffer before any new data is collected on $\mathcal{T}_i$, initializing policy and value networks closer to optimality for the incoming task (Xie et al., 2021).
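A hedged sketch of the relabeling step follows, assuming transitions are stored as (s, a, r, s') tuples and that the new task's reward is available as a function of (s, a, s'); the names are illustrative rather than the authors' API.

```python
# Reward relabeling: recast retained transitions under the incoming task's reward.
from typing import Callable, Iterable, List, Tuple

Transition = Tuple[object, object, float, object]  # (s, a, r, s')

def relabel(past_buffers: Iterable[List[Transition]],
            new_reward: Callable[[object, object, object], float]) -> List[Transition]:
    """Replace every stored reward with the new task's reward, keeping (s, a, s') fixed."""
    relabeled: List[Transition] = []
    for buffer in past_buffers:
        for (s, a, _old_r, s_next) in buffer:
            relabeled.append((s, a, new_reward(s, a, s_next), s_next))
    return relabeled

# An off-policy learner (e.g., SAC) can be pre-trained on relabel(...) before any
# interaction with the new task; mismatched dynamics are handled separately by the
# classifier-based filtering described next.
```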
Selective Replay and Importance Filtering: Since arbitrary mixing of off-task transitions may introduce bias due to mismatched dynamics, domain-classifier-based importance filtering is employed. A binary classifier $D$ is trained to discriminate transitions originating from the current task ("target") vs. prior tasks ("source"). The ratio $D/(1-D)$, an empirical estimate of the likelihood ratio between target and source dynamics, serves to filter replayed data, retaining only transitions sufficiently similar (as measured by a fixed threshold) to the new task's dynamics. This process is iteratively refined as more on-task data is collected, with the share of replayed past data decaying toward zero (i.e., batches shift to 100% new data) as task-specific experience accumulates.
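A minimal sketch of this filter, assuming transitions can be featurized as fixed-length vectors and using scikit-learn's LogisticRegression as a stand-in for whatever discriminator the original work employs:

```python
# Domain-classifier importance filtering of replayed source transitions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def filter_source_transitions(source_x: np.ndarray,   # features of past-task transitions
                              target_x: np.ndarray,   # features of new-task transitions
                              threshold: float = 0.5) -> np.ndarray:
    """Keep source transitions whose estimated likelihood ratio D/(1-D) exceeds `threshold`."""
    X = np.vstack([source_x, target_x])
    y = np.concatenate([np.zeros(len(source_x)), np.ones(len(target_x))])  # 1 = target task
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_target = clf.predict_proba(source_x)[:, 1]
    ratio = p_target / np.clip(1.0 - p_target, 1e-6, None)
    return source_x[ratio > threshold]

# Toy example with Gaussian features: source transitions close to the target
# distribution are retained, dissimilar ones are discarded.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(500, 4))
tgt = rng.normal(0.5, 1.0, size=(500, 4))
kept = filter_source_transitions(src, tgt)
print(f"kept {len(kept)} of {len(src)} source transitions")
```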
Sequential Learning in Partially Observable and Structured Environments: In cases where state information is partial and temporal context is essential, architectures such as those introduced in "Blockwise Sequential Model Learning for Partially Observable Reinforcement Learning" (Park et al., 2021) segment agent experience into fixed-length blocks, perform intra-block representation learning (e.g., via self-attention), propagate latent states across blocks (e.g., with a blockwise GRU/LSTM), and construct per-step inputs for downstream policy optimization from the latent and contextual information, with off-policy RL applied at the control level.
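The following PyTorch sketch conveys the blockwise idea only: self-attention inside each fixed-length block and a block-level GRU carrying a latent summary across blocks. The layer sizes and the mean-pooling summary are simplifying assumptions, not the exact architecture of Park et al. (2021), which uses top-k selection and a separate per-step filtering RNN.

```python
# Blockwise sequential encoder: intra-block self-attention + cross-block GRU.
import torch
import torch.nn as nn

class BlockwiseEncoder(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64, block_len: int = 8):
        super().__init__()
        self.block_len = block_len
        self.embed = nn.Linear(obs_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.block_rnn = nn.GRUCell(hidden, hidden)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        """obs_seq: (batch, T, obs_dim) with T a multiple of block_len."""
        b, t, _ = obs_seq.shape
        h = torch.zeros(b, self.attn.embed_dim, device=obs_seq.device)
        latents = []
        for start in range(0, t, self.block_len):
            block = self.embed(obs_seq[:, start:start + self.block_len])  # (b, L, H)
            attended, _ = self.attn(block, block, block)                  # intra-block attention
            summary = attended.mean(dim=1)                                # block summary (pooling is an assumption)
            h = self.block_rnn(summary, h)                                # propagate latent across blocks
            latents.append(h)
        return torch.stack(latents, dim=1)  # (b, num_blocks, H) features for the policy

# Usage: BlockwiseEncoder(obs_dim=10)(torch.randn(2, 32, 10)).shape -> (2, 4, 64)
```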
Hierarchical and Parameterized Sequential RL: In physically instantiated sequential tasks, hierarchical frameworks select and parameterize temporally extended primitives (skills), potentially with dynamic mechanisms for online adaptation (such as adaptive impedance modulation in robotic contact tasks (Tahmaz et al., 27 Aug 2025)).
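At the level of control flow, the hierarchical scheme amounts to a loop in which the high-level policy picks a primitive and its parameters, and the low-level controller runs that primitive to termination before the next decision. The skill library, policy, and toy dynamics below are hypothetical placeholders used only to make the loop concrete.

```python
# Hierarchical execution loop over parameterized, temporally extended primitives.
from typing import Callable, Dict, Tuple

# Hypothetical skill library: each primitive maps (state, parameter) to
# (next_state, accumulated_reward) after running to completion.
SKILLS: Dict[str, Callable[[float, float], Tuple[float, float]]] = {
    "approach": lambda s, p: (s + p, -abs(s + p)),
    "press":    lambda s, p: (s * 0.5, -abs(s) * p),
}

def high_level_policy(state: float) -> Tuple[str, float]:
    """Stand-in for a learned policy over (primitive, parameter) pairs."""
    return ("approach", 0.2) if state < 1.0 else ("press", 0.5)

state, ret = 0.0, 0.0
for _ in range(5):                       # five high-level decisions per episode
    name, param = high_level_policy(state)
    state, reward = SKILLS[name](state, param)
    ret += reward
print(round(ret, 3))
```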
The table below summarizes primary sequential RL methodology axes:
| Methodology | Application Domain | Key Mechanisms |
|---|---|---|
| Replay & Relabel | Lifelong multi-task RL | Off-policy pre-training, reward relabeling, classifier-based importance sampling |
| Blockwise Modeling | POMDPs, sequential scenes | Self-attention within blocks, blockwise RNNs, latent variable propagation |
| Hierarchical RL | Sequential robotics | Primitive selection, adaptive control, affordance coupling |
| Explicit Info Structure | Teams/games, PSR learning | DAG-based dependency modeling, sample complexity via separator size |
3. Sample Complexity, Theoretical Bounds, and Algorithmic Guarantees
The statistical and computational efficiency of sequential RL algorithms is highly sensitive to memory, transfer, and information structure.
Lifelong RL with Experience Retention: In (Xie et al., 2021), simulation on the ROBEL D’Claw benchmark demonstrates that sequential replay and relabeling achieves ~82% final success after 50K steps per task, compared to baseline ~85% final success with 100K steps per task—a halving of the environment interactions required for similar performance. On a physical robot (Franka Emika Panda), 10 sequential manipulation tasks trained with the method achieve nearly 2× reduction in average end-distance to goal compared to training from scratch with identical budgets, substantiating strong forward transfer and sample efficiency.
Information Structure and Separator-Driven Bounds: In (Altabaa et al., 1 Mar 2024), the statistical hardness of learning a sequential system is linked to a graph-theoretic measure: the size of the minimal $d$-separator in the DAG representation (with edges into action variables removed). The main sample complexity theorem establishes that the effective rank $r = \max_h |\mathbb{I}_h^\dagger|$, the size of the largest such separator set, governs the number of episodes required, with the bound scaling polynomially in the cardinalities of the variables in these sets. This unifies previous results by showing tractability whenever the minimal separator is small, recovering the classical UCRL sample complexity for MDPs and polynomial bounds for weakly-revealing POMDPs.
A plausible implication is that explicit modeling of information structure (via POST/POSG) allows identification and exploitation of new tractable subclasses beyond Markov chain models, including mean-field, limited-memory, or communication-constrained formulations.
4. Model Architectures and Learning Algorithms
Sequential RL implementations are typically structured around off-policy or on-policy RL (SAC, PPO) with auxiliary mechanisms for temporal abstraction, retention, and transfer.
Model-Free Off-Policy RL with Importance Sampling: As detailed in (Xie et al., 2021), the approach maintains task-specific SAC actors and critics with task-specific replay buffers. Pre-training uses all relabeled historical data, while online learning phases interleave new and filtered past samples, with regular classifier retraining to prevent distributional drift.
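The interleaving of new and filtered past samples can be expressed as a simple batch-mixing schedule. The linear decay and sampling scheme below are assumptions for illustration, not the schedule reported in the paper.

```python
# Batch mixing: start mostly from filtered past data, decay to 100% new-task data.
import random
from typing import List, Sequence

def mixed_batch(new_data: Sequence, past_data: Sequence,
                batch_size: int, new_steps: int, decay_steps: int) -> List:
    frac_new = min(1.0, new_steps / decay_steps)   # fraction of new-task samples, grows to 1
    n_new = min(int(round(frac_new * batch_size)), len(new_data))
    n_past = batch_size - n_new
    batch = random.choices(list(new_data), k=n_new) if n_new else []
    batch += random.choices(list(past_data), k=n_past) if n_past else []
    return batch

# After 2k of 20k decay steps, roughly 10% of each batch comes from the new task.
batch = mixed_batch(["new"] * 100, ["old"] * 1000, 256, 2_000, 20_000)
print(batch.count("new"))  # -> 26 of 256 samples
```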
Blockwise and Attention-Based Sequential Modeling: For partially observable settings (Park et al., 2021), an episode is divided into fixed-length blocks, within which self-attention is applied to model intra-block temporal dependencies. A latent summary vector produced by aggregating the most salient positions (top- selection) is propagated via a blockwise RNN; per-step RL features are then constructed via a separate RNN conditioned on the previous block's latent. Training of the generative model and policy proceeds via self-normalized importance sampling (SNIS) to estimate the gradient of the log-likelihood and reinforce stable learning.
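For reference, self-normalized importance sampling in its generic, textbook form (not the exact objective of the cited paper) normalizes the importance weights so the estimator remains bounded even under an imperfect proposal:

```python
# Self-normalized importance sampling (SNIS) estimator.
import numpy as np

def snis_estimate(values: np.ndarray, log_p: np.ndarray, log_q: np.ndarray) -> float:
    """E_p[f(x)] estimated from samples x ~ q, with self-normalized weights."""
    log_w = log_p - log_q
    log_w -= log_w.max()                 # stabilize before exponentiating
    w = np.exp(log_w)
    return float(np.sum(w * values) / np.sum(w))

# Toy check: estimate E[x] under p = N(1, 1) from samples drawn from q = N(0, 1).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50_000)
log_q = -0.5 * x**2
log_p = -0.5 * (x - 1.0)**2
print(snis_estimate(x, log_p, log_q))    # close to 1.0
```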
Hierarchical, Parameterized RL with Adaptive Control: In the IMP-HRL framework (Tahmaz et al., 27 Aug 2025), a two-level parameterized action MDP is employed. High-level policies select among learned primitives, which are then parameterized with target poses and stiffness settings. A closed-loop, adaptive impedance controller modulates compliance during execution, with primitive selection and parameter output trained using off-policy SAC and affine reward coupling to explore compliant, safe action spaces.
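A rough PyTorch sketch of a two-level parameterized-action policy head is shown below: a categorical choice over primitives plus a continuous parameter vector (e.g., target pose and stiffness) for the chosen primitive. The shared torso, dimensions, and greedy selection are assumptions for clarity; the cited work trains such a head with off-policy SAC and stochastic outputs.

```python
# Parameterized-action policy head: discrete primitive + continuous parameters.
import torch
import torch.nn as nn

class ParameterizedPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_primitives: int, param_dim: int, hidden: int = 128):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.primitive_logits = nn.Linear(hidden, n_primitives)
        self.param_head = nn.Linear(hidden, n_primitives * param_dim)
        self.n_primitives, self.param_dim = n_primitives, param_dim

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        logits = self.primitive_logits(h)                 # which primitive to run
        params = torch.tanh(self.param_head(h))           # bounded pose/stiffness parameters
        params = params.view(-1, self.n_primitives, self.param_dim)
        k = logits.argmax(dim=-1)                         # greedy selection for illustration only
        chosen = params[torch.arange(len(k)), k]          # parameters of the chosen primitive
        return k, chosen

# Usage: ParameterizedPolicy(obs_dim=20, n_primitives=4, param_dim=7)(torch.randn(2, 20))
```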
5. Applications and Empirical Results
Sequential RL methods have demonstrated compelling results in a range of tasks:
- Robotic Skill Acquisition: Sequential, sample-efficient skill acquisition for physically instantiated robots—e.g., the Franka Emika Panda achieving near-perfect manipulation on ten compound tasks, including insertion and capping, each requiring only 10K interactions due to retained and relabeled experience (Xie et al., 2021).
- Long-Horizon Sequential Manipulation: Hierarchical frameworks leveraging sequential composition and online adaptation achieve high task success, improved sample efficiency, and reduced contact forces in manipulation benchmarks (block lifting, door opening, surface cleaning); real-world hardware demonstrates sim-to-real transfer without additional fine-tuning (Tahmaz et al., 27 Aug 2025).
- Sequential Scene Generation and Experiment Design: Sequential RL enables generation of complex structured data (e.g., scenes, levels) and sequential experimental designs with substantial improvements in sample efficiency and plausibility, as evidenced in indoor layout and level design tasks (Ostonov et al., 2022, Shen et al., 2023).
6. Strengths, Limitations, and Open Challenges
Strengths of sequential RL frameworks include:
- Data efficiency via transfer: Repurposing of past experience and pre-training can reduce the interactions required for each new task by roughly half or more in practical settings (Xie et al., 2021).
- Adaptivity and compositionality: Hierarchical and blockwise models adapt to multi-scale temporal dependencies while maintaining performance and interpretability (Tahmaz et al., 27 Aug 2025, Park et al., 2021).
- Principled sample complexity analysis: Explicit representations of information structure enable theoretical bounds and clear identification of efficient subclasses (Altabaa et al., 1 Mar 2024).
Limitations include:
- Unbounded memory growth: Replay buffer size in lifelong RL grows with the number of sequential tasks under the assumption of unlimited storage (Xie et al., 2021).
- Dependence on explicit reward access: Data relabeling for transfer requires that the reward function for every future task be accessible in closed form (Xie et al., 2021).
- Potential classifier failures: Misestimation or overfitting of domain classifiers may introduce bias or hinder transfer for highly dissimilar tasks (Xie et al., 2021).
- Restriction to kinematic or "toy" settings: Many benchmarks do not extend to vision-based or contact-rich real-world domains, which may introduce additional challenges (Xie et al., 2021, Tahmaz et al., 27 Aug 2025).
A plausible implication is that future sequential RL research will require efficient buffer management, extension of transfer mechanisms to settings without closed-form reward access, reliable classifier calibration, and direct handling of rich sensory input.
7. Future Directions and Integration with Broader RL Frameworks
Current research suggests promising avenues for sequential RL:
- Theory-informed model architectures: Leveraging information-theoretic analysis of DAG-structured environments to design scalable policies with provable guarantees (Altabaa et al., 1 Mar 2024).
- Hierarchical and multi-resolution modeling: Adaptive scheduling of timeline granularity in blockwise models, and advances in hierarchical skill discovery for unbounded task or environment sequences (Park et al., 2021, Tahmaz et al., 27 Aug 2025).
- Bridging RL and Bayesian design: Formalizing connections between Bayesian sequential design and RL as in (Tec et al., 2022), to incorporate uncertainty-aware stopping, boundary optimization, and risk modeling.
- Sample-efficient and transferable sequential RL: Unbiased and computationally efficient gradient estimation, effective buffer replay strategies, and model selection for open-ended task sequences.
- Extension to multi-agent and decentralized settings: Generalizing sequential RL frameworks to team and game-theoretic structures, exploiting explicit information structure for scalable learning.
In summary, sequential RL encompasses a broad and highly active domain unifying algorithmic, architectural, and theoretical advances for temporally ordered and resource-constrained learning. The formalization of temporality—not as stepwise action selection but as a property of data, task, or information structure—is central to enabling general and sample-efficient RL in lifelong, partially observable, and real-world domains.