LSTM-based Recurrent Policies
- LSTM-based recurrent policies are temporal function approximators that use internal memory to encode long histories of observations and actions in sequential decision making.
- They integrate gating mechanisms and backpropagation through time to handle partial observability and long-range dependencies in reinforcement learning and control tasks.
- Innovative architectures and training techniques, such as hierarchical LSTMs and gradient clipping, improve stability and performance across diverse applications including robotics and finance.
Long Short-Term Memory (LSTM)-based recurrent policies are a class of temporal function approximators for sequential decision making, where the policy or value function is parameterized by an LSTM recurrent neural network. LSTM-based policies are explicitly designed to handle long-range temporal dependencies and partial observability in stochastic control and reinforcement learning (RL), leveraging the gating and memory cell structure of LSTMs to encode information from arbitrarily long histories of observations and actions.
1. Fundamental Principles of LSTM-based Recurrent Policies
LSTM-based policies replace Markovian policies with functions of the history, encoding the partial-observation sequence via the LSTM's internal state. The essential architectural element is the LSTM cell, defined by input, forget, and output gates; at each time step t, the LSTM ingests the current input, the previous hidden state, and the previous cell state, producing a new memory summary that parameterizes either the action distribution (policy) or the value estimate.
Let x_t denote the per-step input to the LSTM (e.g., the current observation o_t and the prior action a_{t-1}). The recurrence is

i_t = σ(W_i x_t + U_i h_{t-1} + b_i),
f_t = σ(W_f x_t + U_f h_{t-1} + b_f),
o_t = σ(W_o x_t + U_o h_{t-1} + b_o),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c),
h_t = o_t ⊙ tanh(c_t).

The final hidden state h_t is mapped to the policy output, e.g., π(a_t | h_t) or Q(h_t, a_t).
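As an illustrative sketch, the recurrence above can be written in a few lines of NumPy. All names here (lstm_policy_step, W_pi, the dimensions) are hypothetical; a practical implementation would use an autodiff framework rather than hand-rolled gates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_policy_step(x, h_prev, c_prev, W, U, b, W_pi):
    """One step of an LSTM-based policy (sketch, not a tuned implementation).

    x      : per-step input (e.g., observation concatenated with prior action)
    h_prev : previous hidden state; c_prev : previous memory cell
    W, U, b: stacked gate parameters for [input, forget, output, candidate]
    W_pi   : linear map from hidden state to policy logits
    """
    z = W @ x + U @ h_prev + b                    # pre-activations, all gates
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # new memory cell c_t
    h = o * np.tanh(c)                            # new hidden summary h_t
    logits = W_pi @ h                             # policy head: pi(a_t | h_t)
    return h, c, logits

# Roll the cell over a short observation history (hypothetical dimensions).
rng = np.random.default_rng(0)
d_x, d_h, n_act = 6, 8, 3
W = 0.1 * rng.normal(size=(4 * d_h, d_x))
U = 0.1 * rng.normal(size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
W_pi = 0.1 * rng.normal(size=(n_act, d_h))
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(5):
    x = rng.normal(size=d_x)                      # input at step t
    h, c, logits = lstm_policy_step(x, h, c, W, U, b, W_pi)
```

Because the hidden state is carried across steps, the logits at the final step depend on the whole five-step history, not just the last input.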
2. Architectures and Representative Algorithms
LSTM-based recurrent policies have been developed for model-free RL (both off-policy and on-policy), adaptive control, stochastic control with delay, and trading. Core architectures include:
- Model-free RL:
- Recurrent DPG (RDPG) and Recurrent SAC (RSAC) prepend one or more LSTM layers to the standard multilayer perceptron (MLP) actor-critic backbone. In each case, the LSTM outputs are used as hidden summaries to condition policy/actions and Q-values (Heess et al., 2015, Yang et al., 2021).
- Deep Recurrent Q-Network (DRQN) replaces the first post-convolutional layer of DQN with an LSTM, consuming per-frame features instead of frame stacks (Hausknecht et al., 2015).
- LSTM-TD3: Both actor and critic ingest the L most recent observation–action pairs via an LSTM whose memory summary is fused with a current-feature MLP, supporting robust RL in partially observable Markov decision processes (POMDPs) (Meng et al., 2021).
- Perception–Prediction–Reaction (PPR) Agent: A hierarchical multi-core LSTM structure comprising a slow-ticking core for long-term memory and three fast-ticking LSTMs, combined via KL-regularization of the induced policies (Stooke et al., 2020).
- Hybrid methods:
- Supervised-LSTM for state representation is jointly trained with a DQN head (hybrid SL+RL objective), explicitly learning hidden state representations from sequence data (Li et al., 2015).
- Adaptive control:
- LSTM augments standard feedforward adaptive neural network (ANN) controllers, providing rapid correction of transients, improving response to abrupt changes, and supporting formal uniform ultimate boundedness (UUB) stability results via Lyapunov analysis (Inanc et al., 2023).
- Continuous-time control with memory:
- LSTM parameterizes policies for stochastic control with delay, outperforming finite window-based feedforward nets, especially in infinite-delay tasks (Han et al., 2021).
- Financial RL:
- Policy parameterized as an L-layer LSTM stack, mapping raw return histories to continuous trade signals; optimized by directly maximizing risk-adjusted objectives (e.g., Sharpe ratio) (Lu, 2017).
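The fusion pattern used by LSTM-TD3-style actors can be sketched as follows; this is a toy forward pass under assumed names and dimensions (lstm_summary, fused_actor, the parameter dictionary p), not the published architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_summary(seq, W, U, b, d_h):
    """Run a plain LSTM over the past observation-action pairs and return
    the final hidden state as a fixed-size memory summary."""
    h, c = np.zeros(d_h), np.zeros(d_h)
    for x in seq:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def fused_actor(history, obs, p):
    """Actor in the spirit of LSTM-TD3 (sketch): a memory branch summarizes
    the recent history, a feature branch embeds the current observation, and
    a head maps the fused vector to a bounded continuous action."""
    mem = lstm_summary(history, p["W"], p["U"], p["b"], p["d_h"])
    cur = np.tanh(p["Wf"] @ obs)                  # current-feature MLP branch
    fused = np.concatenate([mem, cur])            # fuse memory + features
    return np.tanh(p["Wa"] @ fused)               # squash action into [-1, 1]

rng = np.random.default_rng(0)
d_x, d_h, d_f, n_act = 5, 8, 8, 2                 # hypothetical dimensions
p = {"d_h": d_h,
     "W": 0.1 * rng.normal(size=(4 * d_h, d_x)),
     "U": 0.1 * rng.normal(size=(4 * d_h, d_h)),
     "b": np.zeros(4 * d_h),
     "Wf": 0.1 * rng.normal(size=(d_f, 3)),       # current observation dim 3
     "Wa": 0.1 * rng.normal(size=(n_act, d_h + d_f))}
history = [rng.normal(size=d_x) for _ in range(4)]  # past obs-action pairs
action = fused_actor(history, rng.normal(size=3), p)
```

The design point is that the memory summary and the current features are computed by separate branches and concatenated, rather than feeding everything through the LSTM alone, which is the fusion strategy credited with improved performance in Section 6.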
3. Training Methodologies and Optimization
Training of LSTM-based policies employs backpropagation through time (BPTT) over rollouts, with objectives and update rules depending on the base algorithm:
| Objective Class | Example Algorithms | Loss / Update Mechanism |
|---|---|---|
| Policy gradient / Actor-Critic | RDPG, RSAC, RTD3, LSTM-TD3 | Policy or actor loss is computed along LSTM-unrolled sequences, optionally with entropy terms; Q-functions/Bellman error for critics; soft target network updates (Yang et al., 2021, Meng et al., 2021). |
| Value-based (Q-Learning) | DRQN, Hybrid RL–LSTM DQN | Temporal-difference targets, sequence-based batches, clipped gradients, target network for stability (Hausknecht et al., 2015, Li et al., 2015). |
| Direct risk-sensitive | LSTM-trader | Loss corresponds to Sharpe or Downside Deviation Ratio; gradients via BPTT, with regularization (Lu, 2017). |
| Adaptive control | Hybrid ANN + LSTM | Lyapunov-driven policy update, LSTM trained via MSE to predict residual errors, alternating with ANN adaptation (Inanc et al., 2023). |
| Stochastic control with delay | LSTM delayed-control | Loss corresponds to simulated cost-to-go or utility; optimized by BPTT and Adam (Han et al., 2021). |
Key practical elements include experience replay buffers storing entire episodes or subsequences, zero-initialization of hidden states when sampling, gradient clipping to stabilize recurrent optimization, and layer normalization within LSTM layers (Heess et al., 2015, Yang et al., 2021). Layer stacking and dropout on non-recurrent inputs are additional common engineering choices (Lu, 2017).
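Two of these practical elements, subsequence sampling with zero-initialized hidden states and gradient clipping by global norm, can be sketched directly; the buffer layout and function names here are hypothetical, and real implementations store tensors rather than Python tuples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Replay buffer holding whole episodes of (observation, action, reward) steps.
episodes = [[(rng.normal(size=4), int(rng.integers(3)), float(rng.normal()))
             for _ in range(int(rng.integers(20, 40)))] for _ in range(8)]

def sample_subsequence(episodes, length, rng):
    """Sample a fixed-length training subsequence for truncated BPTT; the
    recurrent state is zero-initialized at the subsequence start rather than
    restored from the original rollout."""
    ep = episodes[int(rng.integers(len(episodes)))]
    start = int(rng.integers(len(ep) - length + 1))
    return ep[start:start + length]

def clip_global_norm(grads, max_norm):
    """Clip a list of gradient arrays by their joint (global) norm, a
    standard stabilizer for recurrent optimization."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))
    return [g * scale for g in grads]

seq = sample_subsequence(episodes, length=10, rng=rng)
clipped = clip_global_norm([rng.normal(size=(8, 8)), rng.normal(size=8)], 1.0)
```

Zero-initializing the hidden state at sampled starts introduces a mild train-time mismatch (the agent acted with a non-zero state at that point in the rollout), which is one reason some implementations instead store and replay hidden states or burn in a prefix of the subsequence.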
4. Empirical Performance and Observed Benefits
LSTM-based recurrent policies consistently match or outperform comparable feedforward and non-recurrent policies in settings where partial observability, path dependence, or long-term credit assignment is present:
- In continuous control POMDPs with missing/noisy observations, LSTM-equipped agents resolve unobserved states, infer dynamics, and integrate temporal context (Heess et al., 2015, Meng et al., 2021).
- Deterministic recurrent off-policy methods (RDPG, RTD3) may struggle with long-term credit assignment and exploration, whereas stochastic policies (RSAC) with LSTM memory achieve more robust performance on memory-demanding benchmarks (Yang et al., 2021).
- In adaptive control, LSTM augmentation of adaptive NNs yields dramatically reduced overshoot, faster settling (under 1 s), a 70% reduction in RMS tracking error, and superior transient compensation of abrupt dynamic changes (Inanc et al., 2023).
- For stochastic control with delay, LSTM policies converge more quickly, exhibit lower variance, and match analytical optima even with infinite memory, unlike fixed-window feedforward nets (Han et al., 2021).
- In high-frequency trading, LSTM-based trading agents discover profitable and low-variance strategies, outperforming shallow recurrent baselines and enabling position inertia (Lu, 2017).
- Purpose-built architectures such as the PPR agent leverage hierarchical LSTMs plus auxiliary KL losses to sharply reduce sample complexity and raise asymptotic scores in multi-task, memory-intensive environments (Stooke et al., 2020).
- Value-based DRQN shows greater stability than DQN under flickering or missing observations, learning without explicit frame stacking and generalizing robustly to new observation regimes (Hausknecht et al., 2015).
5. Theoretical Guarantees and Analysis
Several works provide formal and empirical support for the stability and effectiveness of LSTM-based recurrent policies:
- Lyapunov-based guarantees: In adaptive control, integration of LSTM residual predictors with ANN controllers yields uniform ultimate boundedness for all plant and network signals, enforced by additional robustifying control laws and Lyapunov analysis (Inanc et al., 2023).
- Gradient stability: The LSTM gating structure mitigates vanishing/exploding gradients, permitting BPTT over long sequences and supporting learning in tasks requiring memory over hundreds of steps (Heess et al., 2015, Han et al., 2021, Yang et al., 2021).
- Auxiliary objectives for representation: Hybrid supervised+RL architectures and KL-regularized auxiliary losses (as in the PPR agent) guide the internal memory towards compressive and predictive state summaries, enhancing generalization and representation learning (Stooke et al., 2020, Li et al., 2015).
- Empirical ablations demonstrate that removing memory, switching to windowed input, or omitting past actions from the LSTM input sharply degrades performance, confirming the necessity of recurrent structure for long-range temporal credit assignment and POMDP reasoning (Meng et al., 2021, Yang et al., 2021).
6. Architectural Innovations and Design Considerations
Architectural design choices in LSTM-based recurrent policies include:
- Depth: Single versus multi-layer LSTM stacks; two-layer configurations (hidden size 256) are common in high-dimensional RL (Yang et al., 2021).
- Fusion strategies: Fusing LSTM memory summaries with current features via additional MLPs in actor–critic architectures substantially improves both MDP and POMDP performance over naïve memory integration (Meng et al., 2021).
- Hierarchical memory: Temporal hierarchies (e.g., slow-ticking cores in PPR), parameter sharing, and auxiliary consistency losses further enhance long-term retention and sample efficiency (Stooke et al., 2020).
- Hybrid modules: Augmenting feedforward policy heads or base controllers (ANNs) with LSTM-driven correction policies enables better handling of high-frequency or abrupt dynamics (Inanc et al., 2023).
- Regularization: Non-recurrent input dropout, BPTT gradient clipping, and normalization are essential for stable LSTM training in RL.
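The two regularizers named in the last bullet are simple enough to sketch; gamma, beta, and the inverted-dropout convention below are standard, but the function names and shapes are illustrative assumptions.

```python
import numpy as np

def layer_norm(v, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize an LSTM activation vector to zero mean and unit variance,
    then rescale and shift (layer normalization)."""
    return gamma * (v - v.mean()) / np.sqrt(v.var() + eps) + beta

def input_dropout(x, p, rng):
    """Inverted dropout applied only to the non-recurrent input, leaving the
    hidden-to-hidden path intact (a common choice for recurrent nets)."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)     # rescale at train time, identity at test

rng = np.random.default_rng(2)
v = layer_norm(rng.normal(loc=5.0, size=16))    # normalized activations
x = input_dropout(rng.normal(size=16), p=0.5, rng=rng)
```

Restricting dropout to the input-to-hidden path matters because dropping hidden-to-hidden connections independently at each step destroys the memory the recurrence is meant to carry.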
A recurring theme is that LSTM-based recurrence allows differentiable policies to learn the minimal sufficient statistic—compressing the entire observation-action history into a fixed-dimensional latent state on which to condition decisions—without hand-crafted memory windows or explicit state modeling (Heess et al., 2015, Li et al., 2015).
7. Applications, Limitations, and Future Directions
LSTM-based recurrent policies are now standard in deep RL for robotics, adaptive control, finance, and sequential decision problems with incomplete state information or latent system parameters. Their application spans memory-based locomotion, high-frequency autonomous trading, resource allocation in customer relationship management, and continuous-time stochastic control with delay.
Limitations include the limited interpretability of learned memory states, sensitivity to hyperparameters such as the history truncation length, and the computational overhead of long sequence unrolls during training (Meng et al., 2021). Extensions include regularized or disentangled representations of memory, dynamic memory-length adaptation, hierarchical module design, combination with explicit Bayesian filtering, and more robust exploration via stochastic policy classes (Yang et al., 2021, Stooke et al., 2020).
In sum, LSTM-based recurrent policies provide a scalable, model-free, end-to-end method for deep temporal abstraction and control, representing a key advance for RL and adaptive sequential decision-making in partially observed and memory-intensive environments.