LSTM-TD3: Memory-Based DRL for POMDPs
- LSTM-TD3 is a memory-based deep reinforcement learning algorithm that integrates LSTM modules with TD3 to infer latent states in partially observable environments.
- It employs a modular architecture that separately processes historical trajectories and current observations, leading to improved performance over naive history concatenation.
- Empirical evaluations demonstrate that LSTM-TD3 outperforms traditional methods in POMDPs, making it highly relevant for robotics and autonomous control tasks.
LSTM-TD3 is a memory-based deep reinforcement learning (DRL) algorithm designed to solve continuous control problems under partial observability. It builds on the Twin Delayed Deep Deterministic Policy Gradient (TD3) architecture by augmenting both actor and critic networks with Long Short-Term Memory (LSTM) modules, enabling robust inference of hidden states and more effective handling of noisy, incomplete, or missing observations. LSTM-TD3 and its variants offer distinct architectural innovations and empirical advantages in partially observable Markov decision processes (POMDPs) compared to naive history concatenation and standard TD3.
1. Modular Architecture and Algorithmic Framework
LSTM-TD3 employs a recurrent actor–critic paradigm wherein both policy and value functions are implemented with architecturally modular networks:
- Memory Extraction Module ($Q^{me}$, $\pi^{me}$): An LSTM sub-network processes a trajectory segment comprising the previous $l$ (observation, action) pairs, denoted $h^l_t = \{o_{t-l}, a_{t-l}, \ldots, o_{t-1}, a_{t-1}\}$. Zero vectors are used when the history is unavailable at the initial steps of an episode.
- Current Feature Extraction Module ($Q^{cf}$, $\pi^{cf}$): Parallel sub-networks extract features from the current observation $o_t$ (actor) or the instantaneous observation–action pair $(o_t, a_t)$ (critic).
- Perception Integration Module ($Q^{pi}$, $\pi^{pi}$): Merges the outputs of the memory and current feature extraction via concatenation, producing a latent encoding used for policy output or value estimation.
The network composition for the critic is
$$Q(o_t, a_t, h^l_t) = Q^{pi}\!\left(Q^{me}(h^l_t) \oplus Q^{cf}(o_t, a_t)\right),$$
and for the actor
$$\pi(o_t, h^l_t) = \pi^{pi}\!\left(\pi^{me}(h^l_t) \oplus \pi^{cf}(o_t)\right),$$
where $\oplus$ represents feature concatenation.
This modularity enables explicit separation of temporal and instantaneous information, a design shown to outperform naive stacking of histories (Meng et al., 2021).
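To make this composition concrete, the following PyTorch sketch illustrates the actor's three modules; layer sizes, names (`LSTMTD3Actor`, `hidden_dim`), and activation choices are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class LSTMTD3Actor(nn.Module):
    """Sketch of the modular LSTM-TD3 actor: memory extraction (LSTM over the
    (o, a) history), current feature extraction (MLP over o_t), and perception
    integration (MLP over the concatenated features). Sizes are illustrative."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        # Memory extraction: consumes concatenated (observation, action) pairs.
        self.memory = nn.LSTM(obs_dim + act_dim, hidden_dim, batch_first=True)
        # Current feature extraction: consumes only the current observation.
        self.current = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
        )
        # Perception integration: merges memory and current features.
        self.integrate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh(),
        )

    def forward(self, obs, hist_obs, hist_act):
        # hist_obs: (batch, l, obs_dim), hist_act: (batch, l, act_dim)
        hist = torch.cat([hist_obs, hist_act], dim=-1)
        _, (h_n, _) = self.memory(hist)      # h_n: (1, batch, hidden_dim)
        mem_feat = h_n.squeeze(0)            # latent summary of the history
        cur_feat = self.current(obs)         # features of the current observation
        return self.integrate(torch.cat([mem_feat, cur_feat], dim=-1))
```

The critic follows the same pattern, with its current-feature branch consuming the concatenated $(o_t, a_t)$ and a scalar Q-value head in place of the tanh policy output.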
2. Recurrent Memory and Sequence Encoding
The introduction of an LSTM-based memory component endows the agent with the ability to extract temporal dependencies and infer latent states. For a history length $l$, the memory module processes $h^l_t = \{o_{t-l}, a_{t-l}, \ldots, o_{t-1}, a_{t-1}\}$ and encodes a latent representation that approximates the underlying hidden state of the POMDP. When the input history includes both observations and actions, the LSTM can infer the system's dynamics and compensate for temporally missing or corrupted features.
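For illustration, a minimal sketch of assembling such a zero-padded history segment from per-step arrays (the function name `build_history` and the array layout are assumptions, not code from the paper):

```python
import numpy as np

def build_history(obs_seq, act_seq, t, l, obs_dim, act_dim):
    """Return the l most recent (observation, action) pairs before step t,
    zero-padded on the left when the episode has fewer than l past steps.
    obs_seq: array of shape (T, obs_dim); act_seq: array of shape (T, act_dim)."""
    hist_obs = np.zeros((l, obs_dim), dtype=np.float32)
    hist_act = np.zeros((l, act_dim), dtype=np.float32)
    n = min(l, t)  # number of real past steps available
    if n > 0:
        hist_obs[l - n:] = obs_seq[t - n:t]
        hist_act[l - n:] = act_seq[t - n:t]
    return hist_obs, hist_act
```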
In partially observable settings (e.g., POMDP-RV, POMDP-FLK, POMDP-RSM, POMDP-RN test scenarios), the ability of the LSTM to integrate past informative signals improves estimation of the agent's belief state and subsequent policy decisions. Ablation studies confirm that omitting past actions or current feature extraction degrades robustness and performance, highlighting the necessity of structured sequential processing over ad hoc concatenation (Meng et al., 2021, Omi et al., 2023).
3. Optimization Objective and Training Dynamics
The critic networks are optimized via the twin-delayed mechanism, minimizing the mean squared error against targets smoothed by double critics and target policy noise:
$$L(w_i) = \mathbb{E}\left[\left(Q_{w_i}(o_t, a_t, h^l_t) - y_t\right)^2\right], \quad i = 1, 2,$$
where
$$y_t = r_t + \gamma \min_{i=1,2} Q_{w_i'}\!\left(o_{t+1}, \tilde{a}_{t+1}, h^l_{t+1}\right), \qquad \tilde{a}_{t+1} = \pi_{\theta'}(o_{t+1}, h^l_{t+1}) + \epsilon,$$
and $\epsilon \sim \operatorname{clip}\!\left(\mathcal{N}(0, \sigma), -c, c\right)$ is the clipped target policy noise.
The actor is updated to maximize the expected value of the first critic:
$$J(\theta) = \mathbb{E}\left[Q_{w_1}\!\left(o_t, \pi_\theta(o_t, h^l_t), h^l_t\right)\right].$$
The use of double critics and target policy smoothing—carried over from TD3—notably reduces overestimation errors frequently encountered in actor–critic algorithms (Meng et al., 2021).
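A schematic of the target computation is sketched below, assuming actor and critic signatures that take the current observation plus history as in the formulas above; the batch keys, noise scales, and function name are illustrative assumptions.

```python
import torch

def td3_targets(batch, actor_targ, critic1_targ, critic2_targ,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute smoothed TD3 targets y_t using the twin target critics.
    batch["rew"] and batch["done"] are assumed to be (batch, 1) tensors."""
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        a_next = actor_targ(batch["obs2"], batch["hist_obs2"], batch["hist_act2"])
        noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)

        # Clipped double-Q: take the minimum of the two target critics.
        q1 = critic1_targ(batch["obs2"], a_next, batch["hist_obs2"], batch["hist_act2"])
        q2 = critic2_targ(batch["obs2"], a_next, batch["hist_obs2"], batch["hist_act2"])
        q_min = torch.min(q1, q2)

        # Bootstrapped target; done masks terminal transitions.
        return batch["rew"] + gamma * (1.0 - batch["done"]) * q_min
```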
4. Empirical Performance Across MDPs and POMDPs
Comprehensive evaluation in PyBulletGym environments demonstrates that LSTM-TD3 is competitive with TD3, DDPG, and SAC in fully observable MDPs; yet it substantially outperforms these baselines in POMDPs with missing, noisy, or randomly flickered observations (Meng et al., 2021):
| Environment Type | TD3 vs. LSTM-TD3 | TD3-OW vs. LSTM-TD3 |
|---|---|---|
| MDP (no occlusion/noise) | Comparable performance | LSTM-TD3 competitive |
| POMDP (missing/zeroed/corrupted observations) | LSTM-TD3 superior | LSTM-TD3 superior |
Simply concatenating a window of past observations and actions (TD3-OW) cannot match the structured sequential abstraction provided by the recurrent LSTM module, affirming the significance of dedicated temporal encoding.
Ablation studies further show that removing current feature extraction or not including past actions in the history reduces effectiveness. The optimal choice of history length $l$ is domain-specific; longer histories yield more stable latent-state estimates but increase computational complexity.
5. Robustness and Practical Applicability
LSTM-TD3’s design is especially impactful for real-world settings characterized by partial observability. In robotics, sensor limitations, noise, and occlusions frequently render the state only partially accessible. The recurrent memory enables effective handling of scenarios such as temporary sensor occlusion (e.g., POMDP-FLK), noisy measurements (POMDP-RN), or intermittent data loss.
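These corruption patterns can be emulated with simple observation transformations; the sketch below illustrates the general idea, with probabilities and noise scales chosen arbitrarily rather than taken from the benchmark definitions.

```python
import numpy as np

def flicker(obs, p=0.1, rng=np.random):
    """POMDP-FLK style: with probability p the whole observation is zeroed out."""
    return np.zeros_like(obs) if rng.random() < p else obs

def random_noise(obs, sigma=0.1, rng=np.random):
    """POMDP-RN style: additive Gaussian noise on every observation dimension."""
    return obs + sigma * rng.standard_normal(obs.shape)

def random_sensor_missing(obs, p=0.1, rng=np.random):
    """POMDP-RSM style: each sensor reading is independently dropped (zeroed)."""
    mask = rng.random(obs.shape) >= p
    return obs * mask
```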
Domains directly benefitting from LSTM-TD3 include:
- Robot navigation and manipulation under sensory uncertainty.
- Autonomous vehicle control where sensors may be unreliable or incomplete.
- Human–robot and multi-agent interaction environments with information asymmetry.
The method’s modular and recurrent architecture offers a blueprint for robust controller design under practical non-idealities (Meng et al., 2021, Omi et al., 2023).
6. Architectural and Algorithmic Variants
Recent studies propose several structural variants to further improve robustness and computational efficiency (Omi et al., 2023):
- Single-head channel architectures: Rather than prioritizing the current observation through a separate input branch, the entire (observation, action) sequence is processed in a unified channel, which improves estimation of the system dynamics and convergence (see the sketch at the end of this section).
- Hidden-state sharing (H-TD3): The actor LSTM's internal states are carried over and used to initialize the critic LSTM during training, reducing repeated reprocessing of full trajectories and thereby decreasing computational overhead.
Experimental results show that including past actions and flexible sequence length selection are critical for sample-efficient and robust learning—especially under dynamic disturbance patterns, such as temporal sinusoidal waves or random noise added to observations. Architectures that unify processing of past and current information generally achieve superior performance.
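A minimal sketch of the single-channel idea follows, in which the current (observation, action) pair is appended to the history and one LSTM encodes the whole sequence while exposing its hidden state for reuse; this illustrates the concept under assumed shapes and names, not the exact architectures of Omi et al. (2023).

```python
import torch
import torch.nn as nn

class SingleChannelCritic(nn.Module):
    """Single-head variant: one LSTM consumes the history plus the current
    (observation, action) pair, so no separate current-feature branch is needed."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs, act, hist_obs, hist_act, hidden=None):
        # Append the current step to the history and encode the full sequence.
        seq_obs = torch.cat([hist_obs, obs.unsqueeze(1)], dim=1)
        seq_act = torch.cat([hist_act, act.unsqueeze(1)], dim=1)
        seq = torch.cat([seq_obs, seq_act], dim=-1)
        out, hidden = self.lstm(seq, hidden)   # hidden can be carried over or shared
        return self.head(out[:, -1]), hidden   # Q-value from the last time step
```

Returning the LSTM hidden state makes it available for initialization elsewhere, which is the general mechanism behind the hidden-state-sharing variant described above.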
7. Challenges, Limitations, and Future Directions
Key challenges remain in optimal selection of history length for trajectory encoding, efficient separation of current and historical features, and mitigation of overestimation bias. LSTM-TD3’s solution involves modularizing network inputs and leveraging double critics and target policy smoothing.
Performance degrades when the history window is overly large, which introduces redundancy and escalates computation; conversely, too short a window may fail to capture the necessary temporal dependencies. The history length $l$ therefore remains a hyperparameter that must be tuned empirically.
Recent developments suggest integration of additional uncertainty-handling mechanisms (e.g., dynamic exploration–robustness schedules as in ADER (Zhou et al., 2021)) or variance-reduction strategies (e.g., Taylor expansions (Garibbo et al., 2023)) may complement or be synergistically combined with LSTM-TD3 in memory-based RL architectures.
Taken together, the LSTM-TD3 family represents a principled, empirically validated solution to deep reinforcement learning under partial observability, with substantial implications for robust real-world decision making and control.