
LSTM-TD3: Memory-Based DRL for POMDPs

Updated 9 October 2025
  • LSTM-TD3 is a memory-based deep reinforcement learning algorithm that integrates LSTM modules with TD3 to infer latent states in partially observable environments.
  • It employs a modular architecture that separately processes historical trajectories and current observations, leading to improved performance over naive history concatenation.
  • Empirical evaluations demonstrate that LSTM-TD3 outperforms traditional methods in POMDPs, making it highly relevant for robotics and autonomous control tasks.

LSTM-TD3 is a memory-based deep reinforcement learning (DRL) algorithm designed to solve continuous control problems under partial observability. It builds on the Twin Delayed Deep Deterministic Policy Gradient (TD3) architecture by augmenting both actor and critic networks with Long Short-Term Memory (LSTM) modules, enabling robust inference of hidden states and more effective handling of noisy, incomplete, or missing observations. Compared with naive history concatenation and standard TD3, LSTM-TD3 and its variants offer distinct architectural innovations and empirical advantages in Partially Observable Markov Decision Processes (POMDPs).

1. Modular Architecture and Algorithmic Framework

LSTM-TD3 employs a recurrent actor–critic paradigm wherein both policy and value functions are implemented with architecturally modular networks:

  • Memory Extraction Module ($Q^{me}$, $\mu^{me}$): An LSTM sub-network processes a trajectory segment comprising the previous $l$ (observation, action) pairs, denoted $h_t^l$. Zero vectors are used when the history is unavailable at initial steps.
  • Current Feature Extraction Module ($Q^{cf}$, $\mu^{cf}$): Parallel sub-networks extract state or state–action features from the instantaneous observation–action pair $(o_t, a_t)$.
  • Perception Integration Module ($Q^{pi}$, $\mu^{pi}$): Merges the outputs of the memory and current feature extraction modules via concatenation, producing a latent encoding used for policy output or value estimation.

The network composition for the critic is:

$$Q(o_t, a_t, h_t^l) = Q^{pi}\big(Q^{me}(h_t^l) \,\|\, Q^{cf}(o_t, a_t)\big)$$

and for the actor:

$$\mu(o_t, h_t^l) = \mu^{pi}\big(\mu^{me}(h_t^l) \,\|\, \mu^{cf}(o_t)\big)$$

where $\|$ denotes feature concatenation.

This modularity enables explicit separation of temporal and instantaneous information, a design shown to outperform naive stacking of histories (Meng et al., 2021).
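As an illustration of this composition, the following PyTorch sketch implements the three modules for both actor and critic. Layer sizes, hidden dimensions, and the class and argument names are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class LSTMTD3Critic(nn.Module):
    """Q(o_t, a_t, h_t^l): memory extraction + current feature extraction + perception integration."""
    def __init__(self, obs_dim, act_dim, mem_hidden=128, feat_hidden=128):
        super().__init__()
        # Memory extraction module Q^me: LSTM over past (observation, action) pairs.
        self.mem_lstm = nn.LSTM(obs_dim + act_dim, mem_hidden, batch_first=True)
        # Current feature extraction module Q^cf: MLP over the current (o_t, a_t).
        self.cur_mlp = nn.Sequential(
            nn.Linear(obs_dim + act_dim, feat_hidden), nn.ReLU())
        # Perception integration module Q^pi: MLP over the concatenated features.
        self.pi_mlp = nn.Sequential(
            nn.Linear(mem_hidden + feat_hidden, feat_hidden), nn.ReLU(),
            nn.Linear(feat_hidden, 1))

    def forward(self, obs, act, hist):
        # hist: (batch, l, obs_dim + act_dim), zero-padded at episode start.
        mem_out, _ = self.mem_lstm(hist)
        mem_feat = mem_out[:, -1]                      # last LSTM output summarizes h_t^l
        cur_feat = self.cur_mlp(torch.cat([obs, act], dim=-1))
        return self.pi_mlp(torch.cat([mem_feat, cur_feat], dim=-1))

class LSTMTD3Actor(nn.Module):
    """mu(o_t, h_t^l): same modular structure, tanh-squashed action output."""
    def __init__(self, obs_dim, act_dim, act_limit, mem_hidden=128, feat_hidden=128):
        super().__init__()
        self.mem_lstm = nn.LSTM(obs_dim + act_dim, mem_hidden, batch_first=True)
        self.cur_mlp = nn.Sequential(nn.Linear(obs_dim, feat_hidden), nn.ReLU())
        self.pi_mlp = nn.Sequential(
            nn.Linear(mem_hidden + feat_hidden, feat_hidden), nn.ReLU(),
            nn.Linear(feat_hidden, act_dim), nn.Tanh())
        self.act_limit = act_limit

    def forward(self, obs, hist):
        mem_out, _ = self.mem_lstm(hist)
        mem_feat = mem_out[:, -1]
        cur_feat = self.cur_mlp(obs)
        return self.act_limit * self.pi_mlp(torch.cat([mem_feat, cur_feat], dim=-1))
```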

2. Recurrent Memory and Sequence Encoding

The introduction of an LSTM-based memory component endows the agent with the ability to extract temporal dependencies and infer latent states. For a history length $l$, the memory module processes $h_t^l = [(o_{t-l}, a_{t-l}), \dots, (o_{t-1}, a_{t-1})]$ and encodes a latent representation that approximates the underlying hidden state $s_t$ in a POMDP. When the input history includes both observations and actions, the LSTM can infer the system's dynamics and compensate for temporally missing or corrupted features.
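A minimal sketch of how such a history segment might be assembled from buffered transitions, assuming NumPy arrays and the zero-padding convention described in Section 1 (the buffer and function names here are hypothetical):

```python
import numpy as np

def build_history(obs_buf, act_buf, t, hist_len):
    """Assemble h_t^l = [(o_{t-l}, a_{t-l}), ..., (o_{t-1}, a_{t-1})],
    zero-padding entries that precede the start of the episode."""
    obs_dim, act_dim = obs_buf.shape[1], act_buf.shape[1]
    hist = np.zeros((hist_len, obs_dim + act_dim), dtype=np.float32)
    n_avail = min(hist_len, t)            # number of valid past steps
    if n_avail > 0:
        past_obs = obs_buf[t - n_avail:t]
        past_act = act_buf[t - n_avail:t]
        hist[hist_len - n_avail:] = np.concatenate([past_obs, past_act], axis=1)
    return hist
```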

In partially observable settings (e.g., POMDP-RV, POMDP-FLK, POMDP-RSM, POMDP-RN test scenarios), the ability of the LSTM to integrate past informative signals improves estimation of the agent's belief state and subsequent policy decisions. Ablation studies confirm that omitting past actions or current feature extraction degrades robustness and performance, highlighting the necessity of structured sequential processing over ad hoc concatenation (Meng et al., 2021, Omi et al., 2023).
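For intuition, the flickering and noisy-observation conditions can be approximated with simple observation wrappers. The sketch below is written against the Gymnasium ObservationWrapper API, with the corruption probability and noise scale chosen purely for illustration; wrapping a continuous-control task with either wrapper turns the nominal MDP into a POMDP of the corresponding type.

```python
import numpy as np
import gymnasium as gym

class FlickerObservation(gym.ObservationWrapper):
    """POMDP-FLK-style corruption: with probability p the whole observation is zeroed."""
    def __init__(self, env, p=0.1):
        super().__init__(env)
        self.p = p

    def observation(self, obs):
        return np.zeros_like(obs) if np.random.rand() < self.p else obs

class NoisyObservation(gym.ObservationWrapper):
    """POMDP-RN-style corruption: additive Gaussian noise on every observation."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.sigma, size=obs.shape).astype(obs.dtype)
```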

3. Optimization Objective and Training Dynamics

The critic networks are optimized via the twin delayed mechanism, minimizing the mean squared error with targets smoothed by double critics and policy noise:

$$\min_{\theta^{(Q_j)}} \mathbb{E}\left[\big(Q_j(o_t, a_t, h_t^l) - \hat{Y}\big)^2\right]$$

where

$$\hat{Y} = r_t + \gamma (1 - d_t)\,\min\big\{Q_1^-(o_{t+1}, a^-, h_{t+1}^l),\; Q_2^-(o_{t+1}, a^-, h_{t+1}^l)\big\}$$

and $a^- = \mu^-(o_{t+1}, h_{t+1}^l) + \epsilon$, with $\epsilon$ denoting clipped target-policy smoothing noise.

The actor is updated to maximize the expected $Q$ value:

$$\max_{\theta^{(\mu)}} \mathbb{E}\left[Q\big(o_t, \mu(o_t, h_t^l), h_t^l\big)\right]$$

The use of double critics and target policy smoothing—carried over from TD3—notably reduces overestimation errors frequently encountered in actor–critic algorithms (Meng et al., 2021).
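The objectives above translate into an update step roughly as follows. This sketch assumes the module interfaces from the earlier architecture sketch, target networks suffixed with `_targ`, a single optimizer over both critics, and batch tensors of shape (batch, dim); none of these details are prescribed by the original paper.

```python
import torch
import torch.nn.functional as F

def update(batch, actor, q1, q2, actor_targ, q1_targ, q2_targ,
           q_optim, pi_optim, gamma=0.99, policy_noise=0.2, noise_clip=0.5,
           act_limit=1.0, update_actor=True):
    obs, act, rew = batch["obs"], batch["act"], batch["rew"]
    obs2, done = batch["obs2"], batch["done"]
    hist, hist2 = batch["hist"], batch["hist2"]        # h_t^l and h_{t+1}^l

    with torch.no_grad():
        # Target action with clipped smoothing noise (TD3 target policy smoothing).
        noise = (torch.randn_like(act) * policy_noise).clamp(-noise_clip, noise_clip)
        act2 = (actor_targ(obs2, hist2) + noise).clamp(-act_limit, act_limit)
        # Clipped double-Q target.
        q_targ = torch.min(q1_targ(obs2, act2, hist2), q2_targ(obs2, act2, hist2))
        backup = rew + gamma * (1.0 - done) * q_targ

    # Critic regression toward the smoothed, clipped double-Q target.
    q_loss = F.mse_loss(q1(obs, act, hist), backup) + F.mse_loss(q2(obs, act, hist), backup)
    q_optim.zero_grad(); q_loss.backward(); q_optim.step()

    # Delayed actor update: maximize Q_1(o_t, mu(o_t, h_t^l), h_t^l).
    if update_actor:
        pi_loss = -q1(obs, actor(obs, hist), hist).mean()
        pi_optim.zero_grad(); pi_loss.backward(); pi_optim.step()
```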

4. Empirical Performance Across MDPs and POMDPs

Comprehensive evaluation in PyBulletGym environments demonstrates that LSTM-TD3 is competitive with TD3, DDPG, and SAC in fully observable MDPs; yet it substantially outperforms these baselines in POMDPs with missing, noisy, or randomly flickered observations (Meng et al., 2021):

| Environment Type | TD3 vs. LSTM-TD3 | TD3-OW vs. LSTM-TD3 |
|---|---|---|
| MDP (no occlusion/noise) | Comparable performance | LSTM-TD3 competitive |
| POMDP (missing/zeroed/corrupted observations) | LSTM-TD3 superior | LSTM-TD3 superior |

Simply concatenating a window of past observations and actions (TD3-OW) cannot match the structured sequential abstraction provided by the recurrent LSTM module, affirming the significance of dedicated temporal encoding.

Ablation studies further show that removing current feature extraction or omitting past actions from the history reduces effectiveness. The optimal choice of history length $l$ is domain-specific; longer histories yield more stable latent state estimates but increase computational cost.

5. Robustness and Practical Applicability

LSTM-TD3’s design is especially impactful for real-world settings characterized by partial observability. In robotics, sensor limitations, noise, and occlusions frequently render the state only partially accessible. The recurrent memory enables effective handling of scenarios such as temporary sensor occlusion (e.g., POMDP-FLK), noisy measurements (POMDP-RN), or intermittent data loss.

Domains directly benefitting from LSTM-TD3 include:

  • Robot navigation and manipulation under sensory uncertainty.
  • Autonomous vehicle control where sensors may be unreliable or incomplete.
  • Human–robot and multi-agent interaction environments with information asymmetry.

The method’s modular and recurrent architecture offers a blueprint for robust controller design under practical non-idealities (Meng et al., 2021, Omi et al., 2023).

6. Architectural and Algorithmic Variants

Recent studies propose several structural variants to further improve robustness and computational efficiency (Omi et al., 2023):

  • Single-head channel architectures (LSTM-TD3$_{1ha1hc}$, LSTM-TD3$_{1ha2hc}$): Rather than giving the current observation a separate input channel, processing the entire sequence in a unified channel improves dynamics estimation and convergence.
  • Hidden-state sharing (H-TD3): The actor LSTM's internal states are carried over and used to initialize the critic LSTM during training, reducing the need for repeated full-trajectory reprocessing and thereby decreasing computational overhead.

Experimental results show that including past actions and flexible sequence length selection are critical for sample-efficient and robust learning—especially under dynamic disturbance patterns, such as temporal sinusoidal waves or random noise added to observations. Architectures that unify processing of past and current information generally achieve superior performance.
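A sketch of the single-head channel idea is given below, under the assumption that the current observation (with a zero action placeholder) is simply appended to the history and the whole sequence is encoded by one LSTM; layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class SingleChannelActor(nn.Module):
    """Single-head channel variant: the current observation is appended to the
    (observation, action) history and the full sequence is encoded by one LSTM."""
    def __init__(self, obs_dim, act_dim, act_limit, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, act_dim), nn.Tanh())
        self.act_limit = act_limit

    def forward(self, obs, hist):
        # Append (o_t, 0) as the final sequence element so past and current
        # information flow through the same recurrent channel.
        pad = torch.zeros(obs.shape[0], hist.shape[-1] - obs.shape[-1], device=obs.device)
        cur = torch.cat([obs, pad], dim=-1).unsqueeze(1)
        seq = torch.cat([hist, cur], dim=1)
        out, _ = self.lstm(seq)
        return self.act_limit * self.head(out[:, -1])
```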

7. Challenges, Limitations, and Future Directions

Key challenges remain in optimal selection of history length $l$ for trajectory encoding, efficient separation of current and historical features, and mitigation of overestimation bias. LSTM-TD3's solution involves modularizing network inputs and leveraging double critics and target policy smoothing.

Performance degrades when the history window is overly large, which introduces redundancy and escalates computation; conversely, too short a window may fail to encode the necessary temporal dependencies. The history length $l$ therefore remains a hyperparameter that must be tuned empirically.

Recent developments suggest that additional uncertainty-handling mechanisms (e.g., dynamic exploration–robustness schedules as in ADER (Zhou et al., 2021)) or variance-reduction strategies (e.g., Taylor expansions (Garibbo et al., 2023)) may complement LSTM-TD3 within memory-based RL architectures.

Taken together, the LSTM-TD3 family represents a principled, empirically validated solution to deep reinforcement learning under partial observability, with substantial implications for robust real-world decision making and control.
