
Recurrent Deterministic Policy Gradient (RDPG)

Updated 5 March 2026
  • RDPG is a deep reinforcement learning method that extends deterministic policy gradients with RNNs to handle partial observability in continuous control tasks.
  • It leverages backpropagation through time, full-episode experience replay, and target networks to optimize actor-critic architectures for improved stability.
  • Empirical results show RDPG outperforms non-recurrent baselines, achieving higher performance and sample efficiency in environments with memory and noise challenges.

The Recurrent Deterministic Policy Gradient (RDPG) algorithm is a model-free deep reinforcement learning method designed for partially observable environments. RDPG extends Deterministic Policy Gradient (DPG) by leveraging recurrent neural networks (RNNs), typically LSTM or GRU, to maintain a memory of the agent’s interaction history. This endows RDPG with the capacity to infer hidden state information and integrate temporal dependencies, providing a robust framework for continuous control tasks where full observability is absent. RDPG utilizes backpropagation through time (BPTT) to optimize both the actor and the critic networks, incorporates experience replay buffers containing entire episodes for sample efficiency, and employs target networks for stabilized learning. Empirically, RDPG achieves superior performance over non-recurrent baselines such as DDPG in domains with significant history dependence or sensor/observation noise (Heess et al., 2015, Yang et al., 2021, Song et al., 2017, Gharehgoli et al., 2022).

1. RDPG in the Partially Observable Markov Decision Process

RDPG addresses the Partially Observable Markov Decision Process (POMDP) formalism, where at each time step $t$ the agent observes $o_t$ (as opposed to the full state $s_t$) and must integrate a history $h_t = (o_1, a_1, o_2, \ldots, o_t)$ into a latent representation via an RNN. A deterministic recurrent policy $\mu_\theta$ maps the internal hidden state $h_t$ to actions $a_t = \mu_\theta(h_t)$. The action-value function is defined as $Q^\mu(h_t, a_t)$, predicting the expected return from history $h_t$ and action $a_t$ under the current policy. RDPG enables off-policy updates, using episode-level replay to preserve temporal consistency across experience (Heess et al., 2015, Song et al., 2017, Yang et al., 2021, Gharehgoli et al., 2022).
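The history-to-action mapping above can be sketched as a minimal recurrent deterministic policy. This is an illustrative numpy stand-in, not the papers' exact architecture: the plain tanh RNN cell, dimensions, and class name are assumptions for demonstration.

```python
import numpy as np

class RecurrentActor:
    """Sketch of a deterministic recurrent policy a_t = mu_theta(h_t)."""

    def __init__(self, obs_dim, act_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_dim)
        self.W_in = rng.uniform(-s, s, (hidden_dim, obs_dim))
        self.W_h = rng.uniform(-s, s, (hidden_dim, hidden_dim))
        self.W_out = rng.uniform(-s, s, (act_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def initial_state(self):
        return np.zeros(self.hidden_dim)

    def step(self, obs, h):
        """Fold one observation into the hidden state; return (action, h')."""
        h_next = np.tanh(self.W_in @ obs + self.W_h @ h)
        action = np.tanh(self.W_out @ h_next)  # tanh head for bounded actions
        return action, h_next

# Roll the policy over an episode of noisy observations: the hidden state
# is the agent's only memory of the interaction history h_t.
actor = RecurrentActor(obs_dim=4, act_dim=2, hidden_dim=16)
h = actor.initial_state()
for obs in np.random.default_rng(1).normal(size=(10, 4)):
    action, h = actor.step(obs, h)
```

In a full implementation the tanh cell would be an LSTM or GRU and the weights would be trained with BPTT, as described in the following section.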

2. Algorithmic Structure and Optimization

The core of RDPG is the extension of the deterministic policy gradient theorem to recurrent structures, resulting in actor and critic networks that operate over entire sequences. The update for policy parameters θ\theta takes the form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\mathcal{D}} \left[ \sum_{t=0}^{T-1} \gamma^t\, \nabla_\theta \mu_\theta(h_t)\, \nabla_a Q^\mu(h_t, a)\Big|_{a=\mu_\theta(h_t)} \right]$$

with the critic QωQ_\omega trained to minimize the Bellman error:

$$L(\omega) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \left[ y_t^{(i)} - Q_\omega\big(h_t^{(i)}, a_t^{(i)}\big) \right]^2$$

where $y_t = r_t + \gamma\, Q_{\omega'}\big(h_{t+1}, \mu_{\theta'}(h_{t+1})\big)$ and $(\theta', \omega')$ are slowly updated target network parameters. Backpropagation through time is employed for all updates. Experience replay buffers store full episodes (or sub-trajectories with special tracking of RNN state), ensuring that each sampled transition is temporally coherent (Heess et al., 2015, Yang et al., 2021, Song et al., 2017, Gharehgoli et al., 2022).
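The critic targets $y_t$ can be computed per episode as a short sketch. The target networks are stubbed out here as simple callables; their names, shapes, and the terminal-step handling (bootstrapping zero past the episode end) are illustrative assumptions, not the papers' exact code.

```python
import numpy as np

def td_targets(rewards, next_histories, target_actor, target_critic, gamma=0.99):
    """Compute y_t = r_t + gamma * Q'(h_{t+1}, mu'(h_{t+1})) for one episode."""
    T = len(rewards)
    y = np.empty(T)
    for t in range(T):
        if t == T - 1:
            y[t] = rewards[t]  # assumption: no bootstrap past the episode end
        else:
            a_next = target_actor(next_histories[t])
            y[t] = rewards[t] + gamma * target_critic(next_histories[t], a_next)
    return y

# Stub target networks: in RDPG these are slowly updated copies of the
# recurrent actor and critic (parameters theta', omega').
target_actor = lambda h: np.tanh(h.sum())
target_critic = lambda h, a: float(h.mean() + a)

rewards = np.array([1.0, 0.0, 1.0])
next_histories = [np.array([0.1, 0.2]), np.array([0.3, 0.4]), None]
y = td_targets(rewards, next_histories, target_actor, target_critic)
```

The resulting `y` vector plugs directly into the squared Bellman error $L(\omega)$ above.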

3. Neural Architectures and Design Variants

RDPG implementations employ deep recurrent architectures for both actor and critic:

  • Recurrent core: LSTM or GRU stacks (typically 1–2 layers, 128–512 hidden units) ingest current observations and propagate hidden states across each episode.
  • Feedforward heads: Outputs from recurrent cores are passed to MLP heads (commonly two layers with 64–512 units, ReLU activations), with final activations being tanh or linear depending on action space constraints.
  • Network variants: Pixel-based tasks utilize convolutional front-ends preceding the recurrent layers; proprioceptive input tasks concatenate observations and prior actions as inputs.
  • Target networks: Identical architectures for target actors/critics, updated by Polyak averaging after each training iteration ($\tau \in [0.001, 0.01]$) (Heess et al., 2015, Yang et al., 2021, Song et al., 2017, Gharehgoli et al., 2022).
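The Polyak (soft) update named in the last bullet, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, is a one-liner per parameter. In this sketch a plain dictionary stands in for the full network weights.

```python
import numpy as np

def polyak_update(online, target, tau=0.005):
    """Move each target parameter a small step toward its online counterpart."""
    for name, w in online.items():
        target[name] = tau * w + (1.0 - tau) * target[name]
    return target

# Toy example: after one update, the zero-initialized target weights move
# a fraction tau of the way toward the online weights.
online = {"W": np.ones((2, 2))}
target = {"W": np.zeros((2, 2))}
polyak_update(online, target, tau=0.01)
```

The same routine is applied to both the target actor and the target critic after each training iteration.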

4. Training Protocols and Stabilization Strategies

Effective RDPG training requires careful handling of recurrent state and temporal dependencies:

  • Replay buffer: Storage of full episodes (buffer size ranging from $5\times10^3$ to $6\times10^5$ episodes), essential for accurate reconstruction of history-dependent hidden states.
  • Sequence initialization: For sub-trajectory sampling, hidden states are initialized either to zero or, for improved stability, via a preview or scan of prior observations (“scanning” technique) (Song et al., 2017).
  • Gradient propagation: BPTT across sampled sequences, optionally truncated for computational efficiency. Gradient norm clipping (e.g., 0.5–1) is essential to mitigate vanishing or exploding gradients, particularly for very long episodes.
  • Exploration: Additive Gaussian or Ornstein–Uhlenbeck noise on the policy output.
  • Input normalization and burn-in: Observations (including CSI/demand in network slicing) are normalized, and burn-in steps are sometimes used for LSTM state stabilization.
  • Optimization: Typical learning rates for Adam are $10^{-3}$ for the critic, $10^{-4}$ for the actor, decayed linearly or held constant.
  • Soft-update targets: Polyak averaging (e.g., $\tau = 0.001$) for stable target network tracking (Heess et al., 2015, Yang et al., 2021, Gharehgoli et al., 2022).
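The Ornstein–Uhlenbeck exploration noise from the list above can be sketched as a mean-reverting random walk added to the deterministic policy output. The `theta`/`sigma` values below are common defaults, not values prescribed by the cited papers.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, act_dim, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(act_dim, mu, dtype=float)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # Mean-reverting drift toward mu plus Gaussian diffusion.
        dx = self.theta * (self.mu - self.state) \
            + self.sigma * self.rng.normal(size=self.state.shape)
        self.state = self.state + dx
        return self.state

# Perturb a deterministic action and clip back into the valid action range.
noise = OUNoise(act_dim=2)
action = np.tanh(np.array([0.5, -0.5]))  # deterministic policy output
noisy_action = np.clip(action + noise.sample(), -1.0, 1.0)
```

Because consecutive samples are correlated, OU noise tends to produce smoother exploratory trajectories than independent Gaussian noise, which suits physical control tasks.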

5. Extensions and Methodological Innovations

Studies have introduced several enhancements to canonical RDPG:

  • Tail-step bootstrap of interpolated TD: Blending $n$-step and $\lambda$-interpolated TD targets at each position within a sampled trajectory slice to reduce bias and variance in the critic update.
  • Hidden-state initialization via trajectory scanning: Feeding prior observations into the RNN before training steps to approximate the correct hidden state context for each slice, avoiding the cold-start artifact of zero initialization.
  • Experience injection: Augmenting replay buffer with trajectories produced by external (teacher) policies to accelerate behavior diversification and avoid monotonicity in policy learning. This is annealed over training to maintain on-policy relevance (Song et al., 2017).
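The hidden-state scanning idea in the second bullet can be sketched as follows: before training on a sampled slice starting at step $t_0$, the preceding observations are replayed through the RNN (without gradient tracking) so the slice begins from an approximately correct hidden state rather than zeros. The `rnn_step` function here is a toy stand-in for the recurrent core, and all names are illustrative assumptions.

```python
import numpy as np

def scan_initial_state(rnn_step, observations, t0, hidden_dim):
    """Replay obs[0:t0] through the RNN to warm-start the slice's hidden state."""
    h = np.zeros(hidden_dim)
    for obs in observations[:t0]:
        h = rnn_step(obs, h)
    return h

# Toy RNN core: a tanh cell with fixed small weights.
W_in = np.full((3, 2), 0.1)
W_h = np.eye(3) * 0.5
rnn_step = lambda obs, h: np.tanh(W_in @ obs + W_h @ h)

observations = np.ones((8, 2))
h0 = scan_initial_state(rnn_step, observations, t0=5, hidden_dim=3)
h_cold = np.zeros(3)  # the zero cold-start the scan is meant to replace
```

The scanned state `h0` carries history context that the zero initialization discards, which is what avoids the cold-start artifact described above.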

6. Empirical Performance and Benchmarking

RDPG achieves consistent advantages over non-recurrent baselines (e.g., DDPG) in partially observable domains:

  • In continuous control with memory requirements (e.g., memory-Reacher, Walker/Hopper with sensory dropout, BipedalWalker with rough terrain), RDPG reliably outperforms feedforward approaches, reaching asymptotic performance with 2–4x greater sample efficiency (Yang et al., 2021, Song et al., 2017, Heess et al., 2015).
  • In end-to-end network slicing for 5G+ under demand and CSI uncertainty, RDPG yields up to 65–100% higher infrastructure provider utility than state-of-the-art SAC and DDPG competitors, retaining 80% of its performance under 30% demand uncertainty compared to SAC’s 40%. RDPG converges in approximately 2000 episodes, compared to over 3000 for SAC and 3500 for DDPG (Gharehgoli et al., 2022).
  • Qualitatively, RDPG agents demonstrate robust long-horizon adaptation and explicit memory-use strategies, such as integrating noisy sensory measurements over time, recalling task parameters (e.g., system identification), and exploiting search-recall strategies in the Morris water maze (Heess et al., 2015, Song et al., 2017).

7. Limitations and Open Issues

While RDPG offers significant improvements in partially observable settings, several challenges persist:

  • Stability: Long-horizon tasks can expose RNNs to gradient instabilities. Addressing these requires aggressive gradient clipping, normalization, and sometimes truncated BPTT (Yang et al., 2021, Song et al., 2017).
  • Computational cost: Per-step time complexity scales as $O(H^2)$ in the history length $H$ due to BPTT, with increased memory and compute burden relative to feedforward approaches (Gharehgoli et al., 2022).
  • Exploration: Model-free RDPG can struggle with exploration in environments requiring systematic long-horizon strategies, where even advanced recurrent variants lag behind methods with advanced exploration heuristics (Yang et al., 2021).
  • Bias in TD estimation: When sampling short sub-trajectories, improper handling of initial hidden state and TD horizon can bias value estimates, requiring methods such as scan-initialization and TD interpolation (Song et al., 2017).
  • Generalization: The impact of external experience injection and history length hyperparameters remains domain-specific and requires meticulous tuning for optimal generalization (Song et al., 2017).

References: (Heess et al., 2015, Song et al., 2017, Yang et al., 2021, Gharehgoli et al., 2022)
