
Recurrent Deep Q-Networks (DRQN) Overview

Updated 5 March 2026
  • DRQN is a reinforcement learning architecture that integrates recurrent layers (LSTM/GRU) to address partial observability by maintaining a memory of past observations.
  • It extends the traditional DQN framework with additional sequence modeling, improving credit assignment and performance in tasks where state information is incomplete.
  • Empirical studies demonstrate DRQN's effectiveness in diverse domains like Atari games, autonomous driving, and dialogue management, particularly under noisy or occluded conditions.

A Deep Recurrent Q-Network (DRQN) is a variant of the Deep Q-Network (DQN) architecture for reinforcement learning (RL) specifically designed to handle environments with partial observability. By integrating recurrent layers such as LSTMs or GRUs into Q-learning frameworks, DRQN allows agents to maintain a memory of past observations, aggregate information over time, and thus act effectively even when the underlying Markov state is hidden. DRQN and its derivatives have been widely studied across domains, including Atari games, autonomous driving, fog computing, adaptive systems, dialogue management, and multi-agent learning.

1. Foundations and Motivation

The canonical DQN framework assumes full observability and processes the current state (or a stack of recent frames) to estimate Q-values for all actions. In partially observable Markov decision processes (POMDPs), the agent receives incomplete or aliased observations $o_t$ rather than the true state $s_t$. The original DQN addresses this via stacking a fixed window of previous frames, but this is insufficient when critical information lies outside this temporal window or the structure of observation noise is irregular.

To address such scenarios, Hausknecht and Stone introduced DRQN, which augments the DQN with a recurrent layer (typically an LSTM or GRU) after the convolutional backbone. This recurrence enables the agent to construct an internal belief state $h_t$ by integrating arbitrary histories of observations, making DRQN particularly well-suited for environments with occlusion, missing input, or rich temporal dependencies (Hausknecht et al., 2015).

DRQN’s advantage becomes more pronounced as the complexity and partial observability of the environment increase, enabling robust policies that adapt to varying degrees of observation fidelity. This property underpins its adoption across partially observable domains (Deshpande et al., 2020, Harb et al., 2017, Zhao et al., 2016).

2. Network Architectures and Algorithmic Variants

Core DRQN Structure

A prototypical DRQN consists of:

  • Input: At each time-step $t$, the agent receives an observation $o_t$ (e.g., an image, feature vector, or sequence).
  • Feature Extraction: A stack of convolutional or dense layers processes $o_t$. In vision domains, this is typically a 3-layer CNN (e.g., $84\times84$ input to $7\times7\times64$ activation) (Hausknecht et al., 2015).
  • Recurrence: The flattened feature map feeds into an LSTM (or GRU), yielding a hidden state $h_t = \mathrm{LSTM}(h_{t-1}, \phi(o_t))$.
  • Q-Head: The recurrent output $h_t$ is projected to $|A|$ Q-values via a fully connected layer, where $|A|$ is the size of the discrete action space.

This structure is extensible: stacked LSTMs have been used to improve capacity in urban autonomous driving (Deshpande et al., 2020), and GRU-based DRQN variants are common in resource allocation for fog networks (Baek et al., 2020).
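The structure above can be sketched in a few dozen lines. The following is a minimal, illustrative NumPy implementation of a single GRU-based DRQN step; the convolutional feature extractor is assumed to have already produced a feature vector $\phi(o_t)$, and all layer sizes are arbitrary choices for the example, not values from any cited paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DRQNCell:
    """Minimal GRU-based DRQN step: phi(o_t) -> hidden state -> Q-values.

    Illustrative only: the CNN backbone is assumed to have already
    produced the feature vector phi(o_t).
    """

    def __init__(self, feat_dim, hidden_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        # GRU parameters: update gate z, reset gate r, candidate state
        self.Wz = rng.normal(0, s, (hidden_dim, feat_dim))
        self.Uz = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.Wr = rng.normal(0, s, (hidden_dim, feat_dim))
        self.Ur = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.Wh = rng.normal(0, s, (hidden_dim, feat_dim))
        self.Uh = rng.normal(0, s, (hidden_dim, hidden_dim))
        # Q-head: linear projection of h_t to |A| action values
        self.Wq = rng.normal(0, s, (n_actions, hidden_dim))

    def step(self, phi_o, h_prev):
        z = sigmoid(self.Wz @ phi_o + self.Uz @ h_prev)   # update gate
        r = sigmoid(self.Wr @ phi_o + self.Ur @ h_prev)   # reset gate
        h_tilde = np.tanh(self.Wh @ phi_o + self.Uh @ (r * h_prev))
        h = (1 - z) * h_prev + z * h_tilde                # new belief state
        q = self.Wq @ h                                   # Q-values for all actions
        return q, h

# Roll the cell over a short observation sequence
cell = DRQNCell(feat_dim=8, hidden_dim=16, n_actions=4)
h = np.zeros(16)                    # hidden state zeroed at episode start
for t in range(5):
    phi_o = np.ones(8) * 0.1 * t    # stand-in for CNN features
    q, h = cell.step(phi_o, h)
print(q.shape)  # (4,)
```

In a real deployment the recurrence would be a framework LSTM/GRU layer trained end-to-end with the feature extractor; the point here is only the data flow: features in, belief state carried across steps, Q-values out.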

Algorithmic Enhancements

Modern DRQN deployments incorporate key advances for stability and performance:

  • Double Q-Learning: Decouples action selection and evaluation in the Bellman target to mitigate value overestimation. The Double DRQN (“DRDQN”) update for transition $(o_t, h_{t-1}, a_t, r_t, o_{t+1}, h_t)$ is:

$$y_t = r_t + \gamma\, Q(o_{t+1}, h_t, a^*; \theta^-), \qquad a^* = \arg\max_a Q(o_{t+1}, h_t, a; \theta)$$

(Moreno-Vera, 2019, Deshpande et al., 2020, Schulze et al., 2018)

  • Prioritized Experience Replay: Samples sequences or transitions based on TD-error magnitude for efficient learning. Weighting is applied during backpropagation for bias correction (Schulze et al., 2018).
  • Eligibility Traces: Forward-view $\lambda$-returns are used to accelerate credit assignment; e.g., $R_t^\lambda = (1-\lambda)\sum_{n=1}^{N} \lambda^{n-1} R_t^{(n)}$ (Harb et al., 2017).
  • Attention Mechanisms: Recent extensions (ARDDQN) apply an attention layer on RNN outputs before the Q-head to focus value estimation on salient temporal events (Kumar et al., 2024).
  • Snapshot Ensembling: Parameter snapshots from cyclical or annealed learning rates are averaged to form a more robust ensemble policy (Schulze et al., 2018).
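The double Q-learning target above is simple enough to show concretely. The sketch below computes $y_t$ for a single transition in plain NumPy; the Q-value arrays are made-up numbers for illustration.

```python
import numpy as np

def double_drqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double Q-learning target y_t for one transition.

    q_online_next: Q(o_{t+1}, h_t, .; theta)   -- selects the action a*
    q_target_next: Q(o_{t+1}, h_t, .; theta^-) -- evaluates a*
    """
    if done:
        return r
    a_star = int(np.argmax(q_online_next))     # argmax under the online net
    return r + gamma * q_target_next[a_star]   # evaluate under the target net

# The online net prefers action 1; the target net scores that action 2.0
y = double_drqn_target(r=1.0,
                       q_online_next=np.array([0.5, 3.0, 1.0]),
                       q_target_next=np.array([4.0, 2.0, 0.0]),
                       gamma=0.9)
print(y)  # 1.0 + 0.9 * 2.0 = 2.8
```

Note the effect of the decoupling: a single-network max over `q_target_next` would have picked action 0 and produced the larger target 1.0 + 0.9 · 4.0 = 4.6, which is exactly the overestimation the double update suppresses.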

Table 1. Key Architectural Components

Domain | Conv | Recurrence | Output Head
Atari (Hausknecht et al., 2015, Moreno-Vera, 2019) | Yes | LSTM (512) | FC to Q-values
Urban driving (Deshpande et al., 2020) | Yes | 2× LSTM (256+256) | FC, 4 Q-values
Multi-fog (Baek et al., 2020) | 1-D | GRU (128) | FC
UAV path planning (Kumar et al., 2024) | Yes | LSTM/GRU/BiLSTM | FC, per action

3. Training Regimen, Optimization, and Experience Replay

Sequence-based Experience Collection

Standard DQN's experience replay buffer is extended to accommodate sequences for truncated backpropagation through time (BPTT) in DRQN. Each replay batch typically contains $N$ sequences of fixed or variable length $L$, with the LSTM state zeroed at the start of each sampled sequence (Hausknecht et al., 2015, Harb et al., 2017).

  • Example: In urban autonomous driving, full episodes (up to 1000 steps) are sampled, and training runs 8-step sequences per episode, resetting LSTM states between batches (Deshpande et al., 2020).
  • In multi-fog scenarios, sliding-window histories of $T=10$ steps are concatenated and input to the GRU (Baek et al., 2020).
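A sequence-aware replay buffer of the kind described above can be sketched in a few lines. This is a generic, hypothetical implementation (not from any cited paper): it stores whole episodes and samples contiguous fixed-length slices, with the recurrent state understood to be re-initialised at the start of each sampled sequence.

```python
import random

class SequenceReplayBuffer:
    """Stores whole episodes; samples fixed-length sub-sequences for BPTT.

    The recurrent hidden state is zeroed at the start of each sampled
    sequence, matching the sequential-update scheme described above.
    """

    def __init__(self, capacity=1000):
        self.episodes = []
        self.capacity = capacity

    def add_episode(self, transitions):
        # transitions: list of (o, a, r, o_next, done) tuples
        self.episodes.append(transitions)
        if len(self.episodes) > self.capacity:
            self.episodes.pop(0)          # drop the oldest episode

    def sample(self, batch_size, seq_len):
        batch = []
        eligible = [ep for ep in self.episodes if len(ep) >= seq_len]
        for _ in range(batch_size):
            ep = random.choice(eligible)
            start = random.randrange(len(ep) - seq_len + 1)
            batch.append(ep[start:start + seq_len])
        return batch  # each item: contiguous seq_len-step slice of one episode

buf = SequenceReplayBuffer()
for _ in range(20):
    episode = [(f"o{t}", t % 3, 0.0, f"o{t+1}", t == 29) for t in range(30)]
    buf.add_episode(episode)
batch = buf.sample(batch_size=4, seq_len=8)
print(len(batch), len(batch[0]))  # 4 8
```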

Optimization

  • Optimizers: Adam, RMSProp, and ADADELTA are commonly used. Adam, often used with a learning rate of $10^{-3}$, yields particularly rapid and stable convergence when combined with $\lambda$-returns (Harb et al., 2017, Deshpande et al., 2020).
  • Gradient Clipping: Applied to LSTM parameters for stability; e.g., max norm of 10 (or element-wise to $[-1,1]$) (Hausknecht et al., 2015, Moreno-Vera, 2019).
  • Target Network Updates: Employed to stabilize Q-learning; periodic hard updates (e.g., every 10,000 steps) or soft updates (Polyak averaging) (Deshpande et al., 2020, Miranda et al., 2020).
  • Exploration: $\epsilon$-greedy, with schedules annealing $\epsilon$ from 1.0 to 0.1 (or 0.01), sometimes with restarts and decays per sub-task (Baek et al., 2020).

BPTT through unrolled sequences enables correct temporal credit assignment, and forward-view eligibility traces further speed up reward propagation (Harb et al., 2017).
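The forward-view $\lambda$-return from Section 2 can be computed directly from bootstrapped $n$-step returns. The sketch below follows the formula as stated in the text; note that for finite $N$ the weights $(1-\lambda)\lambda^{n-1}$ do not sum exactly to 1, and practical implementations often assign the residual weight $\lambda^{N-1}$ to the final $n$-step return. Reward and value arrays here are toy inputs.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return R_t^(n): n discounted rewards plus a bootstrapped tail."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, N, lam=0.9, gamma=0.99):
    """Forward-view lambda-return: (1-lam) * sum_{n=1..N} lam^(n-1) R_t^(n)."""
    return (1 - lam) * sum(
        lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, N + 1)
    )

# Toy example: unit rewards, zero bootstrap values, no discounting
rewards = [1.0] * 5
values = [0.0] * 6      # stand-ins for V(s_{t+n})
g = lambda_return(rewards, values, t=0, N=2, lam=0.5, gamma=1.0)
print(g)  # (1-0.5) * (1.0 * 1.0 + 0.5 * 2.0) = 1.0
```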

4. Empirical Findings and Comparative Analyses

Atari / ViZDoom (Games)

  • Performance: DRQN matches DQN on fully observed Atari games but demonstrates increased robustness under partial observability (e.g., flickering screens) and in longer-horizon credit assignment (Hausknecht et al., 2015). On memory-dependent tasks (e.g., Beam Rider, Tennis with delayed rewards), DRQN and DRQN+$\lambda$ reach target performance faster and more stably (Harb et al., 2017).
  • Double DQN and PER: Integration of double Q-learning yields additional stability and reduces overestimation. Prioritized replay accelerates early learning but may amplify value overestimation (Schulze et al., 2018).
  • Attention Mechanisms: In UAV coverage and data harvesting, the attention-augmented LSTM variant yields substantial coverage and landing gains compared to non-recurrent or non-attention models (Kumar et al., 2024).

Robotics and Control

  • Urban autonomous driving: DRQN outperforms rule-based policies in collision avoidance and distance before task termination. Empirical metrics show collision-free rates of 70% versus 40% for baseline, and longer average distances traveled (Deshpande et al., 2020).
  • Fog computing: DRQN (GRU-based) achieves higher task success rates and lower buffer overflows compared to DQN and DCQN. Under heavy load, improvements are 10–20% in main performance metrics (Baek et al., 2020).
  • Self-adaptive microservices: DRQN enables faster convergence, higher cumulative reward, and lower adaptation time than DQN, DDQN, policy gradient (PGNN), and DDPG in microservice architecture adaptation (Magableh, 2019).

Dialogue Systems and Multi-Agent Scenarios

  • Dialogue management: DRQN-based latent state encodings outperform flat RL and modular baselines in task completion rate and robustness under noisy slot recognition (Miranda et al., 2020, Zhao et al., 2016).
  • Multi-agent communication: Deep Distributed Recurrent Q-Networks (DDRQN) extend DRQN with weight sharing, agent IDs, last-action inputs, and on-policy updates, enabling agents to learn emergent communication protocols for complex riddles (Foerster et al., 2016).

5. Limitations, Domain Considerations, and Design Choices

DRQN’s main strength—learned unbounded memory—incurs practical trade-offs:

  • Sample Efficiency and Overfitting: DRQNs may require longer training and larger replay buffers; their over-parameterization can cause slower convergence in domains where task-relevant information is contained in a short observation window (Romac et al., 2019).
  • Sequence Length Sensitivity: Truncation length for BPTT is a critical hyperparameter—too short fails to propagate credit; too long leads to instability and unmanageable memory cost (Hausknecht et al., 2015, Miranda et al., 2020).
  • Stacked Frame vs. Recurrent Memory: DRQN excels when temporal dependencies exceed the capability of frame stacking. If POMDP history is shallow, stacked-frame DQNs often train faster and more stably (Romac et al., 2019).

Authors recommend evaluating the temporal horizon of relevant information and adopting attention or eligibility traces if credit assignment remains inefficient (Harb et al., 2017, Kumar et al., 2024).

6. Advanced Applications and Hybridizations

DRQN’s modular architecture supports augmentation and domain specialization:

  • Hybrid Supervised-RL Models: Supervised auxiliary losses on the RNN’s hidden state (predicting future user action/slots) accelerate convergence and improve robustness in dialogue systems (Miranda et al., 2020, Zhao et al., 2016).
  • Dyna-style Synthetic Experience: Mixes real and model-based synthetic transitions to cover rare slots and accelerate learning in dialogue tasks (Zhao et al., 2016).
  • Attention-based Recurrence: In ARDDQN, an attention mechanism over an RNN sequence allows the Q-head to selectively focus on time steps most relevant for UAV path planning and data harvesting metrics (Kumar et al., 2024).
  • Multi-agent Coordination: DDRQN demonstrates that weight sharing, agent identification, and last-action input are essential for emergence of communication protocols in decentralised settings (Foerster et al., 2016).

Adaptation of DRQN to actor-critic, prioritized replay, dueling networks, and hierarchical or hybrid architectures is suggested for further performance gains (Deshpande et al., 2020, Kumar et al., 2024).

7. Outlook and Future Directions

The evidence to date positions DRQN and its augmented variants as foundational tools for deep RL in partially observed environments whenever temporal credit assignment, occlusion, or information asymmetry precludes fully Markovian solutions. The choice of architecture, training regime, and hybridization must be matched to domain requirements to optimize performance, sample efficiency, and generalization.
