DARQN: Deep Attention Recurrent Q-Network
- DARQN is a reinforcement learning model that integrates soft and hard attention mechanisms with LSTM-based memory to focus on salient visual features.
- The architecture combines CNN-based feature extraction with an attention module that produces context vectors, guiding the recurrent Q-value estimation.
- Empirical results on Atari games demonstrate enhanced training efficiency, reduced parameterization, and improved interpretability compared to traditional DQN models.
The Deep Attention Recurrent Q-Network (DARQN) is an extension of the Deep Q-Network (DQN) architecture that incorporates "soft" and "hard" attention mechanisms within a recurrent reinforcement learning framework. Designed to operate on high-dimensional visual input, DARQN enables agents to dynamically focus on salient regions of an environment, providing both performance gains and interpretable decision-making artifacts. By coupling convolutional feature extraction, differentiable attention, and recurrent memory (LSTM), DARQN introduces a structured filter over input observations, yielding improvements in training efficiency, model compactness, and policy visualization (Sorokin et al., 2015).
1. Architectural Overview
DARQN processes each input time step through a sequence of modules: a convolutional neural network (CNN), an attention mechanism (soft or hard), and a recurrent Q-learning head. The typical flow is as follows:
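- Input: Each raw grayscale frame is passed into a CNN, producing $D$ feature maps of spatial size $M \times M$ (commonly $D = 256$, $M = 7$).
- Feature Extraction: The output maps are reshaped into $L = M \times M$ feature vectors $v_t^1, \dots, v_t^L$ with $v_t^i \in \mathbb{R}^D$.
- Attention: The attention module generates a context vector $z_t \in \mathbb{R}^D$:
  - Soft attention computes a weighted sum $z_t = \sum_{i=1}^{L} \alpha_t^i v_t^i$, where $\alpha_t^i$ are focus weights.
  - Hard attention samples a single location $i_t \in \{1, \dots, L\}$ and sets $z_t = v_t^{i_t}$.
- Glimpse Generation and Recurrence: A fully-connected layer maps $z_t$ to $g_t$, which is processed by an LSTM: $(h_t, c_t) = \mathrm{LSTM}(g_t, h_{t-1}, c_{t-1})$.
- Q-Value Estimation: The hidden state $h_t$ is used to predict action values via a linear head: $Q(s_t, a) = w_a^\top h_t + b_a$, for all $a \in \mathcal{A}$.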
This structure enables the recurrent module to utilize a sequence of attended glimpses, maintaining temporal memory throughout the agent's trajectory.
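The data flow above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the helper names (`to_feature_vectors`, `soft_context`, `hard_context`, `q_values`) and the toy sizes are assumptions, and the fully-connected glimpse layer and the LSTM are omitted, so a context vector stands in for the hidden state that feeds the linear Q head.

```python
def to_feature_vectors(maps):
    """Reshape D feature maps of size M x M into L = M*M vectors of length D."""
    d, m = len(maps), len(maps[0])
    return [[maps[k][r][c] for k in range(d)] for r in range(m) for c in range(m)]


def soft_context(vectors, weights):
    """Soft attention: context = sum over locations of weight_i * vector_i."""
    d = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(d)]


def hard_context(vectors, index):
    """Hard attention: context is the single feature vector at the chosen location."""
    return list(vectors[index])


def q_values(h, weight_rows, biases):
    """Linear head: Q(a) = w_a . h + b_a for every action a."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weight_rows, biases)]


# Toy sizes (assumed): D = 2 maps, each 2 x 2, so L = 4 locations.
maps = [[[1.0, 2.0], [3.0, 4.0]],
        [[10.0, 20.0], [30.0, 40.0]]]
vectors = to_feature_vectors(maps)

uniform = [0.25] * len(vectors)          # uniform focus weights
z_soft = soft_context(vectors, uniform)  # average of all locations
z_hard = hard_context(vectors, 2)        # only location 2 is kept

# Stand-in for the hidden state: the real model would pass the context
# through a fully-connected layer and an LSTM first.
q = q_values(z_hard, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5])
```

Both attention variants hand the recurrent module a single vector of length $D$ instead of all $L$ location vectors, which is where the reduction in effective input dimensionality comes from.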
2. Attention Mechanisms: Soft and Hard
DARQN supports two distinct attention paradigms:
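- Soft Attention: Fully differentiable. Attention weights $\alpha_t^i$ are derived by computing unnormalized attention scores:
  $$e_t^i = w^\top \tanh\left(W_v v_t^i + W_h h_{t-1} + b\right)$$
  and normalizing via softmax:
  $$\alpha_t^i = \frac{\exp(e_t^i)}{\sum_{j=1}^{L} \exp(e_t^j)}$$
  The weighted sum $z_t = \sum_{i=1}^{L} \alpha_t^i v_t^i$ combines spatial features into a context vector passed to the LSTM.
- Hard Attention: Involves stochastic sampling. At each step, a location $i_t$ is drawn from a learned policy $\pi_g(i_t \mid v_t, h_{t-1})$. The context is set as $z_t = v_t^{i_t}$. The policy parameters $\theta_g$ are trained by a REINFORCE-style gradient:
  $$\nabla_{\theta_g} J \approx (R_t - b_t)\, \nabla_{\theta_g} \log \pi_g(i_t \mid v_t, h_{t-1})$$
  where $R_t$ is the return and $b_t$ is a learned baseline.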
Both approaches reduce the effective dimensionality seen by the recurrent module, enabling more focused memory updates and enhancing interpretability.
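The soft-attention path can be sketched as follows. The additive scoring form used here (a linear projection of each feature vector plus a projection of the previous hidden state, a tanh, then a final linear map) is an assumption made for illustration, as are the parameter names `w_v`, `w_h`, `w_out` and all toy values.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def attention_scores(vectors, h_prev, w_v, w_h, w_out):
    """Unnormalized score per location: w_out . tanh(W_v v_i + W_h h_prev)."""
    h_proj = matvec(w_h, h_prev)
    scores = []
    for v in vectors:
        v_proj = matvec(w_v, v)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, h_proj)]
        scores.append(sum(w * x for w, x in zip(w_out, hidden)))
    return scores


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(vectors, weights):
    """Weighted sum of the location vectors."""
    d = len(vectors[0])
    return [sum(a * v[k] for a, v in zip(weights, vectors)) for k in range(d)]


# Toy example (all values assumed): L = 3 locations, D = 2 features.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [0.5, -0.5]
w_v = [[1.0, 0.0], [0.0, 1.0]]
w_h = [[1.0, 0.0], [0.0, 1.0]]
w_out = [1.0, 1.0]

scores = attention_scores(vectors, h_prev, w_v, w_h, w_out)
alpha = softmax(scores)
z = context_vector(vectors, alpha)
```

Because the weights are positive and sum to one, the context vector is always a convex combination of the location vectors, so every step of this path is differentiable with respect to the scoring parameters.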
3. Training Objective and Gradient Flow
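The DARQN agent is trained using the Bellman squared error objective: $\mathcal{L}(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\right)^2\right]$, where target networks with parameters $\theta^-$ provide stable Q-learning targets.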
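As a small worked example of this objective, the sketch below computes the squared Bellman error over a toy batch. It assumes the usual DQN conventions that are not spelled out in the text: `q_sa` comes from the online network, `next_q_target` from the target network, terminal transitions use the reward alone as the target, and the default discount of 0.99 is a placeholder.

```python
def td_target(reward, next_q_target, gamma, terminal):
    """Target y: the reward alone at episode end, else r + gamma * max_a' Q_target(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)


def bellman_squared_error(batch, gamma=0.99):
    """Mean of (y - Q(s, a))^2 over a batch of transitions."""
    total = 0.0
    for q_sa, reward, next_q_target, terminal in batch:
        y = td_target(reward, next_q_target, gamma, terminal)
        total += (y - q_sa) ** 2
    return total / len(batch)


# Each transition: (Q(s, a) from the online net, reward, target-net Q-values at s', terminal flag).
batch = [
    (1.0, 0.0, [2.0, 4.0], False),  # target uses the best next action value
    (3.0, 1.0, [9.0, 9.0], True),   # terminal: next-state values are ignored
]
loss = bellman_squared_error(batch, gamma=0.5)
```

In a full agent the target is treated as a constant, so gradients flow only through the online estimate of the chosen action's value.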
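For soft attention, gradients propagate through the attention weights $\alpha_t^i$ and scores $e_t^i$: $\frac{\partial \mathcal{L}}{\partial e_t^i} = \sum_{j=1}^{L} \frac{\partial \mathcal{L}}{\partial \alpha_t^j}\, \alpha_t^j \left(\delta_{ij} - \alpha_t^i\right)$, where $\frac{\partial \mathcal{L}}{\partial \alpha_t^j} = \left(\frac{\partial \mathcal{L}}{\partial z_t}\right)^\top v_t^j$ for context vector $z_t$ and feature vectors $v_t^j$, and $\delta_{ij}$ is the Kronecker delta.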
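The gradient through the softmax can be checked numerically. For a context vector $z = \sum_i \alpha_i v_i$ with $\alpha = \mathrm{softmax}(e)$ and any scalar loss with gradient $g$ with respect to $z$, the chain rule gives a score gradient of $\alpha_i (g^\top v_i - g^\top z)$. The sketch below compares that closed form with central finite differences; the stand-in loss $\tfrac{1}{2}\lVert z \rVert^2$ and all toy values are assumptions chosen only to make the check concrete.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context(scores, vectors):
    """Context vector: softmax-weighted sum of the location vectors."""
    alpha = softmax(scores)
    d = len(vectors[0])
    return [sum(a * v[k] for a, v in zip(alpha, vectors)) for k in range(d)]


def loss(scores, vectors):
    """Stand-in scalar loss 0.5 * ||z||^2, whose gradient w.r.t. z is z itself."""
    return 0.5 * sum(x * x for x in context(scores, vectors))


def analytic_grad(scores, vectors):
    """Closed form: dLoss/de_i = alpha_i * (g . v_i - g . z), with g = dLoss/dz."""
    alpha = softmax(scores)
    z = context(scores, vectors)
    g = z  # gradient of the stand-in loss with respect to z
    g_dot_z = sum(a * b for a, b in zip(g, z))
    return [a * (sum(p * q for p, q in zip(g, v)) - g_dot_z)
            for a, v in zip(alpha, vectors)]


def numeric_grad(scores, vectors, eps=1e-6):
    """Central finite differences on each score."""
    grads = []
    for i in range(len(scores)):
        up = list(scores)
        down = list(scores)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, vectors) - loss(down, vectors)) / (2 * eps))
    return grads


vectors = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
scores = [0.1, -0.3, 0.7]
analytic = analytic_grad(scores, vectors)
numeric = numeric_grad(scores, vectors)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The score gradients sum to zero, which reflects the fact that adding a constant to every score leaves the softmax output unchanged.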
Hard attention uses the REINFORCE estimator and a baseline for variance reduction. The gradients for both attention and LSTM modules are propagated by standard backpropagation through time (BPTT); all CNN, attention, and recurrent weights are updated accordingly.
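A minimal sketch of the REINFORCE update for hard attention follows, written directly over the location logits of a categorical policy. The running-average baseline, the step sizes, and the toy return (only one location is rewarded) are assumptions for illustration; the source states only that a learned baseline is used.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_location(probs, rng):
    """Draw one location index from the categorical policy."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_logit_grad(probs, index, ret, baseline):
    """(return - baseline) * d log pi(index) / d logit_j, where the
    log-probability gradient is 1[j == index] - pi_j."""
    advantage = ret - baseline
    return [advantage * ((1.0 if j == index else 0.0) - p)
            for j, p in enumerate(probs)]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
baseline = 0.0
step_size = 0.1      # assumed learning rate
baseline_rate = 0.1  # assumed baseline update rate

for _ in range(500):
    probs = softmax(logits)
    i = sample_location(probs, rng)
    ret = 1.0 if i == 2 else 0.0  # toy return: only location 2 is useful
    grad = reinforce_logit_grad(probs, i, ret, baseline)
    logits = [x + step_size * g for x, g in zip(logits, grad)]  # gradient ascent
    baseline += baseline_rate * (ret - baseline)                # running-average baseline

final_probs = softmax(logits)
```

Subtracting the baseline does not change the expected gradient, since the log-probability gradients average to zero under the policy, but it reduces the variance of each sampled update.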
4. Empirical Performance and Evaluation
DARQN was empirically evaluated on five Atari 2600 games (Breakout, Seaquest, Space Invaders, Tutankham, and Gopher), in direct comparison with the original DQN and the recurrent DRQN baseline. Results demonstrate that despite using approximately half the parameters of DQN, the soft-attention variant equaled or surpassed DQN in three out of five games. Notably:
| Game | DQN | DRQN | DARQN (soft) |
|---|---|---|---|
| Seaquest | 1,284 | 1,421 | 7,263 |
| Space Invaders | 916 | 571 | 650 |
| Gopher | 1,976 | 3,512 | 5,356 |
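Examples include Seaquest, where soft-attention DARQN achieved 7,263 versus DQN's 1,284 (roughly 5.7 times higher), and Gopher, where DARQN scored 5,356 versus DQN's 1,976 (roughly 2.7 times higher). On Space Invaders, by contrast, DQN remained ahead with 916 versus 650 (Sorokin et al., 2015).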
A plausible implication is that attentive filtering contributes to greater sample efficiency, reduced parameterization, and improved generalization on certain environments.
5. Interpretability and Visualization
An essential component of DARQN is its ability to provide interpretable attention maps. For each input frame, attention weights (soft) or selection indices (hard) can be overlaid, revealing the agent's spatial focus at each time step. In Breakout, the attention heatmap consistently tracks the ball, while in Seaquest, focus transitions from the oxygen gauge to the submarine, reflecting task-relevant priorities.
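One simple way to produce such an overlay is to upsample the per-location weights to the frame resolution and blend them with the image. The sketch below assumes a 7 x 7 attention grid and an 84 x 84 grayscale frame with pixel values in [0, 1]; these sizes, the nearest-neighbour upsampling, and the helper names are illustrative choices, not the visualization procedure of the original work.

```python
def upsample_attention(weights, grid, frame_size):
    """Nearest-neighbour upsampling of grid*grid weights to frame_size x frame_size."""
    if frame_size % grid != 0:
        raise ValueError("frame_size must be a multiple of grid")
    cell = frame_size // grid
    return [[weights[(r // cell) * grid + (c // cell)] for c in range(frame_size)]
            for r in range(frame_size)]


def overlay(frame, heatmap, strength=0.5):
    """Blend a grayscale frame with a heatmap rescaled so that its peak is 1.

    Assumes at least one positive weight, which holds for softmax outputs.
    """
    peak = max(max(row) for row in heatmap)
    return [[(1.0 - strength) * f + strength * (h / peak)
             for f, h in zip(f_row, h_row)]
            for f_row, h_row in zip(frame, heatmap)]


grid, frame_size = 7, 84
weights = [0.0] * (grid * grid)
weights[3 * grid + 4] = 0.6  # most attention on grid cell (row 3, column 4)
weights[3 * grid + 3] = 0.4  # the rest on its left neighbour

heatmap = upsample_attention(weights, grid, frame_size)
frame = [[0.0] * frame_size for _ in range(frame_size)]  # blank stand-in frame
blended = overlay(frame, heatmap)
```

For hard attention the same code applies with a one-hot weight list, which highlights a single grid cell per frame.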
The visualizations facilitate online monitoring of agent behavior and support debugging or failure analysis. For instance, in Seaquest the hard-attention variant occasionally fixates on irrelevant sprites and the agent fails to resurface, a behavior clearly diagnosed via attention overlays (Sorokin et al., 2015).
6. Significance and Connections
DARQN injects a differentiable filter gate into the deep Q-learning framework, creating a bridge between attention models and reinforcement learning. By focusing learning and memory resources on task-relevant visual regions, DARQN advances the study of interpretable agents and points toward architectures with lower complexity and improved sample efficiency. The modular attention-LSTM architecture foreshadows trends in integrating differentiable attention across temporal domains, with implications for both reinforcement learning and broader sequence modeling tasks (Sorokin et al., 2015).