Deep Spiking Q-Networks (DSQN)

Updated 26 February 2026

Deep Spiking Q-Networks (DSQN) are spiking neural architectures that merge biologically inspired spiking behaviors with deep Q-learning for robust and energy-efficient reinforcement learning.
DSQN employs LIF/IF neuron models, surrogate gradient learning, and conversion techniques to accurately estimate continuous Q-values and facilitate efficient hardware deployment.
Empirical results demonstrate DSQN achieves median human-normalized scores up to 193.5% and 24x–32x energy savings over traditional DQNs, validating its superior performance.

Deep Spiking Q-Networks (DSQN) are spiking neural architectures designed for reinforcement learning with function approximation in Q-learning settings, especially tailored for compatibility with neuromorphic hardware. DSQN combines biologically inspired spiking neuron dynamics—such as leaky integrate-and-fire (LIF) or integrate-and-fire (IF) models—with deep convolutional architectures and modern reinforcement learning algorithms. The aim is to achieve high performance with marked energy efficiency and robustness, enabling direct or conversion-based deployment on event-driven platforms for tasks ranging from high-dimensional vision-based control to low-latency robotics.

1. Spiking Neuron Models and Q-Value Representation

DSQN implementations are built on variants of the integrate-and-fire neuron, with both leaky (LIF) and non-leaky (IF) dynamics represented in the literature. The LIF neuron updates its membrane potential $V$ according to

$H_t = f(V_{t-1}, X_t), \quad S_t = \Theta(H_t - V_{th}), \quad V_t = H_t(1 - S_t) + V_{reset} S_t$

where $\Theta$ is the step function, $V_{th}$ and $V_{reset}$ are the membrane threshold and reset, and $f$ is the leaky update $f(V,X) = V + \frac{1}{\tau}[-(V - V_{reset}) + X]$ with time constant $\tau$ (Chen et al., 2022, Liu et al., 2021, Arfa et al., 31 Jul 2025).

Some DSQN variants deploy non-spiking leaky integrators (LI) in the output layer, using the post-simulation membrane voltage $V$ as a continuous Q-value decoder: $Q(s,a_j;\theta) = \max_{1 \le t \le T} V^j_t$ (Chen et al., 2022). Others use spike accumulation or average firing rates over $T$ steps, as in

$Q(s,a) = \frac{1}{T}\sum_{t=1}^T W^L S^{L-1,t}$

(Liu et al., 2021, Sun et al., 2022).

Advanced neuron models, such as ternary spiking with asymmetric thresholds, have been introduced to enhance representational capacity and reduce gradient estimation bias, mitigating vanishing gradient problems during surrogate gradient backpropagation (Ghoreishee et al., 3 Jun 2025).

2. Learning Algorithms and Credit Assignment

DSQN employs standard deep Q-learning, adapting the temporal-difference (TD) learning paradigm to the spiking, temporally extended setting. The TD target is given by

$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$

where $\gamma$ is the discount factor and $\theta^-$ are target network parameters (Chen et al., 2022, Liu et al., 2021, Devkota et al., 2024). The loss function is the mean squared Bellman error: $L(\theta) = \mathbb{E}_{s,a,r,s'}\left[(y - Q(s,a;\theta))^2\right]$ Learning is performed by unrolling the spiking network over $T$ timesteps and applying surrogate gradients for non-differentiable operations (e.g., $\Theta$ or ternary thresholds), typically with smooth arctan or related parameterizations (Chen et al., 2022, Liu et al., 2021, Ghoreishee et al., 3 Jun 2025). Both spatial and temporal credit assignment are realized through backpropagation-through-time (BPTT) or similar chain-rule expansions.

Direct spiking credit assignment supports both single-task and multi-task learning; in multi-task DSQN architectures, auxiliary contextual or dendritic gating (see below) routes gradient and activity information in a task-dependent manner to prevent catastrophic forgetting (Devkota et al., 2024).

3. Architectural Design and Input Encoding

Most DSQN architectures for high-dimensional visual tasks mirror canonical DQN topologies, replacing ReLU with spiking units and matching convolutional/fully connected dimensioning. The typical configuration for Atari tasks is:

Conv1: 32 × 8×8, stride 4, LIF/IF
Conv2: 64 × 4×4, stride 2, LIF/IF
Conv3: 64 × 3×3, stride 1, LIF/IF
FC1: 512 units, LIF/IF
FC2: $N_A$ units (actions), LI/non-spiking or spike sum

Input encoding strategies include:

Direct pixel intensities projected as constant currents (Chen et al., 2022, Sun et al., 2022)
Rate-coded Poisson or Bernoulli spike trains (Ghoreishee et al., 3 Jun 2025, Arfa et al., 31 Jul 2025)
Two-neuron signed encoding for continuous features (Arfa et al., 31 Jul 2025)

Output decoding varies by implementation: continuous Q-values may be read from non-spiking units or as time-averaged spike rates; some variants take the maximum membrane voltage over the simulation window (Chen et al., 2022).

Network extensions leveraging active dendrites and context-dependent gating enable multi-task learning and per-task specialization, with task context signals modulating the effective receptive field and gating synaptic integration (Devkota et al., 2024). Dueling architectures (splitting value and advantage streams) are also supported in advanced multi-task DSQN variants.

4. Training Paradigms and Optimization

End-to-end DSQN training encompasses:

Replay-based off-policy Q-learning with $\epsilon$ -greedy exploration
Surrogate-gradient learning (arctan, sigmoid, or custom derivatives)
Experience replay buffers ranging from $10^4$ to $10^6$ transitions
Target network updates every $10^4$ steps
Adam optimization with learning rates $10^{-4}$ -- $2.5\times10^{-4}$
Simulation windows $T$ from 8 (for energy efficiency) up to 500 (for conversion methods)

Advanced DSQN techniques incorporate Potential Based Layer Normalization (pbLN), which rescales post-synaptic potentials within each layer to counteract spike feature information vanishing caused by exponential decay of spike variance in deep architectures (Sun et al., 2022). This normalization is essential for maintaining persistent spiking activity across deep layers and yielding stable RL learning dynamics.

Quantization and hardware-aware fine-tuning, especially for embedded deployment on neuromorphic chips (e.g., SpiNNaker2), are performed post-training via 8-bit weight mapping and threshold retuning, preserving spike dynamics and task performance while optimizing for low-power inference (Arfa et al., 31 Jul 2025).

5. Conversion vs. Direct Learning: Methodological Approaches

There are two principal ways to obtain DSQNs: a) Conversion: Train a standard deep Q-network, then convert it to a spiking equivalent by replacing non-linearities, normalizing weights via percentile-based scaling, and simulating the spiking dynamics for $T$ steps per action (Tan et al., 2020, Patel et al., 2019). Critical factors in conversion fidelity include

Choice of simulation steps $T$ (typically $T=500$ for high-fidelity)
Percentile-based normalization parameters (optimal $p$ in $[99.9,99.99]$ )
Robust spike-count plus end-voltage readout for action selection

b) Direct End-to-End Spiking Learning: Train the spiking network from scratch with surrogate-gradient methods, leveraging the LIF or advanced neuron models and performing credit assignment directly in the spike-based domain (Chen et al., 2022, Liu et al., 2021, Sun et al., 2022, Ghoreishee et al., 3 Jun 2025). Direct methods enable the exploitation of spiking temporal code, avoid artifacts from ANN conversion, and generally provide better robustness and energy efficiency.

6. Empirical Performance, Robustness, and Hardware Deployment

Benchmark Results:

DSQN has been evaluated extensively on 17 Atari 2600 titles and classical control tasks. Key findings include:

DSQN matches or outperforms DQN on the majority of tasks, e.g., median human-normalized score: DQN 142.8%, ANN-SNN 96.2%, DSQN 193.5% (Chen et al., 2022), with up to 106.3% DQN score attained (Liu et al., 2021).
On multi-task Atari benchmarks with active dendritic modulation, DSQN variants achieve or exceed human-level performance per game (Devkota et al., 2024).
Converted DSQNs maintain policy alignment with original DQN ( $>95\%$ agreement) when using robust readouts and sufficient simulation time (Tan et al., 2020).

Robustness:

DSQN demonstrates marked improvements in resilience to adversarial attacks and input occlusion:

White-box FGSM attack: DQN average score decay 30–99%, DSQN 0–75%, often under 20% (Chen et al., 2022).
Occlusion tests: SNN performance degrades gracefully, ReLU DQN collapses under localized corruption (Patel et al., 2019).

Energy Efficiency and Hardware Deployment:

Theoretical energy savings up to $96\%$ vs. DQN on 45 nm CMOS ( $E_{DSQN}/E_{DQN}\approx 4\%$ ) (Chen et al., 2022); empirically, SpiNNaker2 deployment yields $24\times$ – $32\times$ lower energy per episode at equivalent latency to GPU (Arfa et al., 31 Jul 2025).
Fully spiking and ternary DSQNs are deployable on neuromorphic platforms (Loihi, TrueNorth, DYNAPs, SpiNNaker2), supporting ultra-low power, real-time embedded RL (Arfa et al., 31 Jul 2025, Ghoreishee et al., 3 Jun 2025, Tan et al., 2020).

7. Limitations, Extensions, and Open Questions

Despite demonstrated strengths, DSQN presents several current challenges:

Sample inefficiency relative to state-of-the-art RL (e.g., requiring $20$ M frames for Atari convergence) (Chen et al., 2022)
Hyperparameter sensitivity: simulation window $T$ , time constant $\tau$ , and normalization parameters must be tuned per task (Sun et al., 2022, Chen et al., 2022)
Theoretical properties of surrogate-gradient-based spike TD-learning remain incompletely understood (Chen et al., 2022, Ghoreishee et al., 3 Jun 2025)
Current methods focus on discrete action spaces; extensions to continuous action RL and on-chip online learning are largely unexplored (Chen et al., 2022)

Notable research extensions include:

Multi-task DSQNs with active dendritic gating and dueling stream architectures, highly effective at mitigating catastrophic forgetting (Devkota et al., 2024)
Asymmetric ternary neuron models for improved representation capacity and gradient flow in deep RL settings (Ghoreishee et al., 3 Jun 2025)
Potential based layer normalization for robust deep network training (Sun et al., 2022)

In sum, Deep Spiking Q-Networks synthesize biologically inspired computation and modern RL, enabling scalable, robust, and energy-efficient decision-making on neuromorphic hardware, with ongoing advances in network expressivity, training stability, and agent generality (Chen et al., 2022, Liu et al., 2021, Devkota et al., 2024, Ghoreishee et al., 3 Jun 2025, Tan et al., 2020, Sun et al., 2022, Arfa et al., 31 Jul 2025, Patel et al., 2019).