
Deep Spiking Q-Networks (DSQN)

Updated 26 February 2026
  • Deep Spiking Q-Networks (DSQN) are spiking neural architectures that merge biologically inspired spiking behaviors with deep Q-learning for robust and energy-efficient reinforcement learning.
  • DSQN employs LIF/IF neuron models, surrogate gradient learning, and conversion techniques to accurately estimate continuous Q-values and facilitate efficient hardware deployment.
  • Empirical results show DSQN achieving median human-normalized scores up to 193.5% and 24x–32x energy savings over conventional DQNs.

Deep Spiking Q-Networks (DSQN) are spiking neural architectures designed for reinforcement learning with function approximation in Q-learning settings, especially tailored for compatibility with neuromorphic hardware. DSQN combines biologically inspired spiking neuron dynamics—such as leaky integrate-and-fire (LIF) or integrate-and-fire (IF) models—with deep convolutional architectures and modern reinforcement learning algorithms. The aim is to achieve high performance with marked energy efficiency and robustness, enabling direct or conversion-based deployment on event-driven platforms for tasks ranging from high-dimensional vision-based control to low-latency robotics.

1. Spiking Neuron Models and Q-Value Representation

DSQN implementations are built on variants of the integrate-and-fire neuron, with both leaky (LIF) and non-leaky (IF) dynamics represented in the literature. The LIF neuron updates its membrane potential $V$ according to

$$H_t = f(V_{t-1}, X_t), \quad S_t = \Theta(H_t - V_{th}), \quad V_t = H_t(1 - S_t) + V_{reset}\, S_t$$

where $\Theta$ is the Heaviside step function, $V_{th}$ and $V_{reset}$ are the firing threshold and reset potential, and $f$ is the leaky update $f(V, X) = V + \frac{1}{\tau}\left[-(V - V_{reset}) + X\right]$ with membrane time constant $\tau$ (Chen et al., 2022, Liu et al., 2021, Arfa et al., 31 Jul 2025).
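A minimal NumPy sketch of one discrete-time LIF update under these equations (parameter values are illustrative, not taken from any of the cited papers):

```python
import numpy as np

def lif_step(v_prev, x_t, tau=2.0, v_th=1.0, v_reset=0.0):
    """One discrete-time LIF update: charge, fire, hard reset."""
    # Leaky integration: H_t = f(V_{t-1}, X_t)
    h_t = v_prev + (1.0 / tau) * (-(v_prev - v_reset) + x_t)
    # Spike generation: S_t = Theta(H_t - V_th)
    s_t = (h_t >= v_th).astype(h_t.dtype)
    # Hard reset: V_t = H_t (1 - S_t) + V_reset S_t
    v_t = h_t * (1.0 - s_t) + v_reset * s_t
    return v_t, s_t
```

A sub-threshold input leaves the potential charging toward the threshold, while a supra-threshold input emits a spike and resets the potential to $V_{reset}$.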

Some DSQN variants deploy non-spiking leaky integrators (LI) in the output layer, using the post-simulation membrane voltage $V$ as a continuous Q-value decoder: $Q(s, a_j; \theta) = \max_{1 \le t \le T} V_t^j$ (Chen et al., 2022). Others use spike accumulation or average firing rates over $T$ steps, as in

$$Q(s, a) = \frac{1}{T} \sum_{t=1}^{T} W^L S^{L-1,t}$$

(Liu et al., 2021, Sun et al., 2022).
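Both readout schemes can be sketched in NumPy (shapes and function names are assumptions for illustration):

```python
import numpy as np

def q_from_max_voltage(v_trace):
    """Max-voltage readout: Q(s, a_j) = max_{1<=t<=T} V_t^j.
    v_trace: membrane voltages of non-spiking output units, shape (T, n_actions)."""
    return v_trace.max(axis=0)

def q_from_firing_rate(spikes, w_out):
    """Rate readout: Q(s, a) = (1/T) * sum_t W^L S^{L-1,t}.
    spikes: binary spike trains of the last hidden layer, shape (T, n_hidden);
    w_out: output weight matrix W^L, shape (n_actions, n_hidden)."""
    T = spikes.shape[0]
    return (spikes @ w_out.T).sum(axis=0) / T
```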

Advanced neuron models, such as ternary spiking with asymmetric thresholds, have been introduced to enhance representational capacity and reduce gradient estimation bias, mitigating vanishing gradient problems during surrogate gradient backpropagation (Ghoreishee et al., 3 Jun 2025).

2. Learning Algorithms and Credit Assignment

DSQN employs standard deep Q-learning, adapting the temporal-difference (TD) learning paradigm to the spiking, temporally extended setting. The TD target is given by

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

where $\gamma$ is the discount factor and $\theta^-$ are the target network parameters (Chen et al., 2022, Liu et al., 2021, Devkota et al., 2024). The loss function is the mean squared Bellman error:

$$L(\theta) = \mathbb{E}_{s,a,r,s'}\left[(y - Q(s, a; \theta))^2\right]$$

Learning is performed by unrolling the spiking network over $T$ timesteps and applying surrogate gradients for non-differentiable operations (e.g., $\Theta$ or ternary thresholds), typically with smooth arctan or related parameterizations (Chen et al., 2022, Liu et al., 2021, Ghoreishee et al., 3 Jun 2025). Both spatial and temporal credit assignment are realized through backpropagation through time (BPTT) or similar chain-rule expansions.
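The TD target and Bellman loss translate directly into a few lines (a NumPy sketch with hypothetical names; the spiking forward pass that would produce `q_pred` is omitted):

```python
import numpy as np

def td_target(r, q_next_target, gamma=0.99, done=False):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with bootstrapping
    cut off at terminal transitions."""
    return r + (0.0 if done else gamma * float(np.max(q_next_target)))

def bellman_loss(q_pred, actions, targets):
    """Mean squared Bellman error over a batch: E[(y - Q(s, a))^2].
    q_pred: (B, n_actions) Q-estimates; actions: (B,) taken actions;
    targets: (B,) TD targets y."""
    q_taken = q_pred[np.arange(len(actions)), actions]
    return float(np.mean((targets - q_taken) ** 2))
```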

Direct spiking credit assignment supports both single-task and multi-task learning; in multi-task DSQN architectures, auxiliary contextual or dendritic gating (see below) routes gradient and activity information in a task-dependent manner to prevent catastrophic forgetting (Devkota et al., 2024).

3. Architectural Design and Input Encoding

Most DSQN architectures for high-dimensional visual tasks mirror canonical DQN topologies, replacing ReLU with spiking units and matching convolutional/fully connected dimensioning. The typical configuration for Atari tasks is:

  • Conv1: 32 × 8×8, stride 4, LIF/IF
  • Conv2: 64 × 4×4, stride 2, LIF/IF
  • Conv3: 64 × 3×3, stride 1, LIF/IF
  • FC1: 512 units, LIF/IF
  • FC2: $N_A$ units (one per action), LI/non-spiking or spike sum
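This layer dimensioning can be sanity-checked with valid-convolution arithmetic, assuming the canonical 84x84 preprocessed Atari input (the input size is an assumption, not stated above):

```python
def conv_out(size, kernel, stride):
    """Output size of a valid (no-padding) convolution along one dimension."""
    return (size - kernel) // stride + 1

# Trace an 84x84 frame through the three convolutional layers listed above.
s = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    s = conv_out(s, kernel, stride)
flat = 64 * s * s  # Conv3 channels times final spatial map, feeding FC1
```

With an 84x84 input this yields a 7x7x64 feature map, i.e. 3136 inputs to the 512-unit FC1 layer, matching the canonical DQN topology.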

Input encoding strategies vary across implementations; typical choices in the SNN literature include direct (analog-current) injection of normalized pixel intensities into the first spiking layer and stochastic rate coding (e.g., Poisson spike trains).

Output decoding varies by implementation: continuous Q-values may be read from non-spiking units or as time-averaged spike rates; some variants take the maximum membrane voltage over the simulation window (Chen et al., 2022).

Network extensions leveraging active dendrites and context-dependent gating enable multi-task learning and per-task specialization, with task context signals modulating the effective receptive field and gating synaptic integration (Devkota et al., 2024). Dueling architectures (splitting value and advantage streams) are also supported in advanced multi-task DSQN variants.

4. Training Paradigms and Optimization

End-to-end DSQN training encompasses:

  • Replay-based off-policy Q-learning with $\epsilon$-greedy exploration
  • Surrogate-gradient learning (arctan, sigmoid, or custom derivatives)
  • Experience replay buffers ranging from $10^4$ to $10^6$ transitions
  • Target network updates every $10^4$ steps
  • Adam optimization with learning rates from $10^{-4}$ to $2.5 \times 10^{-4}$
  • Simulation windows $T$ from 8 (for energy efficiency) up to 500 (for conversion methods)
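The replay and exploration components above are independent of the spiking machinery; a minimal sketch (class and function names hypothetical, capacities per the ranges listed):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform-sampling experience replay; old transitions are evicted FIFO."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        # transition: (state, action, reward, next_state, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the argmax-Q action with prob 1 - epsilon, else a uniform action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```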

Advanced DSQN techniques incorporate Potential Based Layer Normalization (pbLN), which rescales post-synaptic potentials within each layer to counteract spike feature information vanishing caused by exponential decay of spike variance in deep architectures (Sun et al., 2022). This normalization is essential for maintaining persistent spiking activity across deep layers and yielding stable RL learning dynamics.
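The exact pbLN transform is specified in Sun et al. (2022). As an illustrative sketch of the underlying idea (rescaling post-synaptic potentials within a layer so their variance does not decay with depth), a layer-norm-style version might look like:

```python
import numpy as np

def normalize_potentials(h, gain=1.0, bias=0.0, eps=1e-5):
    """Rescale post-synaptic potentials H_t within a layer before
    thresholding, keeping their scale stable across deep layers.
    Illustrative only; the precise pbLN statistics and learned
    parameters follow Sun et al. (2022)."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gain * (h - mean) / np.sqrt(var + eps) + bias
```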

Quantization and hardware-aware fine-tuning, especially for embedded deployment on neuromorphic chips (e.g., SpiNNaker2), are performed post-training via 8-bit weight mapping and threshold retuning, preserving spike dynamics and task performance while optimizing for low-power inference (Arfa et al., 31 Jul 2025).
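A minimal sketch of symmetric 8-bit weight mapping with a matching threshold rescale (function names and the per-tensor scheme are assumptions; the actual deployment pipeline follows Arfa et al., 31 Jul 2025):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor mapping of float weights into [-127, 127];
    the scale is kept for dequantization and threshold retuning."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def retune_threshold(v_th, scale):
    """Rescale the firing threshold so that integer-weight accumulation
    fires on the same inputs as the float model (illustrative)."""
    return v_th / scale
```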

5. Conversion vs. Direct Learning: Methodological Approaches

There are two principal ways to obtain DSQNs.

a) Conversion: Train a standard deep Q-network, then convert it to a spiking equivalent by replacing non-linearities, normalizing weights via percentile-based scaling, and simulating the spiking dynamics for $T$ steps per action (Tan et al., 2020, Patel et al., 2019). Critical factors in conversion fidelity include:

  • Choice of simulation steps $T$ (typically $T = 500$ for high fidelity)
  • Percentile-based normalization parameters (optimal $p \in [99.9, 99.99]$)
  • Robust spike-count plus end-voltage readout for action selection
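A sketch of percentile-based weight normalization under the usual layer-wise scheme (helper names are hypothetical, and the exact scaling rule in Tan et al., 2020 and Patel et al., 2019 may differ in detail):

```python
import numpy as np

def percentile_scales(layer_activations, p=99.9):
    """Per-layer activation scales: the p-th percentile of each layer's
    ReLU activations recorded on a calibration set (p in [99.9, 99.99])."""
    return [float(np.percentile(a, p)) for a in layer_activations]

def normalize_weights(weights, scales):
    """Rescale layer l's weights by scale_{l-1} / scale_l so the converted
    spiking layers stay within the firing threshold's dynamic range."""
    normed, prev = [], 1.0
    for w, s in zip(weights, scales):
        normed.append(w * prev / s)
        prev = s
    return normed
```

Using a high percentile rather than the activation maximum makes the scales robust to outlier activations, which is why $p$ near 99.9 rather than 100 is reported as optimal.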

b) Direct End-to-End Spiking Learning: Train the spiking network from scratch with surrogate-gradient methods, leveraging the LIF or advanced neuron models and performing credit assignment directly in the spike-based domain (Chen et al., 2022, Liu et al., 2021, Sun et al., 2022, Ghoreishee et al., 3 Jun 2025). Direct methods enable the exploitation of spiking temporal code, avoid artifacts from ANN conversion, and generally provide better robustness and energy efficiency.

6. Empirical Performance, Robustness, and Hardware Deployment

Benchmark Results:

DSQN has been evaluated extensively on 17 Atari 2600 titles and classical control tasks. Key findings include:

  • DSQN matches or outperforms DQN on the majority of tasks, e.g., median human-normalized score: DQN 142.8%, ANN-SNN 96.2%, DSQN 193.5% (Chen et al., 2022), with DSQN attaining up to 106.3% of the DQN score in a separate evaluation (Liu et al., 2021).
  • On multi-task Atari benchmarks with active dendritic modulation, DSQN variants achieve or exceed human-level performance per game (Devkota et al., 2024).
  • Converted DSQNs maintain policy alignment with the original DQN ($>95\%$ agreement) when using robust readouts and sufficient simulation time (Tan et al., 2020).

Robustness:

DSQN demonstrates marked improvements in resilience to adversarial attacks and input occlusion:

  • White-box FGSM attack: DQN average score decay 30–99%, DSQN 0–75%, often under 20% (Chen et al., 2022).
  • Occlusion tests: SNN performance degrades gracefully, ReLU DQN collapses under localized corruption (Patel et al., 2019).
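The white-box FGSM perturbation used in these robustness tests has a standard one-line form (a generic sketch; `epsilon` and the $[0, 1]$ pixel range are assumptions):

```python
import numpy as np

def fgsm_perturb(state, grad_wrt_state, epsilon=0.01):
    """Fast Gradient Sign Method: x_adv = clip(x + eps * sign(dL/dx)),
    pushing the input in the direction that most increases the loss."""
    return np.clip(state + epsilon * np.sign(grad_wrt_state), 0.0, 1.0)
```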

Energy Efficiency and Hardware Deployment:

Event-driven, sparse spiking computation underlies DSQN's efficiency: reported savings reach 24x–32x relative to equivalent DQN inference, and 8-bit quantized DSQNs have been deployed on neuromorphic platforms such as SpiNNaker2 for low-power embedded control (Arfa et al., 31 Jul 2025).

7. Limitations, Extensions, and Open Questions

Despite demonstrated strengths, DSQN presents several current challenges, including the latency/accuracy trade-off governed by the simulation window $T$, gradient estimation bias and vanishing gradients under surrogate-gradient BPTT, and the attenuation of spike feature information in deep layers.

Notable research extensions include:

  • Multi-task DSQNs with active dendritic gating and dueling stream architectures, highly effective at mitigating catastrophic forgetting (Devkota et al., 2024)
  • Asymmetric ternary neuron models for improved representation capacity and gradient flow in deep RL settings (Ghoreishee et al., 3 Jun 2025)
  • Potential based layer normalization for robust deep network training (Sun et al., 2022)

In sum, Deep Spiking Q-Networks synthesize biologically inspired computation and modern RL, enabling scalable, robust, and energy-efficient decision-making on neuromorphic hardware, with ongoing advances in network expressivity, training stability, and agent generality (Chen et al., 2022, Liu et al., 2021, Devkota et al., 2024, Ghoreishee et al., 3 Jun 2025, Tan et al., 2020, Sun et al., 2022, Arfa et al., 31 Jul 2025, Patel et al., 2019).
