
Deep Spiking Q-Network (DSQN)

Updated 25 April 2026
  • DSQN is a deep reinforcement learning framework that replaces traditional neurons with leaky integrate-and-fire spiking models to mimic biological computation.
  • It achieves energy efficiency and robust performance on complex control tasks by leveraging neuromorphic hardware and advanced spike-based encoding.
  • The framework employs surrogate backpropagation and normalization techniques to address non-differentiability and training challenges inherent in spiking networks.

A Deep Spiking Q-Network (DSQN) is a Deep Reinforcement Learning (DRL) architecture that integrates spiking neural networks (SNNs)—notably, biologically inspired leaky integrate-and-fire (LIF) neuron models—with the Deep Q-Network (DQN) algorithm. DSQNs target energy-efficient, event-driven computation suitable for deployment on neuromorphic hardware, and aim to match or exceed the performance of conventional DQN agents in high-dimensional domains such as Atari 2600 and complex control tasks (Chen et al., 2022, Liu et al., 2021, Tan et al., 2020). Modern DSQN frameworks encompass both direct end-to-end spiking RL and ANN-to-SNN conversion, and address challenges of sparsity, non-differentiability, value representation, information loss, task adaptation, and network stability.

1. Network Architecture and Neuron Models

DSQN architectures universally derive from the canonical DQN backbone—composed of convolutional layers followed by fully connected layers—but swap ReLU or other artificial neuron types for spiking units, primarily LIF neurons (Liu et al., 2021, Chen et al., 2022, Sun et al., 2022). The standard convolutional trunk consists of three layers with increasing filter counts (e.g., 32×8×8, 64×4×4, 64×3×3), transitioning to a 512-unit FC layer, and terminating in an output layer whose width matches the valid action set (Tan et al., 2020, Liu et al., 2021).

The membrane potential dynamics for LIF neurons are

U^{l,t} = V^{l,t-1} + \frac{1}{\tau_m}\left(W^{l} S^{l-1,t} - V^{l,t-1} + V_r\right)

followed by threshold-based spike emission and a potential reset or soft subtraction (Liu et al., 2021). Some frameworks instead replace ReLU with IF (non-leaky) models for compatibility with ANN-to-SNN conversion methods (Tan et al., 2020). Task-dependent or multi-modal DSQNs further generalize the neuron model to include active dendrites and context gating (Devkota et al., 2024). Input encoding strategies include direct rate-based spike trains from normalized pixel inputs, population-based or fuzzy coding to mitigate information loss (Ghoreishee et al., 6 Apr 2026), and Bernoulli spike sampling (Ghoreishee et al., 3 Jun 2025).
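The LIF update above can be sketched as a single discrete timestep. This is a minimal illustration of the dynamics, not any cited implementation; the function name, hard-reset choice, and default constants are assumptions.

```python
import numpy as np

def lif_step(v_prev, s_in, W, tau_m=2.0, v_rest=0.0, v_thr=1.0):
    """One LIF timestep: integrate, spike on threshold, hard reset.

    v_prev: membrane potentials after the previous step, shape (n,)
    s_in:   binary spikes from the previous layer, shape (m,)
    W:      synaptic weights, shape (n, m)
    """
    # U^{l,t} = V^{l,t-1} + (1/tau_m) * (W s - V^{l,t-1} + V_r)
    u = v_prev + (W @ s_in - v_prev + v_rest) / tau_m
    spikes = (u >= v_thr).astype(float)   # Heaviside threshold
    v_new = u * (1.0 - spikes)            # hard reset to zero on spike
    return spikes, v_new
```

A soft-subtraction variant would use `v_new = u - v_thr * spikes` instead, which conversion-based pipelines often prefer because it preserves residual potential.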

Recent innovations introduce ternary spiking neurons with asymmetric thresholds to increase representational entropy and improve gradient flow compared to binary spiking (Ghoreishee et al., 3 Jun 2025).

2. Value Representation and Q-Readout

A central challenge in DSQN design is the accurate recovery of continuous Q-values from spike-based representations. Standard approaches average spike counts over a simulation window for each output neuron: Q(s,a;\theta) = \frac{1}{T}\sum_{t=1}^{T} W^{L} S^{L-1,t} (Liu et al., 2021). In end-to-end variants, the final readout may use non-spiking “leaky integrate” layers that accumulate membrane potential without thresholding—providing a robust, continuous-valued Q(s,a) (Chen et al., 2022, Devkota et al., 2024).
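The spike-averaging readout amounts to a weighted sum of output-layer spikes divided by the window length. A minimal sketch (names and shapes are illustrative):

```python
import numpy as np

def q_readout(spike_trains, W_out):
    """Average the weighted penultimate-layer spikes over a T-step window.

    spike_trains: array (T, m) of binary spikes S^{L-1,t}
    W_out:        readout weights W^L, shape (n_actions, m)
    Returns Q-value estimates, shape (n_actions,).
    """
    T = spike_trains.shape[0]
    # Q(s,a) = (1/T) * sum_t W^L S^{L-1,t}
    return (spike_trains @ W_out.T).sum(axis=0) / T
```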

For conversion-based DSQNs, robust output representations are necessary to suppress discretization-induced jitter; one prominent solution is

f_\mathrm{last}(t) = r_\mathrm{last}(t) + \frac{V_\mathrm{last}(t)}{t\, V_{thr}}

which cancels residual membrane errors and aligns the SNN's argmax action selection with that of the source DQN (Tan et al., 2020).
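The corrected output representation is a one-line adjustment to the raw firing-rate estimate; a sketch under assumed names:

```python
def corrected_rate(spike_count, v_last, t, v_thr=1.0):
    """Firing-rate estimate with residual-membrane correction:

        f_last(t) = r_last(t) + V_last(t) / (t * V_thr)

    spike_count: output spikes observed so far
    v_last:      current membrane potential of the output neuron
    t:           elapsed simulation steps
    """
    r = spike_count / t               # raw rate r_last(t)
    return r + v_last / (t * v_thr)   # add the sub-threshold residue
```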

Some architectures reconstruct Q-values from a population of spike counts using a neural decoder, trained jointly with the SNN core (Ghoreishee et al., 6 Apr 2026). This approach leverages expressivity and reclaims information lost in sparse spike regimes.

3. Training Algorithms and Surrogate Backpropagation

DSQN training leverages the standard DQN objective L(\theta) = \mathbb{E}_{s,a,r,s'}\left[ r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \right]^2 with experience replay, target-network updates, and ε-greedy exploration (Chen et al., 2022, Liu et al., 2021).
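The squared TD error for a single transition can be written out directly; this toy function only illustrates the loss term above, with all names assumed:

```python
def td_loss(q_sa, q_next_max, r, gamma=0.99, done=False):
    """Squared TD error for one transition:

        (r + gamma * max_a' Q(s',a'; theta^-) - Q(s,a; theta))^2

    q_sa:       Q(s,a) from the online network
    q_next_max: max_a' Q(s',a') from the target network (theta^-)
    done:       if the episode terminated, bootstrap term is dropped
    """
    target = r if done else r + gamma * q_next_max
    return (target - q_sa) ** 2
```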

The principal technical obstacle is the non-differentiability of spike generation. DSQNs universally adopt surrogate gradient methods, replacing the discrete Heaviside threshold with smooth approximations such as the arctangent: \sigma(x) = \frac{1}{\pi}\arctan(\pi x) + \frac{1}{2}, \qquad \sigma'(x) = \frac{1}{1 + (\pi x)^2} (Chen et al., 2022, Liu et al., 2021).
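In practice the forward pass keeps the hard threshold while the backward pass substitutes the smooth derivative. The pairing can be sketched numerically (in a framework such as PyTorch this would live in a custom autograd function; here the functions just evaluate the formulas above):

```python
import numpy as np

def heaviside(x):
    """Forward pass: hard spike / no-spike decision."""
    return (np.asarray(x) >= 0).astype(float)

def surrogate_sigma(x):
    """Smooth arctangent stand-in: sigma(x) = arctan(pi x)/pi + 1/2."""
    return np.arctan(np.pi * x) / np.pi + 0.5

def surrogate_grad(x):
    """Backward pass: sigma'(x) = 1 / (1 + (pi x)^2)."""
    return 1.0 / (1.0 + (np.pi * x) ** 2)
```

Note the surrogate gradient peaks at the threshold (x = 0) and decays away from it, so weight updates concentrate on neurons near firing.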

Training proceeds as backpropagation through time (BPTT) unrolled over the SNN simulation window, updating all weights via surrogate gradients of the spiking nonlinearity. Some DSQNs further implement advanced normalization (potential-based layer normalization, pbLN (Sun et al., 2022)) to mitigate feature and spike-variance collapse with network depth.

Input and parameter normalization is critical in conversion pipelines, ensuring firing rates in each SNN layer faithfully approximate original ANN activations over finite time windows (Tan et al., 2020).
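One common data-based scheme rescales each layer's weights and biases by observed maximum activations so that firing rates stay within the representable range; this sketch is an assumption about the general technique, not the specific procedure of Tan et al. (2020):

```python
import numpy as np

def normalize_layer_weights(W, b, prev_max, cur_max):
    """Data-based normalization for ANN-to-SNN conversion.

    W, b:     the ANN layer's weights and biases
    prev_max: maximum activation observed at the previous layer
    cur_max:  maximum activation observed at this layer
    Rescales so the layer's peak activation maps to the firing-rate ceiling.
    """
    W_norm = W * (prev_max / cur_max)   # undo upstream scaling, apply local
    b_norm = b / cur_max                # biases scale by this layer only
    return W_norm, b_norm
```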

4. Innovations in Information Encoding and Representation Capacity

DSQNs have recently incorporated several developments to overcome limitations imposed by direct rate encoding and the sparsity of spikes. Fuzzy encoder–decoder architectures utilize trainable membership functions to generate expressive, overcomplete spike population codes for each input scalar, with neural decoders reconstructing dense Q-values from sparse output activations. This increases the effective state encoding capacity, mitigates information loss, and restores discrimination between actions, matching non-spiking RL in autonomous driving tasks (Ghoreishee et al., 6 Apr 2026).

Ternary spiking neurons, especially those with asymmetric thresholds, enhance representational entropy and enable nonzero expected gradients during training, outperforming both binary and symmetric ternary models in deep Q-learning settings (Ghoreishee et al., 3 Jun 2025). A two-threshold (v_{th}^{p}, v_{th}^{n}) scheme prevents vanishing gradients, supporting stable learning and improved policy performance for on-board, event-driven agents.
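The two-threshold emission rule is straightforward to state: a positive spike above v_{th}^{p}, a negative spike below v_{th}^{n}, silence in between. A minimal sketch with assumed default thresholds (the asymmetry, i.e. |v_{th}^{n}| ≠ v_{th}^{p}, is the point):

```python
import numpy as np

def ternary_spike(u, v_thp=1.0, v_thn=-0.5):
    """Ternary spike emission with asymmetric thresholds:
    +1 if u >= v_thp, -1 if u <= v_thn, 0 otherwise."""
    return np.where(u >= v_thp, 1.0, np.where(u <= v_thn, -1.0, 0.0))
```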

5. Multi-Task and Contextual Modulation in DSQN

DSQN frameworks have also been extended to generalist, multi-task settings via explicit context gating and active dendritic integration. In such architectures, each neuron adapts its activation to a provided one-hot task context via learned “dendritic” weights, dynamically forming specialized subnetworks for each task (Devkota et al., 2024). This biases the SNN toward task-appropriate submanifolds in weight space, remedies catastrophic forgetting, and enables joint learning across heterogeneous RL tasks, such as Atari and image classification. Dueling network heads can be grafted to decompose Q-values into value and advantage streams, further supporting stable multi-task RL.
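The gating idea can be illustrated with a deliberately simplified hard gate: each unit's dendritic weights score the one-hot task context, and units with a non-positive score are silenced for that task. This is a sketch of the general mechanism only; MTSpark's actual dendritic modulation may differ, and all names here are assumptions.

```python
import numpy as np

def context_gated_activation(h, context_onehot, D):
    """Gate hidden pre-activations by a per-task dendritic signal.

    h:              pre-activations, shape (n,)
    context_onehot: one-hot task vector, shape (k,)
    D:              learned dendritic weights, shape (n, k)
    Units whose dendritic score for this task is non-positive are
    silenced, carving out a task-specific subnetwork.
    """
    gate = (D @ context_onehot > 0).astype(float)
    return h * gate
```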

6. Benchmarking, Energy Efficiency, and Empirical Results

Directly-trained DSQNs and conversion-based SNNs have been benchmarked extensively on Atari 2600 games (Tan et al., 2020, Chen et al., 2022, Liu et al., 2021, Sun et al., 2022, Devkota et al., 2024). End-to-end LIF-based DSQNs consistently match or exceed DQN and conversion-SNN performance, demonstrating high sample efficiency (e.g., convergence in 20 million frames vs. 50 million for DQN), robust Q-value learning curves, and superior resilience to adversarial input perturbations (Chen et al., 2022, Liu et al., 2021). In typical settings DSQNs match DQN normalized scores in 15 of 17 test cases, and potential-based normalization pushes scores beyond DQN on 15 of 16 tasks (Sun et al., 2022).

On neuromorphic hardware, DSQNs provide 95–97% per-inference energy savings over ANN/DQN baselines, with reduction factors derived from synaptic operation counts, spike sparsity, and event-driven memory access. Complex population-coding and attention-based DSQNs close the performance gap with non-spiking agents in multi-modal autonomous driving (Ghoreishee et al., 6 Apr 2026).
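A back-of-the-envelope energy comparison counts sparse accumulate (AC) operations against dense multiply-accumulate (MAC) operations. The 0.9 pJ/AC and 4.6 pJ/MAC defaults are commonly cited 45 nm estimates, used here purely as illustrative assumptions, as is the function itself; it is not the accounting method of any cited paper.

```python
def energy_ratio(snn_synops, spike_rate, ann_macs,
                 e_ac=0.9e-12, e_mac=4.6e-12):
    """Rough per-inference SNN/ANN energy ratio.

    snn_synops: potential synaptic operations per inference in the SNN
    spike_rate: average fraction of those that actually fire (sparsity)
    ann_macs:   MAC operations per inference in the ANN baseline
    """
    return (snn_synops * spike_rate * e_ac) / (ann_macs * e_mac)
```

With matched operation counts and 10% spike activity this yields a ratio of roughly 0.02, i.e. on the order of the 95–97% savings quoted above.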

7. Limitations, Open Problems, and Future Directions

Key limitations include the computational and memory overheads of BPTT, ad-hoc nature of surrogate gradient selection, and the gap between biological plausibility and current training regimes. Extension to continuous action domains, actor-critic architectures, and real-time hardware deployment remain active research areas (Sun et al., 2022).

Recent work points toward fully end-to-end SNN RL—eschewing any reliance on artificial neuron intermediaries or post-hoc conversion—as the path to optimal energy-performance trade-offs and on-chip deployment. Advances in input population coding, ternary/asymmetric spiking, and context-dependent routing will likely shape the next generation of spiking DRL agents.


References

  • "Strategy and Benchmark for Converting Deep Q-Networks to Event-Driven Spiking Neural Networks" (Tan et al., 2020)
  • "Deep Reinforcement Learning with Spiking Q-learning" (Chen et al., 2022)
  • "Fuzzy Encoding-Decoding to Improve Spiking Q-Learning Performance in Autonomous Driving" (Ghoreishee et al., 6 Apr 2026)
  • "Improving Performance of Spike-based Deep Q-Learning using Ternary Neurons" (Ghoreishee et al., 3 Jun 2025)
  • "Solving the Spike Feature Information Vanishing Problem in Spiking Deep Q Network with Potential Based Normalization" (Sun et al., 2022)
  • "MTSpark: Enabling Multi-Task Learning with Spiking Neural Networks for Generalist Agents" (Devkota et al., 2024)
  • "Human-Level Control through Directly-Trained Deep Spiking Q-Networks" (Liu et al., 2021)
