QKFormer: Hierarchical Spiking Transformer
- QKFormer is a hierarchical spiking transformer that utilizes spike-form Q-K attention, enabling efficient, large-scale neuromorphic vision with reduced memory and computation.
- It integrates multi-scale spike-coded representations through hierarchical decomposition and a deformed-shortcut patch embedding (PEDS) to preserve spike timing fidelity.
- Empirical evaluations show state-of-the-art performance on benchmarks like ImageNet and CIFAR, achieving high accuracy with significant energy savings on neuromorphic hardware.
QKFormer is a hierarchical spiking transformer architecture that introduces a spike-form Query-Key (Q-K) attention mechanism tailored for spiking neural networks (SNNs). It is designed for energy-efficient, large-scale neuromorphic vision tasks, achieving state-of-the-art direct-training SNN performance through hierarchical decomposition, sparse binary attention, and a bespoke patch embedding with deformed shortcuts. QKFormer enables linear-complexity attention, multi-scale spike-coded representation, and efficient neuromorphic hardware deployment (Zhou et al., 2024, Chen et al., 18 Sep 2025).
1. Spike-Form Q–K Attention Mechanism
QKFormer innovates by replacing the standard Query-Key-Value (QKV) attention triplet with a pure spike-based Q-K formulation. Let $Q, K \in \{0,1\}^{T \times N \times D}$ denote the spike-coded queries and keys, with $T$ time steps, $N$ tokens, and $D$ channels per head. There are two principal modes:
- Token-Wise Attention (QKTA):
$Q$ is summed across channels for each token and thresholded via a spiking neuron function $\mathrm{SN}(\cdot)$:
$$A_t = \mathrm{SN}\left(\sum_{d=1}^{D} Q_{:,d}\right)$$
The output $A_t \in \{0,1\}^{N \times 1}$ is a binary mask applied to $K$ via Hadamard product:
$$X' = A_t \odot K$$
A spiking MLP (two-layer, with batch normalization and $\mathrm{SN}$) projects this output back to the residual stream.
- Channel-Wise Attention (QKCA):
Analogously, query spikes are summed over tokens to produce a binary attention vector over channels. Each mode models importance along a single dimension, either token or channel.
This spike-form attention avoids the softmax and the value projection, yielding per-head computational complexity that is linear rather than quadratic in the number of tokens, a sharp reduction compared to the $O(N^2 D)$ cost of vision self-attention (VSA) or spike self-attention (SSA). The attention state also shrinks from an $N \times N$ map to a single binary vector over tokens (QKTA) or channels (QKCA). All operations are confined to simple addition and masking, which suits deployment on neuromorphic hardware (Zhou et al., 2024).
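The two attention modes can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the spiking neuron is replaced by a stateless Heaviside threshold (a real spiking neuron also carries membrane state across time steps).

```python
import numpy as np

def spike(x, threshold=1.0):
    """Stateless Heaviside stand-in for the spiking neuron (assumption)."""
    return (x >= threshold).astype(np.float32)

def qk_token_attention(q, k, threshold=1.0):
    """QKTA sketch: q and k are binary arrays of shape (T, N, D)."""
    # Sum each token's query spikes over channels, then threshold: (T, N, 1) mask.
    token_mask = spike(q.sum(axis=-1, keepdims=True), threshold)
    # Hadamard product with k keeps or drops whole tokens.
    return token_mask * k

def qk_channel_attention(q, k, threshold=1.0):
    """QKCA sketch: the same idea with the sum taken over tokens."""
    channel_mask = spike(q.sum(axis=-2, keepdims=True), threshold)  # (T, 1, D)
    return channel_mask * k

rng = np.random.default_rng(0)
T, N, D = 4, 16, 8  # illustrative sizes, not taken from the source
q = (rng.random((T, N, D)) < 0.2).astype(np.float32)
k = (rng.random((T, N, D)) < 0.2).astype(np.float32)
print(qk_token_attention(q, k).shape, qk_channel_attention(q, k).shape)
```

Neither function ever forms an $N \times N$ matrix: the only intermediate is a mask with one entry per token or per channel, which is where the memory saving comes from.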
2. Hierarchical Spiking Transformer Architecture
QKFormer employs a three-stage hierarchy to generate multi-scale spike-based representations.
- Stage 1: Processes $4 \times 4$ image patches, yielding feature maps at $\frac{H}{4} \times \frac{W}{4}$ resolution with width $D$.
- Stage 2: Downsamples by $2 \times 2$ patch embedding, further halving spatial dimensions and doubling channel width.
- Stage 3: Repeats downsampling to $\frac{H}{16} \times \frac{W}{16}$, quadrupling the channels relative to the first stage.
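Each stage comprises a stack of QKFormer blocks, each constructed as follows:
$$X' = \mathrm{Attn}(X) + X, \qquad X'' = \mathrm{SMLP}(X') + X'$$
where $\mathrm{Attn}$ is the spike-form attention and $\mathrm{SMLP}$ is the spiking MLP.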
This residual topology maintains membrane potentials and spike timing fidelity across all scales (Zhou et al., 2024).
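The shape bookkeeping of the hierarchy can be checked with a small helper. This is a sketch under stated assumptions: `stage_shapes` is a hypothetical name, the initial 4x spatial reduction is an assumed default, and the 224-pixel input and base width of 192 are illustrative values only.

```python
def stage_shapes(height, width, base_channels, first_patch=4, num_stages=3):
    """Return (H_i, W_i, C_i) per stage: resolution halves and width doubles."""
    shapes = []
    h, w, c = height // first_patch, width // first_patch, base_channels
    for _ in range(num_stages):
        shapes.append((h, w, c))
        # Each later patch embedding halves both spatial dims and doubles channels.
        h, w, c = h // 2, w // 2, c * 2
    return shapes

for i, (h, w, c) in enumerate(stage_shapes(224, 224, 192), start=1):
    print(f"stage {i}: {h}x{w} map, {h * w} tokens, {c} channels")
```

The token count drops by a factor of four per stage while the channel width only doubles, so the early, high-resolution stages are where an attention cost that is quadratic in tokens would hurt most.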
3. Deformed-Shortcut Patch Embedding (PEDS)
Typical patch embedding in transformers disrupts residual connections due to mismatched dimensions. QKFormer introduces a deformed-shortcut mechanism (PEDS) that learns a lightweight convolutional shortcut, $\mathcal{R}$, parallel to the main path $\mathcal{F}$:
$$Y = \mathcal{F}(x) + \mathcal{R}(x)$$
Here, $\mathcal{F}$ comprises convolution–BN–pool–SN–conv–BN–SN, and $\mathcal{R}$ enables dimensional and stride adaptation. This design ensures spike timing information is propagated, which empirically improves classification accuracy, as observed via ablation (e.g., on CIFAR100) (Zhou et al., 2024).
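A simplified NumPy sketch shows why the shortcut has to be "deformed": the main path changes both resolution and channel count, so an identity shortcut cannot be added to it. Everything here is an illustrative assumption, not the published layer: all convolutions are 1x1, batch normalization is omitted, the shortcut uses a stride of 2, the spiking neuron is a stateless threshold, and the function names are hypothetical.

```python
import numpy as np

def spike(x, threshold=1.0):
    """Stateless Heaviside stand-in for the spiking neuron (SN)."""
    return (x >= threshold).astype(np.float32)

def pointwise_conv(x, weight, stride=1):
    """1x1 convolution on an (H, W, C_in) map; weight has shape (C_in, C_out)."""
    return x[::stride, ::stride, :] @ weight

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) map with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def peds(x, w_main1, w_main2, w_shortcut):
    # Main path: conv -> pool -> SN -> conv -> SN (batch norm omitted here).
    main = spike(max_pool_2x2(pointwise_conv(x, w_main1)))
    main = spike(pointwise_conv(main, w_main2))
    # Deformed shortcut: a strided 1x1 conv that matches the main path's shape.
    shortcut = pointwise_conv(x, w_shortcut, stride=2)
    return main + shortcut

rng = np.random.default_rng(0)
x = (rng.random((8, 8, 3)) < 0.3).astype(np.float32)  # binary input spikes
w_main1 = rng.normal(size=(3, 6)).astype(np.float32)
w_main2 = rng.normal(size=(6, 6)).astype(np.float32)
w_shortcut = rng.normal(size=(3, 6)).astype(np.float32)
y = peds(x, w_main1, w_main2, w_shortcut)
print(x.shape, "->", y.shape)  # (8, 8, 3) -> (4, 4, 6)
```

The strided projection is what lets the residual sum go through when the input and output differ in both spatial size and width.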
4. Mathematical Foundations and Training Methods
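The core neuronal unit is a leaky integrate-and-fire (LIF) spiking neuron, with the following update at timestep $t$:
$$H[t] = V[t-1] + \frac{1}{\tau}\left(X[t] - \left(V[t-1] - V_{\text{reset}}\right)\right)$$
$$S[t] = \Theta\left(H[t] - V_{\text{th}}\right)$$
$$V[t] = H[t]\,(1 - S[t]) + V_{\text{reset}}\,S[t]$$
where $X[t]$ is the input current, $H[t]$ and $V[t]$ are the membrane potential before and after firing, $\tau$ is the membrane time constant, $V_{\text{th}}$ is the firing threshold, and $\Theta$ is the Heaviside step function.
The non-differentiable spike function $\Theta$ is handled by a surrogate gradient:
$$\frac{\partial S[t]}{\partial H[t]} \approx g'\left(H[t] - V_{\text{th}}\right)$$
with $g$ a smooth approximation of $\Theta$, such as a scaled sigmoid.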
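The LIF dynamics and a surrogate gradient can be simulated directly. The sketch below assumes the standard charge, fire, and hard-reset form of the LIF update and, for the backward pass, a sigmoid surrogate with slope `alpha`; the parameter values and function names are illustrative assumptions, not settings taken from the source.

```python
import numpy as np

def lif_forward(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate LIF neurons over time; inputs has shape (T, ...)."""
    v = np.full(inputs.shape[1:], v_reset, dtype=np.float64)
    spikes = np.zeros_like(inputs, dtype=np.float64)
    for t in range(inputs.shape[0]):
        h = v + (inputs[t] - (v - v_reset)) / tau   # charge (leaky integration)
        s = (h >= v_threshold).astype(np.float64)   # fire (Heaviside step)
        v = h * (1.0 - s) + v_reset * s             # hard reset where a spike fired
        spikes[t] = s
    return spikes

def sigmoid_surrogate_grad(h, v_threshold=1.0, alpha=4.0):
    """Derivative of sigmoid(alpha * (h - v_threshold)), used instead of the
    Heaviside derivative during backpropagation (assumed surrogate choice)."""
    sig = 1.0 / (1.0 + np.exp(-alpha * (h - v_threshold)))
    return alpha * sig * (1.0 - sig)

# A constant input of 1.5 makes the neuron fire on every second step.
print(lif_forward(np.full((4, 1), 1.5))[:, 0])  # [0. 1. 0. 1.]
```

The surrogate is largest at the threshold and decays on either side, so gradients flow mainly through neurons whose membrane potential is close to firing.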
QKFormer is directly trained via backpropagation through time (BPTT) across $T$ steps, using AdamW with a learning rate scaled by batch size. ImageNet-1K training uses batch size 512 on 8 V100 GPUs over 200 epochs, with augmentations including RandAugment, random erasing, and stochastic depth (Zhou et al., 2024).
5. Empirical Performance and Ablations
QKFormer reports higher accuracy than prior directly trained SNNs on the following benchmarks:
- On ImageNet-1K, HST-10-768 (64.96M params) achieves 85.65% top-1 accuracy, surpassing Spikformer-8-768 (74.81%) by 10.84 percentage points. This is the first instance of a directly trained SNN exceeding 85% top-1 accuracy on ImageNet.
- On CIFAR10 and CIFAR100, HST-4-384 improves on the compared baseline.
- On the event-based datasets DVS128 Gesture and CIFAR10-DVS, QKFormer is also evaluated, exceeding Spikformer on CIFAR10-DVS.
Ablations show:
- PEDS consistently boosts accuracy across benchmarks.
- QKTA, QKCA, and their combination provide comparable accuracy, with QKCA favored in some layer configurations.
- Memory reduction is marked: in the reported comparison, SSA blocks use roughly ten times the memory of QKTA blocks.
- Firing rates reported for Stage 1 are low, indicating strong event-driven sparsity (Zhou et al., 2024).
6. Neuromorphic Hardware Integration and NEURAL Accelerator
QKFormer is efficiently realized in hardware, as demonstrated by the NEURAL architecture, a hybrid data-event neuromorphic accelerator hosting QKFormer blocks natively (Chen et al., 18 Sep 2025). Key features include:
- On-the-fly spike-driven QKFormer: Operations are embedded within a conventional spiking convolutional pipeline via a pair of elastic FIFOs per processing element and an OR register, eliminating the need for specialized transformer hardware.
- Window-to-Time-to-First-Spike (W2TTFS): Average pooling is replaced by a fully spike-based downsampling mechanism that encodes the first spike event.
- Single-timestep training: Using knowledge distillation (KD) from a high-accuracy ANN teacher, followed by quantization-aware and spike-only fine-tuning, enables low-latency inference.
NEURAL, implemented on a Xilinx Virtex-7 FPGA, supports QKFormer at minimal overhead: it halves logic and memory use versus prior SNN accelerators while sustaining real-time (68 FPS), low-power (0.79 W, 52.37 GSOPS/W) operation. Deploying spiking QKFormer blocks yields accuracy gains with only about 2 ms of extra latency and negligible spike cost (energy per image about 10 mJ) (Chen et al., 18 Sep 2025).
7. Limitations and Prospects
Current QKFormer models rely on surrogate gradient methods and still require multiple time steps for large-scale image tasks such as ImageNet, while some downstream hardware and KD implementations achieve single-timestep ($T = 1$) inference. Future research may address latency reduction, self-supervised or contrastive pre-training tailored for SNNs, and end-to-end hardware-software co-design. A plausible implication is that architectural advances in spike-form attention and multi-scale SNNs, as demonstrated by QKFormer, will influence both algorithmic and neuromorphic system development for efficient, large-scale event-driven learning (Zhou et al., 2024, Chen et al., 18 Sep 2025).