QKFormer: Hierarchical Spiking Transformer

Updated 7 March 2026
  • QKFormer is a hierarchical spiking transformer that utilizes spike-form Q-K attention, enabling efficient, large-scale neuromorphic vision with reduced memory and computation.
  • It integrates multi-scale spike-coded representations through hierarchical decomposition and a deformed-shortcut patch embedding (PEDS) to preserve spike timing fidelity.
  • Empirical evaluations show state-of-the-art performance on benchmarks like ImageNet and CIFAR, achieving high accuracy with significant energy savings on neuromorphic hardware.

QKFormer is a hierarchical spiking transformer architecture that introduces a spike-form Query-Key (Q-K) attention mechanism tailored for spiking neural networks (SNNs). It is designed for energy-efficient, large-scale neuromorphic vision tasks, achieving state-of-the-art direct-training SNN performance through hierarchical decomposition, sparse binary attention, and a bespoke patch embedding with deformed shortcuts. QKFormer enables linear-complexity attention, multi-scale spike-coded representation, and efficient neuromorphic hardware deployment (Zhou et al., 2024, Chen et al., 18 Sep 2025).

1. Spike-Form Q–K Attention Mechanism

QKFormer innovates by replacing the standard Query-Key-Value (QKV) attention triplet with a pure spike-based Q-K formulation. Let $Q, K \in \{0,1\}^{T \times N \times D}$ denote the spike-coded queries and keys, with $T$ time steps, $N$ tokens, and $D$ channels per head. There are two principal modes:

  • Token-Wise Attention (QKTA):

$Q$ is summed across channels for each token and thresholded via a spiking neuron function $\mathrm{SN}$:

$$A_t[j] = \mathrm{SN}\left(\sum_{i=1}^{D} Q_{i,j}\right) \in \{0,1\}, \quad j = 1, \dots, N$$

The output is a binary mask applied to $K$ via Hadamard product:

$$X'_{i,j} = A_t[j] \otimes K_{i,j}$$

A spiking MLP (two-layer, with batch normalization and $\mathrm{SN}$) projects this output back to the residual stream.
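The token masking steps above can be sketched in plain Python for a single time step and head. This is a minimal illustration, not the authors' implementation: the names `heaviside_sn` and `qk_token_attention` are hypothetical, and the spiking neuron is approximated by a stateless threshold, whereas a real spiking neuron carries membrane state across time steps.

```python
def heaviside_sn(x, threshold=1.0):
    """Stateless stand-in for the spiking neuron SN (an assumption):
    emit 1 when the summed input reaches the threshold, else 0."""
    return 1 if x >= threshold else 0


def qk_token_attention(Q, K, threshold=1.0):
    """Q and K are D x N binary matrices (lists of rows; row = channel i,
    column = token j). Returns the token mask A and the masked keys X."""
    D, N = len(Q), len(Q[0])
    # Sum Q over channels for each token, then threshold: A[j] in {0, 1}.
    A = [heaviside_sn(sum(Q[i][j] for i in range(D)), threshold) for j in range(N)]
    # Hadamard masking: token j of K survives only if A[j] == 1.
    X = [[A[j] * K[i][j] for j in range(N)] for i in range(D)]
    return A, X


Q = [[1, 0, 0],
     [1, 0, 1]]
K = [[1, 1, 1],
     [0, 1, 1]]
A, X = qk_token_attention(Q, K, threshold=2.0)
print(A)  # token mask
print(X)  # masked keys
```

Only additions and a mask are involved, and the attention is a length-N vector rather than an N-by-N map.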

  • Channel-Wise Attention (QKCA):

Analogously, tokens are summed to produce an attention vector over channels. Each form enables efficient modeling of importance across either token or channel dimension.
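The channel-wise mode can be sketched the same way. The snippet below assumes, by analogy with the token-wise case, that the channel mask is applied to the keys; the function name `qk_channel_attention` and the stateless threshold standing in for the spiking neuron are illustrative assumptions.

```python
def qk_channel_attention(Q, K, threshold=1.0):
    """Q and K are D x N binary matrices (row = channel i, column = token j).
    Returns the channel mask A (length D) and the masked keys X."""
    # Sum Q over tokens for each channel, then threshold (stand-in for SN).
    A = [1 if sum(row) >= threshold else 0 for row in Q]
    # Channel i of K survives only if A[i] == 1.
    X = [[A[i] * k for k in K[i]] for i in range(len(K))]
    return A, X


Q = [[1, 0, 0],
     [1, 0, 1]]
K = [[1, 1, 1],
     [0, 1, 1]]
A_c, X_c = qk_channel_attention(Q, K, threshold=2.0)
print(A_c)  # channel mask
print(X_c)  # masked keys
```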

This spike-form attention avoids full softmax and value projection. Per-head computational cost scales linearly, a sharp reduction compared with the quadratic token-to-token cost of vision self-attention (VSA) or spike self-attention (SSA), and the memory footprint shrinks accordingly because no token-by-token attention map is stored. All operations are confined to simple addition and masking, which suits deployment on neuromorphic hardware (Zhou et al., 2024).

2. Hierarchical Spiking Transformer Architecture

QKFormer employs a three-stage hierarchy to generate multi-scale spike-based representations.

  • Stage 1: Embeds image patches into spike-form feature maps at the highest token resolution and the base channel width.
  • Stage 2: Downsamples via patch embedding, halving the spatial dimensions and doubling the channel width.
  • Stage 3: Repeats the downsampling, quadrupling the channels relative to the first stage.

Each stage stacks multiple QKFormer blocks, each combining a spike-form attention module and a spiking MLP with residual connections.

This residual topology maintains membrane potentials and spike timing fidelity across all scales (Zhou et al., 2024).
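The shape bookkeeping implied by the three stages can be sketched as follows. The halving and doubling rules come from the stage description; the Stage 1 patch size (`first_patch=4`), the 224x224 input, and the base width of 96 are illustrative assumptions rather than values taken from the source.

```python
def stage_shapes(height, width, base_channels, first_patch=4):
    """Return [(H, W, C)] for a three-stage hierarchy.
    first_patch is an assumed Stage 1 patch size, not a sourced value."""
    h, w, c = height // first_patch, width // first_patch, base_channels
    shapes = [(h, w, c)]
    for _ in range(2):
        # Stages 2 and 3: halve each spatial dimension, double the channels.
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


shapes = stage_shapes(224, 224, 96)
for index, (h, w, c) in enumerate(shapes, start=1):
    print(f"Stage {index}: {h}x{w} feature map, {c} channels")
```

Under these assumptions the final stage has four times the channels of the first, matching the description above.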

3. Deformed-Shortcut Patch Embedding (PEDS)

Typical patch embedding in transformers disrupts residual connections because the input and output dimensions no longer match. QKFormer introduces a deformed-shortcut mechanism (PEDS) that learns a convolutional shortcut parallel to the main path and adds its output to that of the main path.

The main path comprises convolution–BN–pool–SN–conv–BN–SN, while the shortcut handles dimensional and stride adaptation. This design ensures spike timing information is propagated across the downsampling step, which empirically improves classification accuracy, as observed in ablations (e.g., on CIFAR100) (Zhou et al., 2024).
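The idea of a shortcut that adapts both stride and channel count can be illustrated with a one-dimensional toy. This is a sketch under stated assumptions, not the PEDS module itself: the shortcut is modeled as a strided pointwise (1x1-style) projection, batch normalization and spiking neurons are omitted, and `main_out` is a made-up stand-in for the main path's output. All function names are hypothetical.

```python
def strided_pointwise_conv(x, weight, stride):
    """x: list of positions, each a list of C_in values.
    weight: C_in x C_out matrix. Keeps every stride-th position and
    projects its channels, like a pointwise convolution with that stride."""
    c_out = len(weight[0])
    return [
        [sum(pos[ci] * weight[ci][co] for ci in range(len(pos))) for co in range(c_out)]
        for pos in x[::stride]
    ]


def add_paths(main, shortcut):
    """Element-wise residual sum; both paths must have identical shapes."""
    assert len(main) == len(shortcut) and len(main[0]) == len(shortcut[0])
    return [[m + s for m, s in zip(mr, sr)] for mr, sr in zip(main, shortcut)]


x = [[1, 0], [0, 1], [1, 1], [0, 0]]   # 4 positions, 2 channels
w_short = [[1, 0, 1], [0, 1, 1]]       # projects 2 -> 3 channels
main_out = [[0, 1, 0], [1, 0, 0]]      # assumed main-path output: 2 positions, 3 channels

shortcut_out = strided_pointwise_conv(x, w_short, stride=2)
y = add_paths(main_out, shortcut_out)
print(shortcut_out)
print(y)
```

An identity shortcut would fail the shape check in `add_paths` here, since `x` has 4 positions and 2 channels while the main path emits 2 positions and 3 channels.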

4. Mathematical Foundations and Training Methods

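The core neuronal unit is a leaky integrate-and-fire (LIF) spiking neuron. At each timestep, the membrane potential integrates the input current with a leak, the neuron emits a spike when the potential crosses the firing threshold, and the potential is then reset.

The spike function is non-differentiable, so its gradient is replaced by a surrogate gradient during backpropagation.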
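As a rough illustration of LIF dynamics and a surrogate gradient, the sketch below uses one commonly used discrete-time formulation with a hard reset and a sigmoid-derivative surrogate. These choices, along with the parameter values (`tau=2.0`, `v_th=1.0`, `alpha=4.0`) and function names, are assumptions for illustration and may differ from the formulation used in QKFormer.

```python
import math


def lif_forward(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents.
    Returns (spikes, pre-spike membrane potentials)."""
    v = v_reset
    spikes, potentials = [], []
    for x in inputs:
        # Leaky integration toward the input with time constant tau.
        h = v + (x - (v - v_reset)) / tau
        s = 1 if h >= v_th else 0   # Heaviside firing decision
        v = v_reset if s else h     # hard reset after a spike
        spikes.append(s)
        potentials.append(h)
    return spikes, potentials


def sigmoid_surrogate_grad(h, v_th=1.0, alpha=4.0):
    """Derivative of sigmoid(alpha * (h - v_th)), used in place of the
    Heaviside derivative (zero almost everywhere) in the backward pass."""
    sg = 1.0 / (1.0 + math.exp(-alpha * (h - v_th)))
    return alpha * sg * (1.0 - sg)


spikes, potentials = lif_forward([1.5, 1.5, 0.0, 3.0])
print(spikes)
print([round(sigmoid_surrogate_grad(h), 4) for h in potentials])
```

The surrogate peaks at the threshold and decays on either side, so neurons whose potential is close to firing receive the largest gradient signal.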

QKFormer is trained directly via backpropagation through time (BPTT) across the simulation time steps, using AdamW with a learning rate scaled by batch size. ImageNet-1K training uses batch size 512 on 8 V100 GPUs over 200 epochs, with augmentations including RandAugment, random erasing, and stochastic depth (Zhou et al., 2024).

5. Empirical Performance and Ablations

QKFormer outperforms prior directly trained SNNs on the reported benchmarks:

  • On ImageNet-1K, HST-10-768 surpasses Spikformer-8-768 in top-1 accuracy, which is reported as a new high for directly trained SNNs on ImageNet.
  • On CIFAR10 and CIFAR100, HST-4-384 improves on the compared baselines.
  • On the neuromorphic datasets DVS128 Gesture and CIFAR10-DVS, QKFormer likewise reports gains, exceeding Spikformer on CIFAR10-DVS.

Ablations show:

  • PEDS consistently boosts accuracy across benchmarks.
  • QKTA, QKCA, and their combination provide comparable accuracy, with QKCA preferred in certain layer configurations.
  • Memory reduction is marked: in the reported comparison, SSA blocks use on the order of 26MB while QKTA uses on the order of 2.5MB, roughly a tenfold saving.
  • Firing rates measured in Stage 1 are low, indicating strong event-driven sparsity (Zhou et al., 2024).

6. Neuromorphic Hardware Integration and NEURAL Accelerator

QKFormer is efficiently realized in hardware, as demonstrated by the NEURAL architecture, a hybrid data-event neuromorphic accelerator hosting QKFormer blocks natively (Chen et al., 18 Sep 2025). Key features include:

  • On-the-fly spike-driven QKFormer: Operations are embedded within a conventional spiking convolutional pipeline via a pair of elastic FIFOs per processing element and an OR register, eliminating the need for specialized transformer hardware.
  • Window-to-Time-to-First-Spike (W2TTFS): Average pooling is replaced by a fully spike-based downsampling mechanism that encodes the first spike event.
  • Single-timestep training: Using knowledge distillation (KD) from a high-accuracy ANN teacher, followed by quantization-aware and spike-only fine-tuning, enables low-latency inference.

NEURAL, implemented on a Xilinx Virtex-7 FPGA, supports QKFormer at minimal overhead: it halves logic and memory use versus prior SNN accelerators while sustaining real-time (68 FPS), low-power (0.79 W, 52.37 GSOPS/W) operation. Deploying spiking QKFormer blocks yields accuracy gains with extra latency on the order of 2 ms and negligible spike cost (energy per image on the order of 10 mJ) (Chen et al., 18 Sep 2025).

7. Limitations and Prospects

Current QKFormer models rely on surrogate gradient methods and still require multiple time-steps for large-scale image tasks such as ImageNet, whereas some downstream hardware and KD implementations operate with a single time-step. Future research may address latency reduction, self-supervised or contrastive pre-training tailored for SNNs, and end-to-end hardware-software co-design. A plausible implication is that architectural advances in spike-form attention and multi-scale SNNs, as demonstrated by QKFormer, will influence both algorithmic and neuromorphic system development for efficient, large-scale event-driven learning (Zhou et al., 2024, Chen et al., 18 Sep 2025).
