
Spiking STDP Transformer (S²TDPT)

Updated 25 November 2025
  • The paper introduces S²TDPT, a neuromorphic transformer using STDP-based attention to reduce inference energy and eliminate floating-point operations.
  • It employs a multi-step LIF spiking neuron model with temporal encoding, achieving competitive accuracy on CIFAR benchmarks while cutting energy costs.
  • The design supports efficient in-memory computation and transparent interpretability through techniques like spiking Grad-CAM for clear attention mapping.

The Spiking STDP Transformer (S$^{2}$TDPT) is a neuromorphic deep learning architecture that models self-attention using spike-timing-dependent plasticity (STDP), leveraging principles of biological learning to enable energy-efficient, hardware-friendly, and interpretable transformer models. The framework is specifically designed for deployment in neuromorphic computing environments, targeting the fundamental limitations of conventional transformer attention mechanisms, such as energy inefficiency, reliance on floating-point operations, and the von Neumann memory bottleneck. Notably, S$^{2}$TDPT demonstrates competitive accuracy on benchmark datasets with a substantial reduction in inference energy compared to standard artificial neural network (ANN) transformers (Mondal et al., 18 Nov 2025).

1. Model Architecture and Spiking Neuron Dynamics

S$^{2}$TDPT employs a hierarchical encoder structure operating on temporally replicated static images (e.g., CIFAR-10/100), processed over four discrete simulation timesteps. The initial input $I \in \mathbb{R}^{T \times C \times H \times W}$ is transformed through Spiking Patch Splitting (SPS), producing a patch-wise membrane potential tensor $U_0 \in \mathbb{R}^{T \times N \times D}$, where $N$ is the spatial token count and $D$ the embedding dimension.

Each encoder layer $\ell$ includes the following sublayers (a minimal composition sketch in code follows the list):

  • S$^{2}$TDPSA: an STDP-based self-attention sublayer,
  • Two-layer spiking MLP,
  • Pre- and post-attention membrane residual connections.
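To make this composition concrete, here is a minimal PyTorch sketch of one encoder layer as described above; the class name SpikingEncoderLayer and the exact placement of the membrane residuals are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class SpikingEncoderLayer(nn.Module):
    """One S^2TDPT encoder layer: an STDP-based self-attention sublayer and a
    two-layer spiking MLP, each wrapped in a membrane-potential residual."""
    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn = attn   # S^2TDPSA sublayer (see Section 2)
        self.mlp = mlp     # two-layer spiking MLP

    def forward(self, u):           # u: membrane potentials, shape [T, B, N, D]
        u = u + self.attn(u)        # pre-attention membrane residual
        u = u + self.mlp(u)         # post-attention membrane residual
        return u
```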

Neuron dynamics follow the multi-step leaky-integrate-and-fire (LIF) model:

$$U[t] = H[t-1] + X[t]$$

$$S[t] = \Theta(U[t] - V_{th})$$

$$H[t] = V_{reset} \cdot S[t] + \beta \, U[t] \, (1 - S[t])$$

where $\Theta$ is the Heaviside step function, $X[t]$ the synaptic input, $V_{th}$ the threshold, $V_{reset}$ the reset potential, and $\beta$ the leak factor. Surrogate gradients facilitate end-to-end training via backpropagation.
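The update rule maps directly to a short loop over timesteps; below is a minimal PyTorch sketch, assuming a zero initial state and a plain Heaviside threshold (the surrogate gradient used during training is discussed in Section 4). The function name and default constants are illustrative.

```python
import torch

def multistep_lif(x, v_th=1.0, v_reset=0.0, beta=0.5):
    """Multi-step LIF over input currents x of shape [T, ...]:
    U[t] = H[t-1] + X[t];  S[t] = Theta(U[t] - V_th);
    H[t] = V_reset * S[t] + beta * U[t] * (1 - S[t])."""
    h = torch.zeros_like(x[0])              # resting state before the first step
    spikes = []
    for t in range(x.shape[0]):
        u = h + x[t]                        # charge: previous state plus synaptic input
        s = (u >= v_th).float()             # fire where the threshold is reached
        h = v_reset * s + beta * u * (1.0 - s)  # hard reset where spiked, leak otherwise
        spikes.append(s)
    return torch.stack(spikes)              # binary spike tensor, shape [T, ...]
```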

2. STDP-Based Self-Attention Mechanism

Attention within S$^{2}$TDPT is computed by embedding query-key similarity into local synaptic weights through STDP, completely eschewing non-biological operations such as dot-products or softmax normalization.

Temporal Coding: Binary spike tensors $Q_s, K_s \in \{0,1\}^{B \times N \times D_H}$ encode queries and keys, respectively. Spike counts $r_Q(i), r_K(j)$ are mapped to latencies $t_Q(i), t_K(j)$, enabling temporal encoding of tokens. Temporal differences $\Delta t_{ij} = t_Q(i) - t_K(j)$ serve as the input to the STDP kernel.

STDP Kernel: Synaptic modification follows an asymmetric exponential rule:

$$\Delta w_{ij}(\Delta t) = \begin{cases} A_{+} \exp\bigl(\tfrac{\Delta t}{\tau_{+}}\bigr), & \Delta t < 0 \text{ (LTP)} \\ -A_{-} \exp\bigl(-\tfrac{\Delta t}{\tau_{-}}\bigr), & \Delta t \geq 0 \text{ (LTD)} \end{cases}$$

with potentiation coefficient $A_{+}$, depression coefficient $A_{-}$, and time constants $\tau_{+}, \tau_{-}$.

Attention Weighting: An offset $w_{\text{offset}}$ ensures positivity in $A_{ij} = \Delta w_{ij} + w_{\text{offset}}$ (no explicit softmax or dot-product). The attended value is aggregated by addition-only operations, supporting efficient in-memory computation.
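The following NumPy sketch strings these three steps together for a single head; the rate-to-latency mapping, the t_max constant, and the default w_offset = A_- (the smallest value that keeps $A_{ij}$ non-negative at $\Delta t = 0$) are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def stdp_attention(q_spikes, k_spikes, v_spikes, A_plus=0.1, A_minus=0.1,
                   tau_plus=2.0, tau_minus=2.0, w_offset=None, t_max=4.0):
    """q_spikes, k_spikes, v_spikes: binary spike tensors of shape [N, D_H]."""
    # Temporal coding: higher spike count -> earlier (smaller) latency.
    r_q, r_k = q_spikes.sum(axis=1), k_spikes.sum(axis=1)
    t_q = t_max * (1.0 - r_q / (r_q.max() + 1e-9))
    t_k = t_max * (1.0 - r_k / (r_k.max() + 1e-9))

    dt = t_q[:, None] - t_k[None, :]        # Delta t_ij = t_Q(i) - t_K(j)

    # Asymmetric exponential STDP kernel: LTP for dt < 0, LTD for dt >= 0.
    dw = np.where(dt < 0.0,
                  A_plus * np.exp(dt / tau_plus),
                  -A_minus * np.exp(-dt / tau_minus))

    if w_offset is None:
        w_offset = A_minus                  # Delta w >= -A_minus, so this keeps A_ij >= 0
    A = dw + w_offset                       # attention weights, no softmax

    # With binary value spikes, this matrix product reduces to accumulations
    # (addition-only aggregation) on neuromorphic hardware.
    return A @ v_spikes
```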

3. Input Representation and Temporal Dynamics

Input images are temporally broadcast, with SPS performing patch-wise spiking convolutional processing:

  • $Z_0 = \text{BN}(\text{Conv2d}(I))$
  • $Z_{SPE} = \text{BN}(\text{Conv2d}(\text{SN}(Z_i)))$
  • $Z_{SPED} = \text{BN}(\text{Conv2d}(\text{MP}(\text{SN}(Z_j))))$

Tokens are assigned via convolutional stages, and multi-timestep inference propagates membrane potentials through attention and MLP blocks per simulation step. Global pooling aggregates the outputs for final classification through a fully connected layer and softmax.
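A compact PyTorch sketch of the SPS stages listed above follows; the channel widths, kernel sizes, and the spiking_neuron placeholder (in practice a LIF module like the one in Section 1) are assumptions made for illustration.

```python
import torch.nn as nn

class SpikingPatchSplitting(nn.Module):
    """SPS stages: Z_0 = BN(Conv2d(I)); Z_SPE = BN(Conv2d(SN(Z_i)));
    Z_SPED = BN(Conv2d(MP(SN(Z_j)))). Channel widths are illustrative."""
    def __init__(self, in_ch=3, embed_dim=384, spiking_neuron=nn.Identity):
        super().__init__()
        self.stage0 = nn.Sequential(                      # Z_0
            nn.Conv2d(in_ch, embed_dim // 4, 3, padding=1),
            nn.BatchNorm2d(embed_dim // 4))
        self.stage1 = nn.Sequential(                      # Z_SPE
            spiking_neuron(),
            nn.Conv2d(embed_dim // 4, embed_dim // 2, 3, padding=1),
            nn.BatchNorm2d(embed_dim // 2))
        self.stage2 = nn.Sequential(                      # Z_SPED (max-pool downsampling)
            spiking_neuron(), nn.MaxPool2d(2),
            nn.Conv2d(embed_dim // 2, embed_dim, 3, padding=1),
            nn.BatchNorm2d(embed_dim))

    def forward(self, x):      # x: [B, C, H, W], one timestep of the broadcast input
        return self.stage2(self.stage1(self.stage0(x)))
```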

4. Training Protocol and Optimization

STDP modulates attention in the forward pass but does not serve as a learning rule for projection weights. All learnable parameters (SPS conv weights, MLP weights, scaling factors, thresholds, and offsets) are optimized end-to-end with backpropagation.

Key training parameters:

  • Batch size: 64
  • Timesteps: $T = 4$
  • Learning rate: $1 \times 10^{-3}$ (cosine/step decay)
  • Epochs: 200
  • STDP kernel: $A_+ = A_- = 0.1$, $\tau_+ = \tau_- = 2.0$
  • Weight decay: $1 \times 10^{-4}$
  • Optimizer: Adam or SGD with momentum

Loss function: cross-entropy on the final output $Y$. Surrogate gradient techniques (e.g., a piecewise-linear $\Theta'$) are employed for gradient flow through spiking activations.
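A standard way to realize the surrogate gradient is a custom autograd function; the sketch below uses a triangular (piecewise-linear) surrogate of unit width around the threshold, which is a common choice and an assumption here rather than the paper's exact $\Theta'$.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass; piecewise-linear surrogate in the backward pass."""
    @staticmethod
    def forward(ctx, u, v_th):
        ctx.save_for_backward(u)
        ctx.v_th = v_th
        return (u >= v_th).float()

    @staticmethod
    def backward(ctx, grad_out):
        (u,) = ctx.saved_tensors
        # Theta'(u - v_th) approximated by max(0, 1 - |u - v_th|)
        surrogate = torch.clamp(1.0 - (u - ctx.v_th).abs(), min=0.0)
        return grad_out * surrogate, None   # no gradient w.r.t. the threshold

# Usage: spikes = SpikeFn.apply(membrane_potential, 1.0)
```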

5. Empirical Performance and Energy Efficiency

On CIFAR benchmarks, S$^{2}$TDPT achieves:

  • 94.35% top-1 accuracy (CIFAR-10)
  • 78.08% top-1 accuracy (CIFAR-100)

Energy consumption for a four-timestep CIFAR-100 inference is quantified by:

$$E^{S^2TDPT} = E_{AC} \left( \sum SOP^{Conv} + \sum SOP^{SSA} \right) + E_{MAC} \cdot FLOP^{Conv}_1$$

with $E_{AC} = 0.9~\text{pJ}$ and $E_{MAC} = 46~\text{pJ}$ (45 nm CMOS), totaling 0.49 mJ per inference.

Relative efficiency:

  • 88.47% energy reduction versus a standard ANN Transformer (4.25 mJ)
  • 37.97% reduction versus Spikformer (0.79 mJ); see the arithmetic check after this list
  • Hardware alignment: neuromorphic crossbars (memristive, phase-change, and 2D-material devices) supporting local STDP
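The quoted reductions follow directly from the reported per-inference energies; a quick arithmetic check:

```python
e_s2tdpt, e_ann, e_spikformer = 0.49, 4.25, 0.79   # mJ per CIFAR-100 inference

print(f"vs ANN Transformer: {(1 - e_s2tdpt / e_ann) * 100:.2f}% reduction")         # 88.47%
print(f"vs Spikformer:      {(1 - e_s2tdpt / e_spikformer) * 100:.2f}% reduction")  # 37.97%
```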

6. Model Interpretability via Spiking Grad-CAM

S$^{2}$TDPT supports transparent attention analysis using an adapted Grad-CAM:

  • Class-score gradients w.r.t. the final block's membrane potentials yield pooled channel weights $\alpha_k$
  • The heatmap $H = \text{ReLU}\bigl(\sum_k \alpha_k U_L^{(k)}\bigr)$ overlays semantically relevant image regions
  • Spike Firing Rate (SFR) maps $SFR(i, j)$, averaged over heads, timesteps, and layers, highlight focal attention patterns

Observations indicate that activation maps align with object boundaries (e.g., the car body or the dog's head) and that high SFR correlates with these regions, substantiating the emergence of interpretable, object-centric attention in the STDP-driven architecture.
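For reference, a minimal PyTorch sketch of the adapted Grad-CAM computation described above, assuming access to the final block's membrane potentials $U_L$ (with gradients enabled) and the scalar class score; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def spiking_grad_cam(u_last, class_score):
    """u_last: final-block membrane potentials, shape [C, H', W'], requires_grad=True;
    class_score: scalar logit y_c for the target class."""
    grads = torch.autograd.grad(class_score, u_last, retain_graph=True)[0]    # dy_c / dU_L
    alpha = grads.mean(dim=(1, 2))                            # pooled channel weights alpha_k
    cam = F.relu((alpha[:, None, None] * u_last).sum(dim=0))  # H = ReLU(sum_k alpha_k U_L^(k))
    return cam / (cam.max() + 1e-9)                           # normalized heatmap for overlay
```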

7. Comparison with Standard ANN Transformers

Architectural Distinctions: S$^{2}$TDPT's attention weights are computed via local spike timing (binary spikes, $\Delta t \rightarrow$ STDP kernel), entirely in memory, with no floating-point dot-product or intermediate attention-matrix storage typical of ANN transformers.

Computational Complexity:

  • ANN Transformers: $O(N^2 D)$ MACs per head, explicit softmax
  • S$^{2}$TDPT: $O(N^2)$ additions, lookup-based exponentials (or a crossbar implementation), softmax eliminated; see the worked count after this list
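For a sense of scale, a back-of-the-envelope count with hypothetical values $N = 64$ tokens and $D = 64$ per-head dimension (illustrative numbers, not taken from the paper):

```python
N, D = 64, 64                       # assumed token count and per-head dimension

ann_macs = N * N * D                # QK^T similarity: O(N^2 D) multiply-accumulates per head
snn_adds = N * N                    # STDP kernel + offset: O(N^2) additions per head

print(f"ANN attention:     {ann_macs:,} MACs per head")       # 262,144
print(f"S^2TDPT attention: {snn_adds:,} additions per head")  # 4,096
```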

Memory and Hardware: ANN attention matrices impose quadratic memory traffic and the von Neumann bottleneck. S$^{2}$TDPT stores synaptic weights locally, drastically reducing off-chip bandwidth.

Energy Profile: S$^{2}$TDPT replaces high-power 32-bit MACs (46 pJ) with 1-bit ACs (0.9 pJ), mapping efficiently to crossbar hardware with event-driven, addition-only operations.

S$^{2}$TDPT concretely demonstrates that biologically inspired spiking attention can address scaling, efficiency, and explainability challenges inherent to transformer models in neuromorphic contexts (Mondal et al., 18 Nov 2025).
