Spiking STDP Transformer (S²TDPT)
- The paper introduces S²TDPT, a neuromorphic transformer using STDP-based attention to reduce inference energy and eliminate floating-point operations.
- It employs a multi-step LIF spiking neuron model with temporal encoding, achieving competitive accuracy on CIFAR benchmarks while cutting energy costs.
- The design supports efficient in-memory computation and transparent interpretability through techniques like spiking Grad-CAM for clear attention mapping.
The Spiking STDP Transformer (S²TDPT) is a neuromorphic deep learning architecture that models self-attention using spike-timing-dependent plasticity (STDP), leveraging principles of biological learning to enable energy-efficient, hardware-friendly, and interpretable transformer models. The framework is designed for deployment in neuromorphic computing environments and targets fundamental limitations of conventional transformer attention mechanisms: energy inefficiency, reliance on floating-point operations, and the von Neumann memory bottleneck. Notably, S²TDPT demonstrates competitive accuracy on benchmark datasets with a substantial reduction in inference energy compared to standard artificial neural network (ANN) transformers (Mondal et al., 18 Nov 2025).
1. Model Architecture and Spiking Neuron Dynamics
S²TDPT employs a hierarchical encoder structure operating on temporally replicated static images (e.g., CIFAR-10/100), processed over four discrete simulation timesteps. The initial input is transformed through Spiking Patch Splitting (SPS), producing a patch-wise membrane potential tensor $U \in \mathbb{R}^{T \times N \times D}$, where $T$ is the number of timesteps, $N$ the spatial token count, and $D$ the embedding dimension.
Each encoder layer includes (a minimal forward-pass sketch follows this list):
- STDPSA: an STDP-based self-attention sublayer,
- Two-layer spiking MLP,
- Pre- and post-attention membrane residual connections.
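The layer structure can be sketched as follows; `stdp_attention` and `spiking_mlp` are placeholder modules (any map from a $[T, B, N, D]$ tensor to the same shape), not the paper's implementation.

```python
import torch.nn as nn

class SpikingEncoderLayer(nn.Module):
    """One encoder block: STDP-based self-attention (STDPSA) followed by a spiking MLP,
    with residual connections applied at the membrane-potential level."""

    def __init__(self, stdp_attention: nn.Module, spiking_mlp: nn.Module):
        super().__init__()
        self.attn = stdp_attention   # STDPSA sublayer (placeholder)
        self.mlp = spiking_mlp       # two-layer spiking MLP (placeholder)

    def forward(self, u):
        # u: membrane-potential tensor of shape [T, B, N, D]
        u = u + self.attn(u)         # pre-attention membrane residual
        u = u + self.mlp(u)          # post-attention membrane residual
        return u
```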
Neuron dynamics follow the multi-step leaky-integrate-and-fire (LIF) model:

$$
U[t] = H[t-1] + X[t], \qquad
S[t] = \Theta\bigl(U[t] - V_{\mathrm{th}}\bigr), \qquad
H[t] = V_{\mathrm{reset}}\, S[t] + \beta\, U[t]\,\bigl(1 - S[t]\bigr),
$$

where $\Theta(\cdot)$ is the Heaviside step, $X[t]$ the synaptic input, $V_{\mathrm{th}}$ the threshold, $V_{\mathrm{reset}}$ the reset potential, and $\beta$ the leak factor. Surrogate gradients facilitate end-to-end training via backpropagation.
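The dynamics above can be sketched in PyTorch as follows; this is a minimal illustration assuming a hard reset and a triangular (piecewise-linear) surrogate of width `alpha`, with `beta`, `v_threshold`, and `v_reset` as placeholder values rather than the paper's settings.

```python
import torch

class HeavisideSurrogate(torch.autograd.Function):
    """Heaviside spike function with a piecewise-linear (triangular) surrogate gradient."""

    @staticmethod
    def forward(ctx, u_minus_thresh, alpha=1.0):
        ctx.save_for_backward(u_minus_thresh)
        ctx.alpha = alpha
        return (u_minus_thresh >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u_minus_thresh,) = ctx.saved_tensors
        # Surrogate: max(0, 1 - |U - V_th| / alpha) / alpha
        surrogate = torch.clamp(1.0 - u_minus_thresh.abs() / ctx.alpha, min=0.0) / ctx.alpha
        return grad_output * surrogate, None


def multi_step_lif(x, v_threshold=1.0, v_reset=0.0, beta=0.9):
    """Run a multi-step LIF neuron over the leading time dimension.

    x: synaptic input of shape [T, ...]; returns binary spikes of the same shape.
    """
    spikes = []
    h = torch.zeros_like(x[0])                      # post-reset membrane state
    for t in range(x.shape[0]):
        u = h + x[t]                                # integrate synaptic input
        s = HeavisideSurrogate.apply(u - v_threshold)
        h = v_reset * s + beta * u * (1.0 - s)      # hard reset on spike, leaky decay otherwise
        spikes.append(s)
    return torch.stack(spikes)
```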
2. STDP-Based Self-Attention Mechanism
Attention within S²TDPT is computed by embedding query-key similarity into local synaptic weights through STDP, entirely eschewing non-biological operations such as dot-products and softmax normalization.
Temporal Coding: Binary spike tensors $S_Q$ and $S_K$ encode queries and keys, respectively. Spike counts are mapped to firing latencies $t_Q$ and $t_K$, enabling temporal encoding of tokens. The pairwise temporal differences $\Delta t$ between key and query latencies serve as the input to the STDP kernel.
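A minimal sketch of this latency coding, assuming the convention that a larger spike count corresponds to an earlier latency and that per-token latencies are obtained by averaging over the embedding dimension (both assumptions, not details taken from the paper):

```python
import torch

def spike_counts_to_latencies(spikes):
    """spikes: binary tensor of shape [T, B, N, D]; returns latencies of shape [B, N, D].
    Assumed convention: more spikes -> earlier firing, i.e. latency = T - spike count."""
    T = spikes.shape[0]
    counts = spikes.sum(dim=0)
    return T - counts


def temporal_differences(t_query, t_key):
    """Pairwise latency differences Delta t between keys and queries.

    t_query, t_key: [B, N, D]; returns [B, N_query, N_key]."""
    tq = t_query.mean(dim=-1)                  # one latency per query token
    tk = t_key.mean(dim=-1)                    # one latency per key token
    return tk.unsqueeze(1) - tq.unsqueeze(2)   # dt[b, i, j] = t_key[j] - t_query[i]
```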
STDP Kernel: Synaptic modification follows an asymmetric exponential rule:

$$
\Delta W(\Delta t) =
\begin{cases}
A_{+}\, e^{-\Delta t / \tau_{+}}, & \Delta t \ge 0,\\
-A_{-}\, e^{\Delta t / \tau_{-}}, & \Delta t < 0,
\end{cases}
$$

with potentiation ($A_{+}$) and depression ($A_{-}$) coefficients and time constants $\tau_{+}, \tau_{-}$.
Attention Weighting: An additive offset ensures positivity of the resulting attention weights, so no explicit softmax or dot-product is required. The attended values are aggregated with addition-only operations, supporting efficient in-memory computation.
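Putting the kernel and the weighting together, a minimal sketch (with illustrative constants and a per-row offset chosen to guarantee non-negative weights, both assumptions rather than the paper's exact scheme):

```python
import torch

def stdp_kernel(dt, a_plus=1.0, a_minus=1.0, tau_plus=2.0, tau_minus=2.0):
    """Asymmetric exponential STDP rule: potentiation for dt >= 0, depression otherwise."""
    potentiation = a_plus * torch.exp(-dt / tau_plus)
    depression = -a_minus * torch.exp(dt / tau_minus)
    return torch.where(dt >= 0, potentiation, depression)


def stdp_attention(dt, values):
    """dt: [B, N, N] pairwise latency differences; values: [B, N, D] value activations."""
    w = stdp_kernel(dt)                                    # signed STDP weights
    offset = (-w.amin(dim=-1, keepdim=True)).clamp(min=0)  # additive offset -> non-negative weights
    attn = w + offset                                      # no softmax, no dot products
    # With binary value spikes, this weighted sum reduces to accumulate-only operations.
    return torch.bmm(attn, values)                         # [B, N, D]
```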
3. Input Representation and Temporal Dynamics
Input images are temporally broadcast across the simulation timesteps, with SPS performing patch-wise spiking convolutional processing. Tokens are produced by a sequence of convolutional stages, and multi-timestep inference propagates membrane potentials through the attention and MLP blocks at each simulation step. Global pooling aggregates the outputs for final classification through a fully connected layer and softmax.
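A forward-pass skeleton matching this description; the module objects (`sps`, `encoder_layers`, `classifier`) are placeholders, and pooling over both time and tokens is an assumption:

```python
import torch

def forward_pipeline(images, sps, encoder_layers, classifier, T=4):
    """images: [B, C, H, W] static inputs; returns class logits of shape [B, num_classes]."""
    x = images.unsqueeze(0).repeat(T, 1, 1, 1, 1)   # temporal broadcast -> [T, B, C, H, W]
    u = sps(x)                                      # Spiking Patch Splitting -> [T, B, N, D]
    for layer in encoder_layers:                    # STDPSA + spiking MLP blocks per timestep
        u = layer(u)
    pooled = u.mean(dim=(0, 2))                     # global pooling over time and tokens -> [B, D]
    return classifier(pooled)                       # fully connected head; softmax applied in the loss
```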
4. Training Protocol and Optimization
STDP modulates attention in the forward pass but does not serve as a learning rule for projection weights. All learnable parameters (SPS conv weights, MLP weights, scaling factors, thresholds, and offsets) are optimized end-to-end with backpropagation.
Key training parameters:
- Batch size: 64
- Timesteps: $T = 4$
- Learning rate: cosine/step decay schedule
- Epochs: 200
- STDP kernel: potentiation/depression coefficients ($A_{+}$, $A_{-}$) and time constants ($\tau_{+}$, $\tau_{-}$)
- Weight decay regularization
- Optimizer: Adam or SGD with momentum
Loss function: cross-entropy on the final classifier output. Surrogate gradient techniques (e.g., a piecewise-linear surrogate for the Heaviside step) provide gradient flow through the spiking activations.
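A minimal training-loop sketch consistent with the listed settings; the base learning rate and weight decay values are placeholders (the paper's exact values are not reproduced here), and the SGD-with-momentum alternative is omitted for brevity:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=200, base_lr=1e-3, weight_decay=0.0, device="cuda"):
    """Supervised training with Adam, cosine decay, and cross-entropy loss.
    base_lr and weight_decay are placeholders, not values taken from the paper."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:           # batch size 64 set in the DataLoader
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                    # surrogate gradients flow through spikes
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```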
5. Empirical Performance and Energy Efficiency
On CIFAR benchmarks, S²TDPT achieves:
- 94.35% top-1 accuracy (CIFAR-10)
- 78.08% top-1 accuracy (CIFAR-100)
Energy consumption for a four-timestep CIFAR-100 inference is quantified by the per-operation model
$$
E_{\mathrm{total}} = E_{\mathrm{MAC}} \cdot N_{\mathrm{MAC}} + E_{\mathrm{AC}} \cdot N_{\mathrm{AC}},
$$
with $E_{\mathrm{MAC}} = 4.6\,\mathrm{pJ}$ and $E_{\mathrm{AC}} = 0.9\,\mathrm{pJ}$ (45 nm CMOS), totaling 0.49 mJ per inference.
Relative efficiency (an arithmetic check appears after this list):
- 88.47% energy reduction versus a standard ANN Transformer (4.25 mJ)
- 37.97% reduction versus Spikformer (0.79 mJ)
- Hardware alignment: neuromorphic crossbars (memristors, phase-change, 2D materials) supporting local STDP
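A short check of the reported figures using the per-operation energy model above; the operation counts `n_mac` and `n_ac` are hypothetical placeholders, while the per-inference totals are those reported in the paper:

```python
# Per-operation energies at 45 nm CMOS (as used in the energy model above).
E_MAC = 4.6e-12   # J per 32-bit multiply-accumulate
E_AC = 0.9e-12    # J per accumulate (addition-only synaptic operation)

def inference_energy(n_mac, n_ac):
    """E_total = E_MAC * N_MAC + E_AC * N_AC (operation counts are model-specific)."""
    return E_MAC * n_mac + E_AC * n_ac

# Relative reductions implied by the reported per-inference totals (in mJ).
e_s2tdpt, e_ann, e_spikformer = 0.49, 4.25, 0.79
print(f"vs ANN Transformer: {100 * (1 - e_s2tdpt / e_ann):.2f}% reduction")         # ~88.47%
print(f"vs Spikformer:      {100 * (1 - e_s2tdpt / e_spikformer):.2f}% reduction")  # ~37.97%
```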
6. Model Interpretability via Spiking Grad-CAM
S²TDPT supports transparent attention analysis using an adapted (spiking) Grad-CAM, sketched in code at the end of this section:
- Gradients of the class score with respect to the final block's membrane potentials are pooled into channel weights
- The resulting heatmap overlays semantically relevant image regions
- Spike Firing Rate Map (SFR): spikes averaged over heads, timesteps, and layers highlight focal attention patterns
Observations indicate that activation maps align with object boundaries (car body, dog's head), and high SFR correlates with these features, substantiating interpretable, object-centric attention emergence in the STDP-driven architecture.
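A minimal spiking Grad-CAM sketch following the description above; the pooling dimensions, bilinear upsampling, and the reshape from tokens to a square grid are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def spiking_grad_cam(class_score, membrane, grid_hw, image_hw):
    """class_score: scalar logit of the target class (part of the autograd graph).
    membrane: final-block membrane potentials, shape [T, B, N, D]; call
    membrane.retain_grad() before invoking backward. Returns [B, H, W] heatmaps."""
    class_score.backward(retain_graph=True)
    grads = membrane.grad                                 # [T, B, N, D]
    weights = grads.mean(dim=(0, 2), keepdim=True)        # pooled channel weights
    cam = F.relu((weights * membrane).sum(dim=-1))        # channel-weighted sum -> [T, B, N]
    cam = cam.mean(dim=0)                                 # average over timesteps -> [B, N]
    h, w = grid_hw
    cam = cam.reshape(-1, 1, h, w)                        # token sequence -> spatial grid
    cam = F.interpolate(cam, size=image_hw, mode="bilinear", align_corners=False).squeeze(1)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)   # normalize to [0, 1]


def spike_firing_rate_map(attn_spikes):
    """attn_spikes: [T, B, heads, N] binary spikes -> [B, N] firing rates, averaged
    over timesteps and heads (per-layer averaging stacks such maps)."""
    return attn_spikes.float().mean(dim=(0, 2))
```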
7. Comparison with Standard ANN Transformers
Architectural Distinctions: S²TDPT's attention weights are computed via local spike timing (binary spikes processed by the STDP kernel), entirely in memory, with no floating-point dot-products or intermediate attention-matrix storage typical of ANN transformers.
Computational Complexity:
- ANN Transformers: $\mathcal{O}(N^2 d)$ MACs per attention head, plus an explicit softmax
- S²TDPT: addition-only accumulation with lookup-based exponentials (or a crossbar implementation); softmax is eliminated
Memory and Hardware: ANN attention matrices impose quadratic memory traffic and the von Neumann bottleneck. S²TDPT stores synaptic weights locally, drastically reducing off-chip bandwidth.
Energy Profile: S²TDPT replaces high-power 32-bit MACs (4.6 pJ) with 1-bit ACs (0.9 pJ), mapping efficiently to crossbar hardware with event-driven, addition-only operations.
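A back-of-the-envelope per-head comparison under the per-operation figures above; the token count, head dimension, timestep count, and especially the spiking operation count are hypothetical assumptions, not the paper's accounting:

```python
E_MAC = 4.6e-12   # J, 32-bit multiply-accumulate (45 nm)
E_AC = 0.9e-12    # J, 1-bit accumulate (45 nm)

N, d, T = 64, 384, 4               # hypothetical token count, head dimension, timesteps

# ANN attention per head: QK^T and attn @ V each cost roughly N^2 * d MACs.
ann_energy = (2 * N * N * d) * E_MAC

# Spiking STDP attention: event-driven accumulates over T timesteps
# (rough upper bound assuming every spike triggers one accumulate).
snn_energy = (T * N * N * d) * E_AC

print(f"ANN attention head:      {ann_energy * 1e6:.2f} uJ")
print(f"Spiking STDP head (<=):  {snn_energy * 1e6:.2f} uJ")
```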
S²TDPT concretely demonstrates that biologically inspired spiking attention can address the scaling, efficiency, and explainability challenges inherent to transformer models in neuromorphic contexts (Mondal et al., 18 Nov 2025).