Accurate Addition-Only Spiking Self-Attention
- Accurate Addition-Only Spiking Self-Attention (A²OS²A) is a neural mechanism that replaces resource-intensive multiplications with addition-only spiking operations to enhance energy and memory efficiency.
- It leverages a hybrid spiking scheme with binary, ternary, and full-precision representations to maintain competitive accuracy while simplifying the computation.
- The approach reduces computational complexity from quadratic to linear and has been empirically validated in graph, vision, and language applications on neuromorphic hardware.
Accurate Addition-Only Spiking Self-Attention (A²OS²A) is a neural attention mechanism that eliminates all multiplicative operations from self-attention in Transformer and Graph Transformer architectures, leveraging event-driven spiking computations for energy and memory efficiency. By replacing conventional dot-product and softmax-based attention with addition-only operations on (hybrid) spiking activations, A²OS²A enables integration into spiking neural networks (SNNs) and supports scalable deployment on neuromorphic hardware. Two recent independent lines of work, "SpikeGraphormer: A High-Performance Graph Transformer with Spiking Graph Attention" (Sun et al., 2024) and "Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer" (Guo et al., 28 Feb 2025), have formalized and demonstrated A²OS²A in graph and vision/language contexts.
1. Motivation and Foundations
Self-attention is the core computational primitive of the Transformer architecture, but it is dominated by energy- and memory-intensive floating-point operations (matrix multiplications and softmax normalization). In large-scale graphs or high-resolution vision problems, standard self-attention scales quadratically in both time and space with the number of tokens or nodes $N$, and is poorly suited to event-driven hardware. A²OS²A reimagines self-attention: all matrix multiply–accumulate and softmax operations are replaced with sparse, addition-only interactions between spiking activations, primarily binary or ternary, produced by Leaky Integrate-and-Fire (LIF) neurons.
A²OS²A is motivated by the need to:
- Reduce computational and energy complexity from quadratic to linear in sequence or node count
- Eliminate hardware-intensive multipliers and exponential functions
- Retain competitive representational power and accuracy via hybrid precision (binary/ternary/real) representations
This approach enables practical, large-scale graph and sequence processing on SNNs and is highly compatible with neuromorphic hardware that excels at logic and addition-based operations (Sun et al., 2024, Guo et al., 28 Feb 2025).
2. Spiking Encoding and Hybrid Neuron Representation
A²OS²A replaces the conventional floating-point Q/K/V projections with spike-encoded features, combining binary, ternary, and full-precision non-negative representations. With $X$ the layer input and $W_Q$, $W_K$, $W_V$ the projection weights:
- Binary Q ("query"): $Q = \mathrm{SN}^{b}(X W_Q)$; outputs $\{0, 1\}$ via a binary LIF neuron.
- Full-precision non-negative K ("key"): $K = \mathrm{ReLU}(X W_K)$; full-precision via ReLU.
- Ternary V ("value"): $V = \mathrm{SN}^{t}(X W_V)$; outputs $\{-1, 0, 1\}$ via a ternary LIF neuron.
Here, $\mathrm{SN}^{b}$ (binary LIF) and $\mathrm{SN}^{t}$ (ternary LIF) denote spiking neuron dynamics with discrete output sets, Heaviside thresholding in the forward pass, and surrogate gradients during backpropagation for optimization stability. This hybrid scheme mitigates the loss of representational entropy otherwise incurred by pure binary SNN attention: a full-precision tensor carries far more information per element than a binary one, and ternarizing $V$ while retaining ReLU for $K$ recovers much of this capacity loss (Guo et al., 28 Feb 2025).
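The forward-pass behavior of the two neuron types can be illustrated with a minimal sketch. The function name `lif_spikes`, the decay factor, the threshold, and the hard reset are illustrative assumptions, not details taken from either paper; surrogate gradients are not modeled here.

```python
def lif_spikes(inputs, threshold=1.0, decay=0.5, ternary=False):
    """Run one LIF neuron over a sequence of input currents.

    Returns one spike per timestep: values in {0, 1} for the binary
    neuron, values in {-1, 0, 1} for the ternary neuron.
    All hyperparameters are assumed for illustration.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leaky integration: decay the old potential, then add the input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spike = 1                      # Heaviside-style positive spike
        elif ternary and membrane <= -threshold:
            spike = -1                     # negative spike, ternary neuron only
        else:
            spike = 0
        if spike != 0:
            membrane = 0.0                 # hard reset after firing (assumed)
        spikes.append(spike)
    return spikes


currents = [0.8, 0.8, -1.5, -0.4, 1.5]
binary_out = lif_spikes(currents)                 # [0, 1, 0, 0, 0]
ternary_out = lif_spikes(currents, ternary=True)  # [0, 1, -1, 0, 1]
```

On the same input, the ternary neuron also signals strongly negative membrane potentials, which the binary neuron silently discards. This is one intuition for why a ternary value path retains more information.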
3. Addition-Only Spiking Self-Attention: Algorithmic Details
The A²OS²A mechanism replaces the attention computation
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$
with addition-only operations, eliminating all multiplies, exponentials, and divisions.
3.1 Matrix-Free Spiking Attention
For graph attention (Sun et al., 2024):
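- Encoding: Project features into spike-encoded $Q$, $K$, $V$ via LIF neurons over $T$ timesteps.
- Mask construction: For each node pair $(i, j)$ and channel $c$, a binary mask bit is computed from the query and key spikes; the bits are then aggregated into the attention mask.
- Masked summation: The attention output for query node $i$ sums the value spikes selected by its mask.
Only integer additions, binary ANDs, and thresholding are used.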
For sequence attention (Guo et al., 28 Feb 2025):
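- Compute $QK^\top$: For binary $Q$ and non-negative $K$,
$$(QK^\top)_{ij} = \sum_{c} Q_{ic} K_{jc} = \sum_{c \,:\, Q_{ic} = 1} K_{jc}.$$
This reduces to addition of selected rows of $K^\top$.
- Weighted value sum: $(QK^\top)V$; summation over ternary $V \in \{-1, 0, 1\}$, again addition-only.
- Spike output: Output passed through a spiking neuron: $\mathrm{SN}(QK^\top V)$.
Softmax and the $1/\sqrt{d}$ scaling are eliminated, as $QK^\top$ is always non-negative and bounded, given $Q \in \{0, 1\}$ and $K \ge 0$ (Guo et al., 28 Feb 2025).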
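The sequence-attention steps can be sketched in plain Python to show that no multiplication is needed once the queries are binary and the values ternary. The helper names and toy matrices are hypothetical, and the final spiking neuron is omitted. The multiply-based `matmul` exists only to check the result.

```python
def addition_only_scores(Q, K):
    """S = Q K^T for binary Q: add K[j][c] wherever Q[i][c] == 1."""
    scores = []
    for q_row in Q:
        row = []
        for k_row in K:
            total = 0.0
            for q_bit, k_val in zip(q_row, k_row):
                if q_bit:              # a spike selects the entry; no multiply
                    total += k_val
            row.append(total)
        scores.append(row)
    return scores


def addition_only_values(S, V):
    """O = S V for ternary V: add where V == 1, subtract where V == -1."""
    out = []
    for s_row in S:
        row = [0.0] * len(V[0])
        for s_val, v_row in zip(s_row, V):
            for c, v in enumerate(v_row):
                if v == 1:
                    row[c] += s_val
                elif v == -1:
                    row[c] -= s_val
        out.append(row)
    return out


def matmul(A, B):
    """Reference multiply-based product, used only for verification."""
    return [[sum(a * b for a, b in zip(a_row, b_col)) for b_col in zip(*B)]
            for a_row in A]


Q = [[1, 0, 1], [0, 1, 0]]              # binary queries (2 tokens, d = 3)
K = [[0.5, 1.0, 0.0], [2.0, 0.0, 1.5]]  # non-negative keys (after ReLU)
V = [[1, -1, 0], [0, 1, -1]]            # ternary values

S = addition_only_scores(Q, K)          # [[0.5, 3.5], [1.0, 0.0]]
O = addition_only_values(S, V)          # [[0.5, 3.0, -3.5], [1.0, -1.0, 0.0]]

K_T = [list(col) for col in zip(*K)]
reference = matmul(matmul(Q, K_T), V)
```

Here `O` matches `reference` exactly, and every entry of `S` is non-negative, consistent with the argument for dropping softmax.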
3.2 Graph Sparsity and Adjacency
For Graph Transformers, the binary attention mask is further sparsified by the adjacency matrix $A$: mask bits are zeroed for non-adjacent node pairs, ensuring only $O(|E|)$ nonzero elements per channel when $|E| \ll N^2$ in sparse graphs.
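A rough sketch of adjacency-restricted masked summation follows. The exact mask and aggregation formulas of Sun et al. are not reproduced here. The mask rule (query bit AND key bit), the per-channel integer count, the output threshold, and the name `masked_spike_attention` are all illustrative assumptions.

```python
def masked_spike_attention(Q, K, V, adjacency, threshold=1):
    """Adjacency-restricted spiking attention for a single timestep.

    Q, K, V: binary spike matrices indexed as [node][channel].
    adjacency: dict mapping each node to its list of neighbors.
    The mask rule and thresholded output are assumptions for illustration.
    """
    n_channels = len(Q[0])
    out = []
    for i, q_row in enumerate(Q):
        counts = [0] * n_channels
        for j in adjacency[i]:                    # non-adjacent pairs are never visited
            for c in range(n_channels):
                mask_bit = q_row[c] & K[j][c]     # binary AND, no multiply
                counts[c] += mask_bit & V[j][c]   # integer accumulation
        # Threshold the integer counts back into spikes.
        out.append([1 if count >= threshold else 0 for count in counts])
    return out


Q = [[1, 1], [1, 0], [0, 1]]
K = [[1, 0], [1, 1], [0, 1]]
V = [[1, 1], [0, 1], [1, 1]]
adjacency = {0: [1, 2], 1: [0], 2: [0, 1]}

spikes = masked_spike_attention(Q, K, V, adjacency)  # [[0, 1], [1, 0], [0, 1]]
```

Because the inner loops run only over listed neighbors, the work per channel grows with the number of edges, not with the number of node pairs.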
4. Computational Complexity and Energy Efficiency
A²OS²A achieves significant improvements in both computational and memory efficiency:
- Time complexity: For both graph and sequence, linear in $N$ per layer, versus quadratic in $N$ for conventional attention.
- Space complexity: Also sub-quadratic in $N$, for both graphs and sequences.
- Operational cost: All operations reduce to integer adds and bitwise AND, which on neuromorphic hardware consume substantially less energy than multiply-accumulate (MAC) operations; aggregate energy reduction is 10–200× per layer (Sun et al., 2024, Guo et al., 28 Feb 2025).
This approach enables all-pair interactions in large-scale settings with limited hardware resources and supports deployment on architectures such as Intel Loihi and IBM TrueNorth.
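One way linear scaling in $N$ can arise without multiplications is by reordering the product. With no softmax between the two matrix products, $QK^\top V$ can be evaluated as $Q(K^\top V)$, so the $N \times N$ score matrix is never formed. The sketch below assumes this evaluation order for illustration and is not a confirmed implementation detail of either paper. The function names and toy data are hypothetical.

```python
def kt_v(K, V):
    """M = K^T V (d x d) for ternary V, using only adds and subtracts."""
    M = [[0.0] * len(V[0]) for _ in range(len(K[0]))]
    adds = 0
    for k_row, v_row in zip(K, V):            # a single pass over the N tokens
        for a, k_val in enumerate(k_row):
            for b, v in enumerate(v_row):
                if v == 1:
                    M[a][b] += k_val
                    adds += 1
                elif v == -1:
                    M[a][b] -= k_val
                    adds += 1
    return M, adds


def q_m(Q, M):
    """O = Q M for binary Q: add the rows of M selected by each spike."""
    out, adds = [], 0
    for q_row in Q:
        row = [0.0] * len(M[0])
        for q_bit, m_row in zip(q_row, M):
            if q_bit:
                for b, m_val in enumerate(m_row):
                    row[b] += m_val
                    adds += 1
        out.append(row)
    return out, adds


Q = [[1, 0, 1], [0, 1, 0]]              # binary queries (N = 2, d = 3)
K = [[0.5, 1.0, 0.0], [2.0, 0.0, 1.5]]  # non-negative keys
V = [[1, -1, 0], [0, 1, -1]]            # ternary values

M, kv_adds = kt_v(K, V)                 # d x d intermediate, independent of N
O, q_adds = q_m(Q, M)                   # [[0.5, 3.0, -3.5], [1.0, -1.0, 0.0]]

n_tokens, d = len(Q), len(Q[0])
upper_bound = 2 * n_tokens * d * d      # worst-case additions, linear in N
```

Both passes touch each token once, so the addition count is bounded by $2Nd^2$, and spike sparsity in $Q$ and $V$ lowers it further (here `kv_adds + q_adds` is 21 against a bound of 36).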
5. Integration with Transformer and Graph Transformer Architectures
5.1 Spiking Graphormer Dual-Branch Design
SpikeGraphormer (Sun et al., 2024) incorporates A²OS²A as “Spiking Graph Attention” (SGA) in a dual-branch architecture:
- Global branch: SGA-driven Transformer layers enable all-pair node interactions using spike-based self-attention.
- Local branch: A lightweight sparse GNN (e.g., GCN) captures fine-grained neighborhood structure.
- Fusion: At each layer, the outputs of the global and local branches are combined into a single node representation.
5.2 Spiking Transformer Encoder
Guo et al. (28 Feb 2025) apply A²OS²A within each encoder block of a vision transformer:
- Spiking patch splitting provides spike-encoded local features and positional embeddings.
- Encoder blocks alternate A²OS²A attention, ReLU-free MLPs, and binary/ternary spike processing, with residual pre-activation and global average pooling for classification.
6. Empirical Performance and Ablative Analysis
6.1 Graph and Sequence Classification
Key empirical results for A²OS²A-based models:
- OGB-Proteins: SpikeGraphormer achieves 79.62% ROC-AUC (vs. Nodeformer at 77.45%), train memory 3.7 GB (Sun et al., 2024).
- Amazon2M: 88.12% test accuracy (vs. Nodeformer at 87.85%), with large-batch, full-graph inference feasible on CPU.
- CIFAR-10/100: Spiking Transformer with A²OS²A achieves 94.91% (CIFAR-10) and 76.96% (CIFAR-100), surpassing spike-driven transformer baselines of equivalent size (Guo et al., 28 Feb 2025).
- ImageNet-1K: Spiking Transformer-10-512 (A²OS²A) attains 78.66% accuracy with only 4 timesteps and 36 M parameters.
6.2 Efficiency Gains
- On Cora, SpikeGraphormer reduces per-epoch training/inference times and cuts GPU memory by 10–20× (e.g., 93 MB vs. 239 MB for Nodeformer) (Sun et al., 2024).
- Spiking Transformer reduces per-layer energy by an order of magnitude compared to SNN-Transformer hybrids that retain dot-products (Guo et al., 28 Feb 2025).
6.3 Ablation and Information Capacity
- Fully binarized SNN attention loses most representational power (entropy); the hybrid binary/ReLU/ternary design recovers accuracy competitive with vanilla attention.
- In ablations, re-introducing softmax or scaling leads to overfitting and loss of energy efficiency.
7. Broader Significance and Outlook
A²OS²A establishes a rigorous framework for enabling spiking self-attention at scale, reconciling the expressivity and flexibility of Transformer models with the operational and energy benefits of SNNs. The hybrid (binary/ReLU/ternary) encoding scheme is shown to be essential for practical accuracy. Deployments in cross-domain contexts (graph, image, text) indicate versatility. The architecture is poised to benefit deployments where memory or energy constraints are paramount, particularly in neuromorphic, edge, or battery-powered computing environments.
A²OS²A appears as SGA (“Spiking Graph Attention”) within SpikeGraphormer (Sun et al., 2024) and powers the self-attention sub-blocks of Spiking Transformer (Guo et al., 28 Feb 2025). Empirical results consistently demonstrate state-of-the-art SNN-Transformer accuracy with 10–200× reductions in energy and memory cost relative to conventional self-attention.
Key References:
- "SpikeGraphormer: A High-Performance Graph Transformer with Spiking Graph Attention" (Sun et al., 2024)
- "Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer" (Guo et al., 28 Feb 2025)