Spikformer: Integrating Spiking Neural Networks with Transformers
The paper presents "Spikformer," a novel architecture that integrates Spiking Neural Networks (SNNs) with the Transformer by adapting the self-attention mechanism to spike-based computation. This integration addresses both the computational cost of vanilla self-attention and the constraints that spike-form (binary, sparse) representations impose on SNNs.
Overview of Spikformer
Spikformer is motivated by the complementary strengths of SNNs and Transformers: SNNs are renowned for their energy efficiency and event-driven computation, while Transformers excel at capturing complex feature dependencies through self-attention. The authors propose a Spiking Self-Attention (SSA) mechanism that replaces vanilla self-attention (VSA) and is tailored to spike-form data by discarding softmax. Because the Query, Key, and Value are binary spike tensors, SSA's matrix operations reduce to logical AND plus accumulation, avoiding multiplications and suiting the sparse, binary nature of spike computation.
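To make SSA concrete, the following is a minimal, single-timestep PyTorch sketch of the formulation SSA(Q, K, V) = SN(Q K^T V * s). It is a simplified reading, not the authors' implementation: a plain Heaviside threshold stands in for the spiking neuron SN (training would need surrogate gradients, omitted here), the scale s = 0.125 is an illustrative value, and the timestep and multi-head dimensions are dropped.

```python
import torch
import torch.nn as nn


def spike_fn(x: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Stand-in spiking neuron: emits a binary spike wherever input crosses threshold."""
    return (x >= threshold).float()


class SpikingSelfAttention(nn.Module):
    """Single-head SSA sketch: spike-form Q, K, V, no softmax, scaled spike-count attention."""

    def __init__(self, dim: int, scale: float = 0.125):
        super().__init__()
        # Each projection follows a Linear -> BatchNorm -> spiking-neuron pattern,
        # so Q, K, V come out as binary spike tensors rather than real-valued ones.
        self.q_proj = nn.Sequential(nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim))
        self.k_proj = nn.Sequential(nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim))
        self.v_proj = nn.Sequential(nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim))
        self.scale = scale  # fixed scaling stands in for softmax normalization

    def _spike_proj(self, proj: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); BatchNorm1d expects channels in dim 1.
        linear, bn = proj
        return spike_fn(bn(linear(x).transpose(1, 2)).transpose(1, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self._spike_proj(self.q_proj, x)
        k = self._spike_proj(self.k_proj, x)
        v = self._spike_proj(self.v_proj, x)
        # No softmax: since q, k, v are binary, these matmuls only count coincident
        # spikes -- logical AND plus accumulation, with no real multiplications.
        attn = q @ k.transpose(-2, -1) * self.scale  # (batch, tokens, tokens)
        return spike_fn(attn @ v)                    # spike-form output


x = torch.rand(2, 16, 64)                 # (batch, tokens, embedding_dim)
print(SpikingSelfAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```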
Technical Contributions
- Spiking Self-Attention (SSA): SSA forms spike-based Query, Key, and Value tensors and drops softmax, so attention is computed through sparse, addition-based operations. The mechanism is designed around the operational constraints of SNNs, preserving both computational efficiency and biological plausibility.
- Spikformer Architecture: The architecture comprises Spiking Patch Splitting (SPS), which transforms input images into spike-form patch embeddings, a stack of Spikformer encoder blocks built around SSA, and a linear classification head. This combination of SNN energy efficiency and Transformer feature modeling adapts well to both static and neuromorphic datasets (a skeletal sketch of the pipeline follows this list).
- Performance Demonstration: Experiments show Spikformer outperforming contemporary SNN models on image classification benchmarks including ImageNet, CIFAR-10, and CIFAR-100. On ImageNet, Spikformer reaches a top-1 accuracy of 74.81% with low theoretical energy consumption, demonstrating that a directly trained SNN (with no ANN-to-SNN conversion) can produce state-of-the-art results.
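Building on the SSA sketch above (and reusing its `SpikingSelfAttention` and `spike_fn`), here is a skeletal end-to-end pipeline under the same simplifications. The SPS stage is collapsed to a single Conv -> BatchNorm -> spike stage (the paper uses a deeper convolutional stack), the MLP spikes only at its output, and the dimensions, depth, and class count are illustrative placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn


class SpikformerBlock(nn.Module):
    """Encoder block: spiking self-attention plus a spiking MLP, each with a residual."""

    def __init__(self, dim: int):
        super().__init__()
        self.ssa = SpikingSelfAttention(dim)  # from the sketch above
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssa(x)               # residual around SSA
        return x + spike_fn(self.mlp(x))  # residual around the (simplified) spiking MLP


class Spikformer(nn.Module):
    def __init__(self, dim: int = 64, depth: int = 2, num_classes: int = 10):
        super().__init__()
        # Spiking Patch Splitting (SPS), reduced here to one conv stage that
        # turns an image into spike-form patch embeddings.
        self.sps = nn.Sequential(nn.Conv2d(3, dim, kernel_size=4, stride=4),
                                 nn.BatchNorm2d(dim))
        self.blocks = nn.Sequential(*[SpikformerBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)  # linear classification head

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        patches = spike_fn(self.sps(img))            # (batch, dim, H/4, W/4)
        tokens = patches.flatten(2).transpose(1, 2)  # (batch, tokens, dim)
        tokens = self.blocks(tokens)
        return self.head(tokens.mean(dim=1))         # pool tokens, then classify


logits = Spikformer()(torch.rand(2, 3, 32, 32))  # CIFAR-sized dummy input
print(logits.shape)                              # torch.Size([2, 10])
```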
Implications and Future Directions
The development of Spikformer opens avenues for deploying scalable, efficient, and accurate neural networks that blend the energy-efficient, event-driven computation of SNNs with the feature-dependency modeling of Transformers. Practically, this points to deployments where power consumption is a hard constraint, such as edge computing and autonomous systems.
Theoretically, Spikformer suggests a new direction for hybrid neural architectures that leverage the strengths of divergent paradigms: SNNs for efficient event-driven processing and Transformers for robust feature extraction. This work may prompt further exploration into optimizing other ANN components for compatibility with SNN characteristics, and vice versa.
The authors do not exhaustively discuss Spikformer's transferability to tasks beyond image classification, so future research could extend it to domains such as natural language processing or more complex video-based tasks. Additionally, since energy efficiency is Spikformer's central advantage, research should target integrating such architectures with hardware accelerators optimized for spiking computation to realize real-world low-power deployments.
Overall, Spikformer represents a significant advance in bridging the computational efficiency of SNNs with the representational power of Transformers, setting the stage for further developments in hybrid neural architectures.