SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks (2302.13939v5)
Abstract: As the size of large language models (LLMs) continues to scale, so do the computational resources required to run them. Spiking Neural Networks (SNNs) have emerged as an energy-efficient approach to deep learning that leverages sparse, event-driven activations to reduce the computational overhead of model inference. While SNNs have become competitive with non-spiking models on many computer vision tasks, they have also proven more challenging to train. As a result, their performance lags behind modern deep learning, and the effectiveness of SNNs in language generation has yet to be demonstrated. In this paper, inspired by the Receptance Weighted Key Value (RWKV) language model, we implement 'SpikeGPT', a generative language model with binary, event-driven spiking activation units. We train two variants of the proposed model, with 45M and 216M parameters. To the best of our knowledge, SpikeGPT is the largest backpropagation-trained SNN model to date, rendering it suitable for both the generation and comprehension of natural language. We achieve this by modifying the transformer block to replace multi-head self-attention, reducing the quadratic computational complexity O(N^2) to linear complexity O(N) in sequence length. Input tokens are instead streamed in sequentially to our attention mechanism (as with typical SNNs). Our preliminary experiments show that SpikeGPT remains competitive with non-spiking models on the tested benchmarks, while using 20x fewer operations when processed on neuromorphic hardware that can leverage sparse, event-driven activations. Our code implementation is available at https://github.com/ridgerchu/SpikeGPT.
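To make the two key ideas in the abstract concrete — a linear-complexity, token-by-token attention recurrence in place of quadratic self-attention, and binary spiking activations trained with a surrogate gradient — the sketch below illustrates how they can be combined in PyTorch. This is a minimal, illustrative sketch, not the authors' implementation: the class names (`SpikeFunction`, `LinearAttentionSpikingBlock`), the fixed decay constant, the surrogate-gradient shape, and the reset rule are assumptions chosen for clarity; SpikeGPT's actual RWKV-derived equations differ in their details.

```python
import torch
import torch.nn as nn


class SpikeFunction(torch.autograd.Function):
    """Heaviside spike in the forward pass, surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, mem, threshold=1.0):
        ctx.save_for_backward(mem)
        ctx.threshold = threshold
        return (mem >= threshold).float()          # binary, event-driven output

    @staticmethod
    def backward(ctx, grad_output):
        (mem,) = ctx.saved_tensors
        # Smooth pseudo-derivative around the firing threshold (assumed form).
        surrogate = 1.0 / (1.0 + torch.abs(mem - ctx.threshold)) ** 2
        return grad_output * surrogate, None


class LinearAttentionSpikingBlock(nn.Module):
    """Schematic O(N) attention: tokens are streamed sequentially into a
    decayed running sum (AFT/RWKV-style), and the result drives a
    leaky-integrate-and-fire unit that emits binary spikes."""

    def __init__(self, d_model, decay=0.9, threshold=1.0):
        super().__init__()
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.receptance = nn.Linear(d_model, d_model)
        self.decay = decay
        self.threshold = threshold

    def forward(self, x):
        # x: (batch, seq_len, d_model); each token updates O(1) state.
        B, T, D = x.shape
        num = torch.zeros(B, D, device=x.device)   # running sum of exp(k) * v
        den = torch.zeros(B, D, device=x.device)   # running sum of exp(k)
        mem = torch.zeros(B, D, device=x.device)   # membrane potential
        outputs = []
        for t in range(T):
            k = torch.exp(self.key(x[:, t]))
            v = self.value(x[:, t])
            num = self.decay * num + k * v
            den = self.decay * den + k
            r = torch.sigmoid(self.receptance(x[:, t]))
            mem = mem + r * (num / (den + 1e-8))   # integrate attention output
            spike = SpikeFunction.apply(mem, self.threshold)
            mem = mem * (1.0 - spike)              # reset neurons that fired
            outputs.append(spike)
        return torch.stack(outputs, dim=1)         # binary tensor (B, T, D)


if __name__ == "__main__":
    block = LinearAttentionSpikingBlock(d_model=64)
    y = block(torch.randn(2, 16, 64))
    print(y.shape, y.unique())                     # spikes are strictly 0/1
```

The point of the sketch is the cost profile: per-token state (`num`, `den`, `mem`) is constant in size, so processing a sequence is linear in its length, and downstream layers only ever see 0/1 activations, which is what allows event-driven hardware to skip the zero operations.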