
ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters (2402.10930v3)

Published 31 Jan 2024 in cs.AR, cs.AI, and cs.LG

Abstract: The self-attention mechanism sets transformer-based LLMs apart from convolutional and recurrent neural networks. Despite the performance improvement it brings, achieving real-time LLM inference on silicon remains challenging due to the extensive use of Softmax in self-attention. Beyond its non-linearity, Softmax's low arithmetic intensity significantly limits processing parallelism, especially with longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax. ConSmax uses differentiable normalization parameters to eliminate the maximum search and denominator summation required by Softmax, enabling extensive parallelization while still performing the essential function of Softmax. Moreover, a scalable ConSmax hardware design with a bitwidth-split look-up table (LUT) achieves lossless non-linear operation and supports mixed-precision computing. Experimental results show that ConSmax consumes only 0.2 mW of power and 0.0008 mm² of area at a 1250 MHz working frequency in 16 nm FinFET technology. As an open-source contribution, we further implement our design with the OpenROAD toolchain in SkyWater's 130 nm CMOS technology; the corresponding power is 2.69 mW and the area is 0.007 mm². ConSmax achieves 3.35x power savings and 2.75x area savings in 16 nm technology, and 3.15x power savings and 4.14x area savings with the open-source EDA toolchain, while maintaining comparable accuracy on the GPT-2 model and the WikiText103 dataset. The project is available at https://github.com/ReaLLMASIC/ConSmax
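
To make the normalization idea concrete, below is a minimal PyTorch sketch of what the abstract describes: the per-row maximum search and denominator summation of standard Softmax are replaced by learnable scalars. The parameter names (`beta`, `gamma`) and the exact parameterization are assumptions for illustration, not the authors' reference implementation, which is available in the linked repository. Because each score is normalized independently, no row-wise reduction is needed, which is what enables the parallel, LUT-friendly hardware mapping the abstract describes.

```python
# Minimal sketch of the ConSmax idea, assuming learnable scalars stand in for
# the per-row max (beta) and denominator sum (gamma) of standard Softmax.
# Illustrative only; see https://github.com/ReaLLMASIC/ConSmax for the
# authors' implementation.
import torch
import torch.nn as nn


class ConSmaxSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable normalization parameters (assumed names, trained end to end).
        self.beta = nn.Parameter(torch.zeros(1))   # replaces max-subtraction
        self.gamma = nn.Parameter(torch.ones(1))   # replaces the denominator sum

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # Each element is normalized independently, so there is no row-wise
        # reduction (max or sum) and the operation parallelizes element-wise.
        return torch.exp(scores - self.beta) / self.gamma


if __name__ == "__main__":
    attn_scores = torch.randn(2, 8, 16, 16)  # (batch, heads, seq, seq)
    consmax = ConSmaxSketch()
    weights = consmax(attn_scores)
    print(weights.shape)  # torch.Size([2, 8, 16, 16])
```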
