RACE-IT: A Reconfigurable Analog CAM-Crossbar Engine for In-Memory Transformer Acceleration (2312.06532v1)

Published 29 Nov 2023 in cs.AR, cs.ET, and cs.LG

Abstract: Transformer models represent the cutting edge of Deep Neural Networks (DNNs) and excel in a wide range of machine learning tasks. However, processing these models demands significant computational resources and results in a substantial memory footprint. While In-memory Computing (IMC) offers promise for accelerating Matrix-Vector Multiplications (MVMs) with high computational parallelism and minimal data movement, employing it for implementing other crucial operators within DNNs remains a formidable task. This challenge is exacerbated by the extensive use of Softmax and data-dependent matrix multiplications within the attention mechanism. Furthermore, existing IMC designs encounter difficulties in fully harnessing the benefits of analog MVM acceleration due to the area- and energy-intensive nature of Analog-to-Digital Converters (ADCs). To tackle these challenges, we introduce a novel Compute Analog Content Addressable Memory (Compute-ACAM) structure capable of performing various non-MVM operations within Transformers. Together with the crossbar structure, our proposed RACE-IT accelerator enables efficient execution of all operations within Transformer models in the analog domain. Given the flexibility of our proposed Compute-ACAMs to perform arbitrary operations, RACE-IT exhibits adaptability to diverse non-traditional and future DNN architectures without necessitating hardware modifications. Leveraging the capability of Compute-ACAMs to process analog input and produce digital output, we also replace ADCs, thereby reducing the overall area and energy costs. Evaluated on various Transformer models against state-of-the-art GPUs and existing IMC accelerators, RACE-IT increases performance by 10.7x and 5.9x, and reduces energy by 1193x and 3.9x, respectively.
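To make concrete why the attention mechanism is hard for crossbar-only IMC, the sketch below (a minimal NumPy illustration of standard single-head attention, not the authors' implementation; all function names are ours) separates the static-weight MVMs that map naturally onto analog crossbars from the data-dependent matrix multiplications and Softmax that motivate the Compute-ACAM:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: a non-MVM operator that cannot be
    # mapped to static crossbar conductances, since it is computed
    # over activations rather than fixed weights.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, Wq, Wk, Wv):
    # Static-weight MVMs: Wq, Wk, Wv are fixed after training, so
    # these projections suit analog crossbar acceleration.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Data-dependent matmul: both operands are activations, so
    # neither can be pre-programmed as crossbar weights.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])

    # Softmax over each row of the score matrix.
    A = softmax(scores, axis=-1)

    # Second data-dependent matmul.
    return A @ V

# Toy usage: a sequence of 4 tokens with model dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
print(single_head_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```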

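The claim that Compute-ACAMs can stand in for ADCs follows from how an analog CAM operates: it compares analog inputs against stored acceptable ranges and emits a digital match bit directly. Below is a behavioral sketch of that range matching, assuming one stored interval per cell as in memristor-based analog CAMs; it is our own illustrative model, not the RACE-IT circuit:

```python
import numpy as np

def acam_search(lo, hi, x):
    # Analog CAM behavioral model: each row stores an acceptable
    # interval [lo, hi] per input line. A row "matches" only if
    # every analog input falls inside its stored interval, so the
    # output is already digital (one match bit per row) and no
    # explicit analog-to-digital converter is required.
    inside = (x >= lo) & (x <= hi)   # (rows, inputs) boolean grid
    return inside.all(axis=1)        # one digital bit per row

# Toy usage: 3 rows, 2 analog input lines.
lo = np.array([[0.0, 0.2], [0.5, 0.0], [0.1, 0.1]])
hi = np.array([[0.4, 0.6], [1.0, 0.3], [0.9, 0.9]])
x = np.array([0.3, 0.25])
print(acam_search(lo, hi, x))  # [ True False  True]
```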
Authors (8)
  1. Lei Zhao (808 papers)
  2. Luca Buonanno (2 papers)
  3. Ron M. Roth (21 papers)
  4. Sergey Serebryakov (3 papers)
  5. Archit Gajjar (3 papers)
  6. John Moon (4 papers)
  7. Jim Ignowski (8 papers)
  8. Giacomo Pedretti (17 papers)
Citations (3)