
CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators (2401.12428v2)

Published 23 Jan 2024 in cs.AR and cs.CL

Abstract: In recent years, various computing-in-memory (CIM) processors have been presented, showing superior performance over traditional architectures. To unleash the potential of various CIM architectures, which differ in details such as device precision, crossbar size, and crossbar count, it is necessary to develop compilation tools that are fully aware of CIM architectural details and implementation diversity. However, due to the lack of architectural support in current popular open-source compilation stacks, existing CIM designs either deploy networks manually or build their own compilers, which is time-consuming and labor-intensive. Although some works expose specific CIM device programming interfaces to compilers, they are often bound to a fixed CIM architecture and lack the flexibility to support CIM architectures with different computing granularities. On the other hand, existing compilation works usually consider the scheduling of only a limited set of operation types (such as crossbar-bound matrix-vector multiplication). Unlike conventional processors, CIM accelerators are characterized by diverse architectures, circuits, and devices, which cannot be adequately abstracted at a single level if we seek to fully exploit the advantages brought by CIM. Therefore, we propose CIM-MLC, a universal multi-level compilation framework for general CIM architectures. We first establish a general hardware abstraction for CIM architectures and computing modes to represent various CIM accelerators. Based on the proposed abstraction, CIM-MLC can compile tasks onto a wide range of CIM accelerators having different devices, architectures, and programming interfaces. More importantly, compared with existing compilation work, CIM-MLC can explore mapping and scheduling strategies across multiple architectural tiers, which form a tractable yet effective design space, to achieve better scheduling and instruction generation results.
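
To make the multi-tier abstraction concrete, below is a minimal Python sketch, not the authors' implementation: it models a CIM accelerator as a chip/core/crossbar hierarchy and shows the kind of crossbar-granularity tiling decision a CIM-aware compiler must make when mapping a matrix-vector multiplication. All names here (Chip, Core, Crossbar, tile_matvec) are hypothetical illustrations, not CIM-MLC APIs.

```python
# Hypothetical sketch of a multi-level CIM hardware abstraction.
# Each tier (chip -> core -> crossbar) exposes a computing granularity
# that a compiler pass can target when mapping operations.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Crossbar:
    rows: int        # wordlines (input-vector dimension)
    cols: int        # bitlines (output dimension)
    cell_bits: int   # device precision per memory cell


@dataclass
class Core:
    crossbars: List[Crossbar] = field(default_factory=list)


@dataclass
class Chip:
    cores: List[Core] = field(default_factory=list)


def tile_matvec(weight_rows: int, weight_cols: int, xbar: Crossbar):
    """Split one matrix-vector multiply into crossbar-sized tiles.

    Returns (row_tiles, col_tiles): how many crossbar invocations are
    needed along each dimension -- the per-architecture mapping decision
    a CIM-aware compiler must make.
    """
    row_tiles = -(-weight_rows // xbar.rows)   # ceiling division
    col_tiles = -(-weight_cols // xbar.cols)
    return row_tiles, col_tiles


if __name__ == "__main__":
    # A toy chip: 2 cores, each with 4 crossbars of 128x128 2-bit cells.
    chip = Chip(cores=[Core(crossbars=[Crossbar(128, 128, 2)
                                       for _ in range(4)])
                       for _ in range(2)])
    xbar = chip.cores[0].crossbars[0]
    # Mapping a 512x1024 weight matrix onto this crossbar size:
    print(tile_matvec(512, 1024, xbar))  # -> (4, 8)
```

In this toy setting, a 512x1024 weight matrix on 128x128 crossbars needs a 4x8 grid of tiles; the point of a framework like CIM-MLC is to make such mapping and scheduling choices jointly across tiers (chip, core, crossbar) rather than at a single fixed granularity.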

Authors (7)
  1. Songyun Qu
  2. Shixin Zhao
  3. Bing Li
  4. Yintao He
  5. Xuyi Cai
  6. Lei Zhang
  7. Ying Wang
Citations (2)