
Full-Stack Optimization for CAM-Only DNN Inference (2401.12630v1)

Published 23 Jan 2024 in cs.AR, cs.ET, and cs.LG

Abstract: The accuracy of neural networks has improved greatly across various domains in recent years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. Additionally, for some CIM designs, the activation movement still requires considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing their arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.
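To illustrate why ternary weights reduce the arithmetic intensity of convolutions, the sketch below shows a multiplication-free 1D convolution with weights restricted to {-1, 0, +1}. This is only a minimal illustration of the general idea; the `ternarize` threshold rule and the function names are assumptions for the example and are not taken from the paper, which targets associative processors rather than NumPy.

```python
import numpy as np

def ternarize(weights, threshold_factor=0.05):
    # Map full-precision weights to {-1, 0, +1} using a simple magnitude
    # threshold (illustrative choice; the paper's quantization scheme may differ).
    t = threshold_factor * np.max(np.abs(weights))
    return np.where(weights > t, 1, np.where(weights < -t, -1, 0)).astype(np.int8)

def ternary_conv1d(x, w_ternary):
    # With weights in {-1, 0, +1}, each output element is just a signed sum of
    # activations -- no multiplications -- which is the kind of arithmetic-
    # intensity reduction that maps well onto add/subtract operations in an
    # associative processor.
    n, k = len(x), len(w_ternary)
    out = np.zeros(n - k + 1, dtype=x.dtype)
    for i in range(n - k + 1):
        window = x[i:i + k]
        out[i] = window[w_ternary == 1].sum() - window[w_ternary == -1].sum()
    return out

# Example: ternarize a random filter and apply it to an integer activation vector.
rng = np.random.default_rng(0)
w = ternarize(rng.normal(size=5))
x = rng.integers(-8, 8, size=16)
print(w, ternary_conv1d(x, w))
```

Replacing multiply-accumulate with conditional add/subtract is what lets the compilation flow lower convolutions onto bit-serial in-memory operations instead of moving activations to a separate compute unit.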
