A New Dataflow Implementation to Improve Energy Efficiency of Monolithic 3D Systolic Arrays (2401.03585v1)

Published 7 Jan 2024 in cs.ET

Abstract: Systolic arrays are popular for executing deep neural networks (DNNs) at the edge. Low latency and energy efficiency are key requirements in edge devices such as drones and autonomous vehicles. Monolithic 3D (MONO3D) is an emerging 3D integration technique that offers ultra-high bandwidth between processing and memory elements with negligible area overhead. Such high bandwidth can help meet the ever-growing latency and energy-efficiency demands of DNNs. This paper presents a novel implementation of the weight-stationary (WS) dataflow in MONO3D systolic arrays, called WS-MONO3D. WS-MONO3D utilizes multiple resistive RAM layers and SRAM with high-density vertical interconnects to multicast inputs and perform high-bandwidth weight pre-loading while maintaining the same order of multiply-and-accumulate operations as the native WS dataflow. Consequently, WS-MONO3D eliminates input- and weight-forwarding cycles and thus provides up to 40% improvement in energy-delay product (EDP) over the native 2D WS implementation at iso-configuration. WS-MONO3D also provides a 10X improvement in inferences per second per watt per footprint due to its multiple vertical tiers. Finally, we show that temperature affects the energy-efficiency benefits of WS-MONO3D.
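The abstract's central claim is that high-bandwidth vertical interconnects remove the cycles a 2D weight-stationary array spends forwarding weights row by row before compute can begin. The toy cycle-count model below illustrates that effect; the function name, parameters, and the simple fill/drain latency formula are illustrative assumptions, not the paper's analytical model.

```python
def ws_cycles(rows, cols, n_inputs, preload_bw_rows=1):
    """Cycle count for one weight-stationary matmul tile on a rows x cols array.

    preload_bw_rows models weight pre-load bandwidth (an assumption for
    illustration):
      1    -> native 2D WS: weights forwarded one row per cycle
      rows -> full-bandwidth vertical pre-load, as WS-MONO3D envisions
    """
    preload = -(-rows // preload_bw_rows)  # ceil(rows / bandwidth)
    # Inputs are skewed across the array, so compute time is the input
    # stream length plus pipeline fill/drain latency.
    compute = n_inputs + rows + cols - 2
    return preload + compute

# 128x128 array, 256-element input stream:
native = ws_cycles(128, 128, 256, preload_bw_rows=1)    # row-by-row forwarding
mono3d = ws_cycles(128, 128, 256, preload_bw_rows=128)  # vertical pre-load
print(native, mono3d)  # pre-load shrinks from 128 cycles to 1
```

Under this sketch the pre-load overhead drops from O(rows) to O(1) per tile, which is the mechanism behind the EDP savings the abstract reports; the 40% figure itself comes from the paper's detailed evaluation, not from this model.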
