Communication Characterization of AI Workloads for Large-scale Multi-chiplet Accelerators
Abstract: Next-generation AI workloads pose scalability and execution-time challenges due to their intrinsically evolving, data-intensive characteristics. In this paper, we analyse the potential bottlenecks caused by the data movement of AI workloads on scale-out accelerator architectures composed of multiple chiplets. Our methodology captures the unicast and multicast communication traffic of a set of AI workloads and assesses the time spent in such communications and the volume of multicast messages as a function of the number of employed chiplets. Our studies reveal that some AI workloads are vulnerable to the dominant effects of communication, especially multicast traffic, which can become a performance bottleneck and limit their scalability. These workload-profiling insights suggest architecting a flexible chiplet-level interconnect to improve the performance, efficiency and scalability of next-generation AI accelerators.
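The characterization flow sketched in the abstract reduces to: record each inter-chiplet message, classify it as unicast or multicast by its destination count, and aggregate communication time and multicast volume for a given chiplet count. The following is a minimal Python sketch of that flow under assumed inputs; the Message record, profile_traffic function, and toy trace are hypothetical illustrations, not the authors' actual tooling.

```python
"""Hypothetical sketch of unicast/multicast traffic characterization.
Assumes a trace of inter-chiplet messages with explicit destination
lists and per-message interconnect occupancy in cycles."""

from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Message:
    src: int          # source chiplet id
    dests: list[int]  # destination chiplet ids (len > 1 => multicast)
    cycles: int       # cycles the message occupies the interconnect

def profile_traffic(trace: list[Message], num_chiplets: int) -> dict:
    """Classify each message as unicast or multicast and aggregate the
    statistics the paper studies: communication time and multicast share."""
    stats = defaultdict(int)
    for msg in trace:
        kind = "multicast" if len(msg.dests) > 1 else "unicast"
        stats[f"{kind}_msgs"] += 1
        stats[f"{kind}_cycles"] += msg.cycles
    total_msgs = stats["unicast_msgs"] + stats["multicast_msgs"]
    return {
        "chiplets": num_chiplets,
        "multicast_fraction": stats["multicast_msgs"] / max(total_msgs, 1),
        "comm_cycles": stats["unicast_cycles"] + stats["multicast_cycles"],
    }

# Toy 4-chiplet trace: one unicast transfer and one broadcast-style
# (multicast) weight distribution from chiplet 0 to all others.
trace = [
    Message(src=0, dests=[1], cycles=10),
    Message(src=0, dests=[1, 2, 3], cycles=30),
]
print(profile_traffic(trace, num_chiplets=4))
```

Sweeping such a profile over traces gathered at increasing chiplet counts would yield the multicast-fraction and communication-time scaling trends that the abstract reports as the basis for its interconnect-flexibility argument.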