SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators (2405.00790v2)
Abstract: Emerging multi-model workloads with heavy models such as recent LLMs have significantly increased the compute and memory demands on hardware. To address these increasing demands, designing scalable hardware architectures has become a key problem. Among recent solutions, 2.5D silicon-interposer multi-chip module (MCM)-based AI accelerators have been actively explored as a promising scalable solution due to their low engineering cost and composability. However, previous MCM accelerators are based on homogeneous architectures with fixed dataflow, whose limited workload adaptivity creates major challenges for highly heterogeneous multi-model workloads. Therefore, in this work, we explore the opportunity in heterogeneous-dataflow MCM AI accelerators. We identify that scheduling multi-model workloads on heterogeneous-dataflow MCM AI accelerators is an important and challenging problem due to its scale, which reaches O(10^56) even for a two-model workload on 6x6 chiplets. We develop a set of heuristics to navigate this huge scheduling space and codify them into a scheduler, SCAR, together with advanced techniques such as inter-chiplet pipelining. Our evaluation on ten multi-model workload scenarios for datacenter multitenancy and AR/VR use cases shows the efficacy of our approach, achieving on average 27.6% and 29.6% lower energy-delay product (EDP) for the respective application settings compared to homogeneous baselines.
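To make the scheduling problem concrete, the sketch below shows a greedy heuristic that assigns layers from two models to a 6x6 grid of chiplets with mixed dataflows, picking the chiplet that minimizes an estimated energy-delay product (EDP), the metric the abstract reports. This is not SCAR's algorithm: the cost model, the dataflow names ("weight-stationary", "output-stationary"), the affinity factors, and the workload numbers are all illustrative assumptions, and SCAR's real heuristics additionally handle inter-chiplet pipelining and a far richer mapping space.

```python
"""Minimal sketch of greedy multi-model scheduling on a grid of
heterogeneous-dataflow chiplets. All cost-model constants and layer
parameters here are hypothetical, not values from the paper."""
from dataclasses import dataclass
from itertools import product

@dataclass
class Chiplet:
    row: int
    col: int
    dataflow: str          # assumed styles: "weight-stationary" / "output-stationary"
    busy_until: float = 0.0

@dataclass
class Layer:
    model: str
    name: str
    flops: float           # compute demand (arbitrary units)
    bytes_moved: float     # memory traffic (arbitrary units)

def edp(latency: float, energy: float) -> float:
    # Energy-delay product: the optimization target named in the abstract.
    return latency * energy

def estimate(layer: Layer, chiplet: Chiplet) -> tuple[float, float]:
    # Toy affinity model (an assumption): a compute-heavy layer runs
    # better on a weight-stationary chiplet, a traffic-heavy layer on
    # an output-stationary one. SCAR uses a real analytical cost model.
    matches = (chiplet.dataflow == "weight-stationary") == (layer.flops >= layer.bytes_moved)
    affinity = 0.7 if matches else 1.3
    return affinity * layer.flops / 1e3, affinity * layer.bytes_moved / 1e3

def greedy_schedule(layers: list[Layer], chiplets: list[Chiplet]):
    """Assign each layer to the chiplet with the lowest incremental EDP,
    lightly penalizing chiplets that are already busy."""
    schedule = []
    for layer in layers:
        best = min(chiplets, key=lambda c: edp(*estimate(layer, c)) + c.busy_until)
        latency, _energy = estimate(layer, best)
        best.busy_until += latency
        schedule.append((layer.model, layer.name, (best.row, best.col), best.dataflow))
    return schedule

if __name__ == "__main__":
    # 6x6 grid, alternating dataflows across the interposer (an assumption).
    grid = [Chiplet(r, c, "weight-stationary" if (r + c) % 2 else "output-stationary")
            for r, c in product(range(6), range(6))]
    # Two-model workload: a compute-bound LLM and a traffic-bound vision model.
    workload = ([Layer("llm", f"block{i}", flops=8.0, bytes_moved=2.0) for i in range(4)]
              + [Layer("vision", f"conv{i}", flops=2.0, bytes_moved=6.0) for i in range(4)])
    for entry in greedy_schedule(workload, grid):
        print(entry)
```

Even this toy version hints at why the space is so large: each layer of each model could in principle go to any chiplet (or set of chiplets) under any dataflow and ordering, which is what drives the O(10^56) figure for a two-model workload on 6x6 chiplets and motivates heuristic rather than exhaustive search.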