Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models (2307.02666v4)
Abstract: Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated unprecedented capabilities of autoregressive AI models across multiple tasks, triggering disruptive technology innovations around the world. However, as models continue to grow, so does the cost of serving them, threatening the democratization of LLMs. To address this issue, we propose Chiplet Cloud, a chiplet-based ASIC LLM supercomputer architecture designed to optimize the total cost of ownership (TCO) per generated token. Chiplet Cloud is a highly parameterizable ASIC- and server-level architecture that leverages thousands of replicated accelerator modules collaborating to scale up LLM performance at cloud scale. To determine specific parameterizations of the Chiplet Cloud architecture, we implement a two-phase hardware-software co-design methodology that searches the massive design space and fine-tunes the architecture across a collection of LLMs, based on an accurate inference simulation. Because memory access performance is a common bottleneck for LLMs, we introduce CC-MEM, a scalable on-chip memory system for Chiplet Cloud architectures. Using CC-MEM, Chiplet Clouds can be built entirely from SRAM for design points where the power and performance of memory access are critical. CC-MEM also includes a compression decoder module that adds support for sparse models without impacting the compute units, via a Store-as-Compressed, Load-as-Dense mechanism. We evaluate Chiplet Cloud architectures across eight popular LLMs. Fine-tuned Chiplet Cloud servers achieve $97\times$ and $18\times$ improvements in TCO/Token over rented GPU and TPU clouds, and $8.3\times$ and $3.7\times$ improvements over fabricated GPU and TPU clouds, respectively. Chiplet Cloud can also support $1.7\times$ larger models at 60% sparsity.
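To make the TCO/Token objective concrete, below is a minimal, self-contained Python sketch of a design-space sweep in the spirit of the two-phase co-design search the abstract describes. This is not the paper's actual methodology or cost model: every parameter name, cost constant, and throughput formula here is an invented placeholder standing in for the CapEx, OpEx, and inference-simulation models the paper uses.

```python
# Toy TCO/Token design-space sweep. All names and numbers are hypothetical
# placeholders, not values from the Chiplet Cloud paper.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class DesignPoint:
    chiplets_per_server: int   # replicated accelerator modules per server
    sram_mb_per_chiplet: int   # on-chip (CC-MEM-style) SRAM capacity
    tops_per_chiplet: float    # peak compute per chiplet

def tco_per_token(dp: DesignPoint) -> float:
    """Toy model: (amortized CapEx + OpEx per hour) / tokens per hour."""
    capex_per_hour = dp.chiplets_per_server * (0.02 + 0.001 * dp.sram_mb_per_chiplet)
    opex_per_hour = dp.chiplets_per_server * 0.05 * dp.tops_per_chiplet  # power-dominated
    # Throughput saturates on either compute or on-chip memory bandwidth.
    tokens_per_hour = dp.chiplets_per_server * min(
        dp.tops_per_chiplet * 1e4,      # compute-bound regime
        dp.sram_mb_per_chiplet * 2e3,   # memory-bound regime
    )
    return (capex_per_hour + opex_per_hour) / tokens_per_hour

# Exhaustively sweep a (small, toy) design space and keep the cheapest point.
space = [DesignPoint(c, m, t)
         for c, m, t in product([16, 32, 64], [64, 128, 256], [50.0, 100.0])]
best = min(space, key=tco_per_token)
print(f"best design point: {best}, TCO/Token = {tco_per_token(best):.2e}")
```

In a real version of this search, `tco_per_token` would be backed by the per-workload inference simulation and fab/packaging cost models the paper describes, rather than closed-form placeholders.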