
Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models (2307.02666v4)

Published 5 Jul 2023 in cs.AR

Abstract: LLMs such as OpenAI's ChatGPT and Google's Gemini have demonstrated unprecedented capabilities of autoregressive AI models across multiple tasks, triggering disruptive technology innovations around the world. However, as models continue to grow, the cost to serve them also grows, threatening the democratization of LLMs. To address this issue, we propose Chiplet Cloud, a chiplet-based ASIC LLM-supercomputer architecture whose goal is to optimize the total cost of ownership (TCO) per generated token. Chiplet Cloud is a highly parameterizable ASIC and server-level architecture in which thousands of replicated accelerator modules collaborate to scale up LLM performance at cloud scale. To determine specific parameterizations of the architecture, we implemented a two-phase hardware-software co-design methodology that searches the massive design space and fine-tunes the architecture across a collection of LLMs, driven by an accurate inference simulation. Because memory access performance is a common bottleneck for LLMs, we introduce CC-MEM, a scalable on-chip memory system for Chiplet Cloud architectures. Using CC-MEM, Chiplet Clouds can be built using only SRAM for design points where the power and performance of memory access are critical. CC-MEM also includes a compression decoder module that adds support for sparse models without impacting the compute units, via a Store-as-Compressed, Load-as-Dense mechanism. We evaluate Chiplet Cloud architectures across eight popular LLMs. Using fine-tuned Chiplet Cloud servers, we achieve $97\times$ and $18\times$ improvements in TCO/Token over rented GPU and TPU clouds, or $8.3\times$ and $3.7\times$ improvements over fabricated GPU and TPU clouds, respectively. Chiplet Cloud can also support $1.7\times$ larger models with a sparsity of 60\%.
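The architecture's figure of merit is TCO per generated token. The sketch below is a minimal, illustrative cost model for that metric, assuming amortized server capital cost plus energy cost divided by token throughput; the formula and all numbers are hypothetical placeholders, not values or methodology from the paper.

```python
# Illustrative sketch of a TCO/Token objective. The cost model and all
# example numbers below are hypothetical, not taken from the paper.

def tco_per_token(server_capex_usd: float,
                  lifetime_years: float,
                  avg_power_watts: float,
                  electricity_usd_per_kwh: float,
                  tokens_per_second: float) -> float:
    """Amortized total cost of ownership per generated token (USD/token)."""
    lifetime_seconds = lifetime_years * 365 * 24 * 3600
    capex_per_s = server_capex_usd / lifetime_seconds            # amortized hardware cost, $/s
    opex_per_s = (avg_power_watts / 1000) * electricity_usd_per_kwh / 3600  # energy cost, $/s
    return (capex_per_s + opex_per_s) / tokens_per_second

# Example: one hypothetical server design point.
print(tco_per_token(server_capex_usd=50_000, lifetime_years=3,
                    avg_power_watts=2_000, electricity_usd_per_kwh=0.08,
                    tokens_per_second=10_000))
```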

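The abstract describes a two-phase hardware-software co-design search: enumerate hardware design points, then evaluate software mappings per model against an inference simulator. Below is a toy sketch of that control flow only; the parameter ranges, the pruning check, and the cost formula are stand-ins for the paper's actual design space and simulator.

```python
# Hedged sketch of a two-phase design-space search in the spirit of the
# paper's co-design methodology. Everything numeric here is illustrative.

from itertools import product

def toy_simulator(sram_mb, chiplets, tensor_parallel, model_params_b):
    """Stand-in for the paper's accurate inference simulation."""
    # Toy capacity check: reject points that cannot hold the model on-chip.
    if sram_mb * chiplets * tensor_parallel < model_params_b * 500:
        return float("inf")
    latency = model_params_b / (chiplets * tensor_parallel)  # toy latency proxy
    cost = chiplets * sram_mb * 0.01                          # toy $/s proxy
    return cost * latency                                     # toy TCO/token proxy

def two_phase_search(model_params_b=70):
    best = (float("inf"), None)
    # Phase 1: enumerate hardware design points (chip/server parameters).
    for sram_mb, chiplets in product([64, 128, 256], [8, 16, 32, 64]):
        # Phase 2: search software mappings for each surviving hardware point.
        for tp in [1, 2, 4, 8]:
            score = toy_simulator(sram_mb, chiplets, tp, model_params_b)
            if score < best[0]:
                best = (score, dict(sram_mb=sram_mb, chiplets=chiplets, tp=tp))
    return best

print(two_phase_search())
```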
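CC-MEM's Store-as-Compressed, Load-as-Dense mechanism keeps weights compressed in on-chip memory and expands them to dense form on load, so the compute units never see sparsity. The abstract does not specify the compression format; the sketch below assumes a simple bitmask scheme purely for illustration, modeling in software what the paper's decoder does in hardware.

```python
# Minimal sketch of a "Store-as-Compressed, Load-as-Dense" scheme, assuming a
# bitmask format (an assumption; the paper's actual encoding is not given in
# the abstract). Weights are stored compressed and expanded to dense on load.

import numpy as np

def store_compressed(dense: np.ndarray):
    """Store only the nonzero values plus an occupancy bitmask."""
    mask = dense != 0
    return dense[mask], mask  # values + bitmask, kept in on-chip memory

def load_dense(values: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Decoder path: expand compressed weights back to a dense vector."""
    dense = np.zeros(mask.shape, dtype=values.dtype)
    dense[mask] = values
    return dense

w = np.array([0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.25, 0.0])  # >60% zeros
vals, mask = store_compressed(w)
assert np.array_equal(load_dense(vals, mask), w)
# The compressed footprint shrinks roughly with sparsity, so a fixed SRAM
# budget can hold a larger model (cf. the abstract's 1.7x at 60% sparsity).
```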