
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts (2405.07518v2)

Published 13 May 2024 in cs.AR and cs.AI

Abstract: Monolithic LLMs like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in compute-to-memory ratio of modern AI accelerators have created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2$\times$ to 13$\times$ on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19$\times$, speeds up model switching time by 15$\times$ to 31$\times$, and achieves an overall speedup of 3.7$\times$ over a DGX H100 and 6.6$\times$ over a DGX A100.

Citations (4)

Summary

  • The paper introduces a Composition of Experts (CoE) model that combines multiple specialized expert models to match the capability of a monolithic LLM at lower training and serving cost.
  • The paper proposes a custom reconfigurable dataflow unit featuring a three-tier memory system—SRAM, HBM, and DDR—that overcomes the traditional AI memory wall.
  • The paper demonstrates up to 31x faster model switching, highlighting significant improvements in operational efficiency and cost-effectiveness.

Understanding Samba-CoE: A Composition of Experts System Optimized for the SambaNova SN40L

Unpacking Samba-CoE and Its Challenges

The field of AI, particularly around LLMs, has seen a growing number of specialized expert models, each tailored to specific tasks. Samba-CoE takes a modular approach: it integrates 150 of these smaller expert models, totaling about a trillion parameters, so that they function as one cohesive system. The goal is to perform at a level comparable to monolithic LLMs while being more memory- and cost-efficient.

One of the primary challenges the paper explores is whether conventional computing architectures can serve these models efficiently. Moving from one monolithic model to many smaller experts creates two problems: without fused operations, the smaller models have lower operational intensity, which makes high hardware utilization harder to achieve; and hosting a large number of models is either prohibitively expensive or slow when the system must switch between them dynamically. A minimal sketch of what a CoE inference loop looks like follows below.
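To make the setting concrete, here is a minimal sketch of a CoE-style inference loop, assuming a router that dispatches each prompt to exactly one expert. The class, router, and expert names are hypothetical illustrations, not the actual Samba-CoE components.

```python
from typing import Callable, Dict

class CompositionOfExperts:
    """Toy CoE: a router picks one specialist model per prompt."""

    def __init__(self, router: Callable[[str], str],
                 experts: Dict[str, Callable[[str], str]]):
        self.router = router    # maps a prompt to an expert id
        self.experts = experts  # expert id -> callable model

    def generate(self, prompt: str) -> str:
        expert_id = self.router(prompt)   # pick one specialist
        expert = self.experts[expert_id]  # in a real system this may mean
                                          # paging that expert's weights in
        return expert(prompt)

# Toy usage: two "experts" routed by a trivial keyword rule.
experts = {
    "code": lambda p: f"[code expert] {p}",
    "chat": lambda p: f"[chat expert] {p}",
}
router = lambda p: "code" if "def " in p or "class " in p else "chat"

coe = CompositionOfExperts(router, experts)
print(coe.generate("Explain dataflow accelerators"))
```

The hard part is not the routing logic itself but the line that fetches the selected expert: with dozens or hundreds of multi-gigabyte experts, where their weights live and how quickly they can reach the accelerator dominates cost, which is exactly what the SN40L's memory system targets.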

The SambaNova SN40L: A Solution Geared for Efficiency

At the heart of addressing these challenges is the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), a commercial dataflow accelerator co-designed for enterprise inference and training and well suited to compositions like Samba-CoE. Its three-tier memory system, combining on-chip distributed SRAM, on-package High Bandwidth Memory (HBM), and off-package DDR DRAM, together with a dedicated inter-RDU network for scaling across sockets, offers a way around the "memory wall" that constrains traditional accelerators.

A Closer Look at SN40L's Architectural Foundation:

  1. On-chip distributed SRAM - Provides the fastest access for small, frequently reused data, such as the intermediate results streaming between the operators of the currently executing model.
  2. On-package HBM - Serves as an intermediate tier with far greater capacity than the SRAM, suited to larger but still frequently accessed data such as the weights of the experts currently being served.
  3. Off-package DDR DRAM - Offers the largest capacity and holds less frequently accessed data, most importantly the full library of expert models kept resident within the same node (a rough placement sketch follows this list).
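Here is a back-of-envelope sketch of how such a hierarchy might be used for CoE serving; the tier capacities and expert sizes are illustrative placeholders, not SN40L specifications.

```python
from collections import Counter

# Illustrative tier capacities in GiB (placeholders, not SN40L specs).
TIERS = {
    "SRAM (on-chip)": 0.5,      # working tensors of the layers currently running
    "HBM (on-package)": 64,     # the "hot" experts actively being served
    "DDR (off-package)": 1536,  # the full library of experts, staged on demand
}

def place_experts(expert_sizes_gib, hot_experts):
    """Keep hot experts in HBM while it has room; everything else lives in DDR."""
    placement, hbm_used = {}, 0.0
    for name, size in expert_sizes_gib.items():
        if name in hot_experts and hbm_used + size <= TIERS["HBM (on-package)"]:
            placement[name] = "HBM (on-package)"
            hbm_used += size
        else:
            placement[name] = "DDR (off-package)"
    return placement

# 150 experts of ~7 GiB each (e.g. a 7B-parameter expert at 8-bit precision)
# exceed HBM many times over but fit in the DDR tier of a single node.
experts = {f"expert_{i}": 7.0 for i in range(150)}
placement = place_experts(experts, hot_experts={"expert_0", "expert_1"})
print(Counter(placement.values()))  # most experts land in DDR, the hot few in HBM
```

The essential point is that every expert stays directly addressable by the accelerator, so switching experts becomes a transfer between memory tiers rather than a reload from host storage.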

Efficient Execution: The Streaming Dataflow Model

A further distinguishing feature of the SN40L is its streaming dataflow execution model. Conventional accelerators execute a model kernel by kernel, writing each operator's output back to off-chip memory unless engineers hand-write fused kernels; for the small operators that dominate expert models, that traffic keeps operational intensity, and therefore utilization, low. The RDU instead maps a graph of operators spatially onto the chip, and intermediate results stream between them through on-chip SRAM, giving the effect of aggressive operator fusion without hand-written fused kernels.
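A rough way to see why this matters is to count off-chip traffic for a chain of small operators under each execution style. The numbers below are illustrative and are not a model of the SN40L itself.

```python
# Compare off-chip traffic for a chain of operators executed kernel-by-kernel
# (unfused) versus streamed through an on-chip pipeline.

def offchip_bytes_unfused(input_bytes, intermediate_bytes, output_bytes, n_ops):
    # Each operator reads its input from off-chip memory and writes its output
    # back, so every intermediate result round-trips through off-chip memory.
    return input_bytes + output_bytes + 2 * intermediate_bytes * (n_ops - 1)

def offchip_bytes_streamed(input_bytes, output_bytes):
    # Intermediates stay on chip and stream between pipeline stages.
    return input_bytes + output_bytes

inp = out = inter = 8 * 1024**2   # 8 MiB activations (illustrative)
flops = 50e9                      # arbitrary total work for the whole chain

for n_ops in (4, 16):
    unfused = offchip_bytes_unfused(inp, inter, out, n_ops)
    streamed = offchip_bytes_streamed(inp, out)
    print(f"{n_ops} ops: operational intensity "
          f"{flops/unfused:.0f} vs {flops/streamed:.0f} FLOP/byte")
```

As the chain grows, the unfused intensity keeps falling while the streamed pipeline's stays fixed, which is the gap reflected in the paper's reported 2x to 13x speedups over an unfused baseline.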

Strengthening the System: Memory and Execution Speed

Deploying Samba-CoE on the SambaNova SN40L yields large reductions in both machine footprint and model-switching overhead. For CoE inference deployments, the paper reports that an 8-socket RDU node shrinks machine footprint by up to 19x, switches between models 15x to 31x faster, and delivers overall speedups of 3.7x over a DGX H100 and 6.6x over a DGX A100. The gain is not just raw speed: the system can hold and transition between a large number of expert models without excessive resource expenditure, because the memory hierarchy keeps inactive experts in directly attached DDR and moves them on demand.
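The scale of the switching improvement is easy to appreciate with a simple latency estimate: the time to bring a new expert's weights to where it will execute. The bandwidth figures below are illustrative placeholders, not measured SN40L or DGX numbers.

```python
# Back-of-envelope model-switch latency: time to move one expert's weights.

def switch_time_s(weight_bytes, bandwidth_bytes_per_s):
    return weight_bytes / bandwidth_bytes_per_s

expert_bytes = 13 * 1024**3  # ~13 GiB expert (e.g. 7B params in bf16)

# Illustrative bandwidths for different places the weights might come from.
scenarios = {
    "accelerator-attached DDR": 200e9,  # weights already resident in node DDR
    "host memory over PCIe":     50e9,  # weights staged in host RAM
    "NVMe storage":               7e9,  # weights fetched from disk
}
for name, bw in scenarios.items():
    print(f"{name:26s}: {switch_time_s(expert_bytes, bw):6.2f} s")
```

Keeping the whole expert library in accelerator-attached DDR removes the slowest paths from this picture, which is the intuition behind the reported 15x to 31x switching speedups.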

Future Implications and Developments

The introduction and optimization of Samba-CoE on an accelerator like the SambaNova SN40L point toward more modular, scalable AI systems. By cutting the hardware footprint and cost of serving many specialized models, such systems could make advanced AI accessible to a far wider range of organizations. Over the longer term, dynamic compositions of experts could displace monolithic models in many settings, offering tailored AI solutions without prohibitive costs.

In terms of future developments, further enhancements in memory technology and dataflow architectures could allow even more refined management of expert systems. Additionally, improvements in the design and deployment of these expert models could lead to broader applications beyond current capabilities, penetrating industries that have not yet fully adopted AI due to cost or complexity barriers.

The continued development and refinement of systems like Samba-CoE paired with advanced accelerators like SambaNova SN40L not only highlight the technological advancements in AI but also pave the way for more inclusive, widespread access to cutting-edge AI technologies.
