Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System

Published 11 Mar 2024 in cs.AR and cs.LG | arXiv:2403.06664v1

Abstract: The recent dramatic advances in LLMs have been driven mainly by increases in the number of parameters. This has led to substantial memory capacity requirements, necessitating the use of dozens of GPUs just to meet the capacity. One popular solution is storage-offloaded training, which uses host memory and storage as an extended memory hierarchy. However, this comes at the cost of a storage bandwidth bottleneck, because storage devices have orders-of-magnitude lower bandwidth than GPU device memories. Our work, Smart-Infinity, addresses the storage bandwidth bottleneck of storage-offloaded LLM training using near-storage processing devices on a real system. The main component of Smart-Infinity is SmartUpdate, which performs parameter updates on custom near-storage accelerators. We identify that moving parameter updates to the storage side removes most of the storage traffic. In addition, we propose an efficient data transfer handler structure to address the system integration issues of Smart-Infinity. The handler allows overlapping data transfers with fixed memory consumption by reusing the device buffer. Lastly, we propose accelerator-assisted gradient compression/decompression to enhance the scalability of Smart-Infinity. When scaling to multiple near-storage processing devices, the write traffic on the shared channel becomes the bottleneck. To alleviate this, we compress the gradients on the GPU and decompress them on the accelerators, providing further acceleration from the reduced traffic. As a result, Smart-Infinity achieves a significant speedup over the baseline. Notably, Smart-Infinity is a ready-to-use approach that is fully integrated into PyTorch on a real system. We will open-source Smart-Infinity to facilitate its use.


Summary

  • The paper demonstrates that near-storage processing with CSDs and gradient compression reduces data traffic drastically, enabling up to a 2.11× training speedup.
  • It leverages FPGAs in CSDs integrated with PyTorch via DeepSpeed to offload computations from the host, mitigating storage bandwidth bottlenecks.
  • Extensive experiments validate its scalability and effectiveness for LLM fine-tuning, offering practical benefits for both academic research and industrial applications.

Smart-Infinity: Fast LLM Training using Near-Storage Processing on a Real System

Introduction

The paper "Smart-Infinity: Fast LLM Training using Near-Storage Processing on a Real System" introduces Smart-Infinity, an innovative solution that addresses the storage bandwidth bottleneck in storage-offloaded LLM training. This method incorporates Computational Storage Devices (CSDs) to perform parameter updates and gradient compression directly on storage-side accelerators, thereby reducing storage-related traffic and enhancing training speed.

System Architecture

Smart-Infinity leverages CSDs, which pair an FPGA directly with storage so that computation can happen next to the data, alleviating the conventional bandwidth constraint. SmartUpdate relocates parameter updates from the host to the storage-side accelerators, reducing per-step storage traffic from $8M$ (optimizer states and gradients) to $2M$. SmartComp further reduces write traffic on the shared interconnect by compressing gradients on the GPU and decompressing them on the accelerators (see Figure 1).

Figure 1: A conceptual diagram of the storage-offloaded LLM training. Overview of (a) the forward pass, (b) the backward pass, and (c) the update (step) procedure.
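
To make the division of work concrete, the following is a minimal Python sketch of an Adam-style update executed next to storage, as SmartUpdate conceptually does. NumPy stands in for the FPGA kernel; the function name, dtypes, and the toy shard size are illustrative assumptions rather than the paper's actual implementation.

```python
# Sketch: optimizer update performed on the storage side. Only the gradient has to
# cross the interconnect; parameters and optimizer states stay near storage.
import numpy as np

def storage_side_adam_step(param, grad, exp_avg, exp_avg_sq, step,
                           lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update applied in place near storage."""
    beta1, beta2 = betas
    exp_avg[:] = beta1 * exp_avg + (1.0 - beta1) * grad             # first moment
    exp_avg_sq[:] = beta2 * exp_avg_sq + (1.0 - beta2) * grad ** 2  # second moment
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    param[:] -= lr * (exp_avg / bias1) / (np.sqrt(exp_avg_sq / bias2) + eps)
    return param

# Toy usage: a parameter shard kept near storage and updated in place.
n = 1_000_000
param = np.random.randn(n).astype(np.float32)
exp_avg = np.zeros(n, dtype=np.float32)
exp_avg_sq = np.zeros(n, dtype=np.float32)
grad = np.random.randn(n).astype(np.float32)  # the only tensor shipped from the GPU
storage_side_adam_step(param, grad, exp_avg, exp_avg_sq, step=1)
```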

Implementation and Optimizations

Smart-Infinity integrates with PyTorch through DeepSpeed, offering a ready-to-use framework. By using the FPGAs in the CSDs to handle parameter updates and gradient decompression, it removes the host-side reads and writes of gradients and optimizer states and benefits from aggregate storage bandwidth that grows linearly as more CSDs are added (see Figure 2).

Figure 2: An example environment with CSDs (e.g., SmartSSDs).

SmartUpdate includes a data transfer handler that streamlines the internal data flow between the SSD and the FPGA and overlaps these transfers with computation to hide latency, keeping memory consumption fixed by reusing device buffers. Together, these optimizations yield up to a 2.11× training speedup over baseline approaches (see Figure 3).

Figure 3: (a) LLM storage-offloaded training time breakdown across model sizes. (b) Speedup from increasing the number of SSDs in a RAID0 configuration.
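
As a host-side analogy of the double-buffering idea behind the handler (an assumption for exposition; the real handler orchestrates transfers between the SSD and FPGA rather than Python queues), the sketch below reuses two fixed buffers so that loading the next chunk overlaps with computation on the current one while memory consumption stays constant.

```python
# Sketch: overlap data transfer with computation using a fixed pool of reusable buffers.
import threading
from queue import Queue

NUM_BUFFERS = 2           # fixed footprint: buffers are reused, never allocated per chunk
free_buffers = Queue()
ready_chunks = Queue()

def load_chunk(chunk_id, buf):
    buf["data"] = f"chunk-{chunk_id}"   # placeholder for a storage -> accelerator transfer

def compute_on_chunk(buf):
    pass                                # placeholder for the accelerator's update kernel

def transfer_thread(num_chunks):
    for chunk_id in range(num_chunks):
        buf = free_buffers.get()        # blocks until compute releases a buffer
        load_chunk(chunk_id, buf)
        ready_chunks.put(buf)
    ready_chunks.put(None)              # sentinel: no more chunks

def compute_thread():
    while True:
        buf = ready_chunks.get()
        if buf is None:
            break
        compute_on_chunk(buf)
        free_buffers.put(buf)           # release the buffer for the next transfer

for i in range(NUM_BUFFERS):
    free_buffers.put({"id": i})

t = threading.Thread(target=transfer_thread, args=(8,))
c = threading.Thread(target=compute_thread)
t.start(); c.start(); t.join(); c.join()
```

The design choice mirrored here is that the pipeline depth (two buffers) bounds memory use, while the blocking queues naturally throttle transfers to the pace of computation.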

Performance Evaluation

Experiments demonstrate that Smart-Infinity substantially accelerates training and scales effectively with the number of CSDs. In scenarios where conventional storage offloading hits bandwidth limits, Smart-Infinity delivers consistent speedups without sacrificing model accuracy, particularly for fine-tuning tasks (see Figure 4).

Figure 4: Update procedure of the storage-offloaded training with (a) baseline and (b) SmartUpdate.

Applicability and Future Work

Smart-Infinity's ability to compress gradients on the GPU and perform updates near storage enhances its applicability across various domains such as model compression and distributed training. Future work could extend Smart-Infinity's concepts to broader system architectures involving dynamic resource sharing and further minimize GPU-host bandwidth dependencies.
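
As a hedged illustration of what GPU-side compression paired with accelerator-side decompression can look like, the sketch below uses top-k gradient sparsification; this is an illustrative stand-in, not necessarily the compression scheme the paper implements.

```python
# Sketch: compress the gradient on the GPU side (keep only the k largest-magnitude
# entries), send (indices, values) over the shared channel, and rebuild a dense
# gradient on the near-storage accelerator before the update.
import numpy as np

def compress_topk(grad, ratio=0.01):
    """Keep the largest-magnitude `ratio` fraction of gradient entries."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]     # indices of the k largest magnitudes
    return idx.astype(np.int32), grad[idx]            # ~k entries instead of grad.size

def decompress_topk(idx, vals, size):
    """Rebuild a dense gradient (zeros elsewhere) on the accelerator side."""
    dense = np.zeros(size, dtype=vals.dtype)
    dense[idx] = vals
    return dense

grad = np.random.randn(1_000_000).astype(np.float32)
idx, vals = compress_topk(grad, ratio=0.01)           # written over the shared interconnect
restored = decompress_topk(idx, vals, grad.size)      # reconstructed near storage
```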

Conclusion

Smart-Infinity offers a cost-effective way to accelerate LLM training: near-storage processing attacks the storage bandwidth bottleneck with practical storage-side computation and efficient data management. The implementation is fully integrated into PyTorch on a real system and is slated to be open-sourced, demonstrating practical utility for both academic research and industrial deployment and setting a precedent for future computational-storage solutions for AI workloads.
