GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching (2401.08156v1)

Published 16 Jan 2024 in cs.DC

Abstract: Large-scale deep neural networks (DNNs), such as LLMs, have revolutionized the AI field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, where the memory capacity of a single acceleration device like a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., $10 \times$) of GPUs' native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly for popular memory reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that those memory reduction techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation problems for the splitting-based caching allocator. To mitigate this fragmentation problem, we propose a novel memory allocation framework based on low-level GPU virtual memory management called GPU memory lake (GMLake). GMLake employs a novel virtual memory stitching (VMS) mechanism, which can fuse or combine non-contiguous memory blocks with a virtual memory address mapping. GMLake can reduce an average of 9.2 GB (up to 25 GB) GPU memory usage and 15% (up to 33%) fragmentation among eight LLM models on GPU A100 with 80 GB memory. GMLake is completely transparent to the DNN models and memory reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machine-learning/glake/tree/main/GMLake.

Overview of GMLake: Efficient GPU Memory Defragmentation for Large-Scale DNNs

The paper "GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-Scale DNN Training with Virtual Memory Stitching" presents a novel approach to address memory fragmentation in training large-scale deep neural networks (DNNs) on GPUs. The central focus of this research is on the development and implementation of a GPU memory allocation framework named GMLake, which leverages low-level virtual memory management to optimize memory usage during DNN training.

Memory Fragmentation Issues in DNN Training

As DNN models grow in size and complexity, memory fragmentation has emerged as a significant challenge, particularly for LLMs. Fragmentation arises when free memory is scattered across non-contiguous blocks, none of which is large enough to serve an incoming request, even though the total free capacity would suffice. The paper identifies the limitations of existing GPU memory management: the GPU's native allocator is too slow for per-tensor use, so frameworks like PyTorch and TensorFlow rely on caching allocators that maintain a memory pool and split cached segments to serve requests. These splitting-based allocators degrade sharply under memory reduction techniques such as recomputation, offloading, and distributed training, whose frequent and irregular (de)allocation patterns leave the pool severely fragmented, as the sketch below illustrates.
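To make the failure mode concrete, the following host-side sketch simulates a splitting-based caching pool. The 8 GB segment size, 2 GB tensor sizes, and first-fit policy are illustrative assumptions for this toy model, not PyTorch's actual implementation.

```cpp
#include <cstdio>
#include <cstdint>
#include <map>

// Toy model of a splitting-based caching allocator's pool: one cached
// 8 GB segment, tracked as a map from byte offset to free-block size.
int main() {
    const size_t GB = 1ull << 30;
    std::map<size_t, size_t> freeBlocks = {{0, 8 * GB}};  // offset -> size

    auto alloc = [&](size_t want) -> intptr_t {
        for (auto it = freeBlocks.begin(); it != freeBlocks.end(); ++it) {
            if (it->second >= want) {                     // first fit
                size_t off = it->first, rest = it->second - want;
                freeBlocks.erase(it);
                if (rest) freeBlocks[off + want] = rest;  // split remainder
                return (intptr_t)off;
            }
        }
        return -1;  // no single free block is large enough
    };
    auto dealloc = [&](intptr_t off, size_t size) {
        freeBlocks[(size_t)off] = size;  // real allocators also coalesce
    };                                   // adjacent free blocks; the holes
                                         // below are not adjacent, so
                                         // coalescing would not help

    // Irregular pattern typical of recomputation/offloading: allocate four
    // 2 GB tensors, then free the first and third.
    intptr_t a = alloc(2 * GB), b = alloc(2 * GB);
    intptr_t c = alloc(2 * GB), d = alloc(2 * GB);
    dealloc(a, 2 * GB);
    dealloc(c, 2 * GB);
    (void)b; (void)d;  // b and d stay live and pin the layout

    // 4 GB is free in total, but split into two non-contiguous 2 GB holes,
    // so a 3 GB request cannot be served from the pool.
    printf("3 GB request %s\n", alloc(3 * GB) < 0 ? "fails" : "succeeds");
    return 0;
}
```

When such a request fails, the caching allocator falls back to reserving a fresh segment from the native allocator, which is what inflates total reserved memory in practice; GMLake instead stitches the existing holes together.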

GMLake's Innovative Approach

GMLake introduces a memory allocation strategy built on a mechanism called Virtual Memory Stitching (VMS). VMS combats fragmentation by fusing non-contiguous physical memory blocks into a single contiguous virtual address range. This removes a constraint inherent to conventional caching allocators, namely that each allocation be served from one physically contiguous block: a large tensor request can instead be satisfied by stitching together smaller free fragments. The GMLake allocator organizes its virtual memory pool into primitive memory pools and stitched memory pools to keep allocation fast.
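The stitching primitive maps directly onto CUDA's low-level virtual memory management API (cuMemCreate, cuMemAddressReserve, cuMemMap, cuMemSetAccess), which is the layer the paper builds on. The sketch below stitches two physically separate chunks into one contiguous virtual range; it illustrates the primitives only, not GMLake's actual pool-management or block-matching logic.

```cpp
// Build: nvcc vms_sketch.cu -o vms_sketch -lcuda  (driver API link)
#include <cuda.h>
#include <cstdio>

#define CHECK(call) do { CUresult r_ = (call); if (r_ != CUDA_SUCCESS) { \
    const char *s_; cuGetErrorString(r_, &s_);                           \
    fprintf(stderr, "CUDA error: %s\n", s_); return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Physical allocations must be pinned device memory on a named device.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = (int)dev;

    size_t gran;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    const size_t chunk = gran;  // one physical chunk per "fragment"

    // Two physically separate chunks, standing in for two free fragments.
    CUmemGenericAllocationHandle h0, h1;
    CHECK(cuMemCreate(&h0, chunk, &prop, 0));
    CHECK(cuMemCreate(&h1, chunk, &prop, 0));

    // Reserve one contiguous VA range, then map ("stitch") each physical
    // chunk into consecutive slices of it.
    CUdeviceptr va;
    CHECK(cuMemAddressReserve(&va, 2 * chunk, 0, 0, 0));
    CHECK(cuMemMap(va,         chunk, 0, h0, 0));
    CHECK(cuMemMap(va + chunk, chunk, 0, h1, 0));

    // Grant read/write access over the whole stitched range.
    CUmemAccessDesc acc = {};
    acc.location = prop.location;
    acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(va, 2 * chunk, &acc, 1));

    // The range now behaves as one contiguous buffer for any kernel.
    CHECK(cuMemsetD8(va, 0, 2 * chunk));
    printf("stitched %zu bytes at 0x%llx\n", 2 * chunk,
           (unsigned long long)va);

    // Teardown: unmap, release physical handles, free the VA reservation.
    CHECK(cuMemUnmap(va, 2 * chunk));
    CHECK(cuMemRelease(h0));
    CHECK(cuMemRelease(h1));
    CHECK(cuMemAddressFree(va, 2 * chunk));
    return 0;
}
```

Because the stitched range is contiguous in virtual address space, kernels and framework code above it need no changes, which is what makes the approach transparent to models and to memory reduction techniques.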

Numerical Results and Implications

The empirical evaluation demonstrates significant reductions in memory usage and fragmentation across various large-scale models and training configurations. On NVIDIA A100 GPUs with 80 GB of memory, GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by 15% (up to 33%) across eight LLM workloads. These results suggest that GMLake can enhance the execution of resource-intensive DNN tasks without compromising training throughput.

Theoretical and Practical Implications

From a theoretical perspective, GMLake's approach to defragmentation and memory management could pave the way for further research into optimizing existing DL frameworks. By integrating VMS, researchers can explore novel algorithms that capitalize on non-contiguous memory mapping to minimize overhead and maximize resource usage.

Practically, the adoption of GMLake in mainstream deep learning frameworks has the potential to significantly impact how large-scale models are trained, particularly in environments where memory resources are a primary bottleneck. The seamless integration of GMLake promises minimal disruption while offering substantial improvements in memory efficiency.

Future Directions

The insights provided by this research open avenues for further exploration, including adaptation to hardware accelerators beyond GPUs, such as TPUs or custom ASICs designed for AI workloads. The techniques described in GMLake could also be extended to offer more granular control over memory allocation, potentially benefiting real-time machine learning systems where predictable memory usage is critical.

In summary, the GMLake framework provides a powerful solution for mitigating memory fragmentation in modern DL frameworks, reinforcing the scalability and sustainability of large-scale DNN deployments. This research is a step forward in optimizing AI infrastructure, ensuring that computational resources are used efficiently and effectively.

Authors (11)
  1. Cong Guo
  2. Rui Zhang
  3. Jiale Xu
  4. Jingwen Leng
  5. Zihan Liu
  6. Ziyu Huang
  7. Minyi Guo
  8. Hao Wu
  9. Shouren Zhao
  10. Junping Zhao
  11. Ke Zhang