DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (2207.00032v1)

Published 30 Jun 2022 in cs.LG, cs.DC, and cs.PF

Abstract: The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3X over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion parameter scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can inference 25x larger models than with GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over $50\%$ of A6000 peak).

Efficient Inference of Transformer Models at Scale

The paper "OURS: Enabling Efficient Inference of Transformer Models at Unprecedented Scale" addresses the challenge of designing highly performant and efficient inference systems for transformer models. Transformer-based models have grown significantly in size, with current models reaching up to trillion parameters. This growth demands diverse applications, such as latency-critical tasks and throughput-oriented tasks, which need to be deployed on single- or multi-GPU systems with various memory and storage types. As such, there is a critical need for a robust system capable of handling the evolving diversity in transformer models.

Key Contributions

The authors present "OURS", a comprehensive system designed to address these challenges by optimizing transformer model inference over multiple GPUs. The solution comprises two major components:

  1. Multi-GPU Inference Solution: This minimizes latency while maximizing throughput for both dense and sparse transformer models when the model fits within aggregate GPU memory (a minimal usage sketch follows this list).
  2. Heterogeneous Inference Solution: This leverages CPU, NVMe, and GPU resources to achieve high throughput when handling large models that cannot fit in aggregate GPU memory.
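The multi-GPU path is exposed through DeepSpeed's init_inference entry point, which shards the model with tensor parallelism and injects the optimized inference kernels. The sketch below is a minimal illustration, assuming a Hugging Face causal language model as a placeholder; the argument names (mp_size, replace_with_kernel_inject) follow DeepSpeed releases from around the paper's publication and may differ in newer versions.

```python
# Minimal sketch: multi-GPU transformer inference with DeepSpeed Inference.
# Assumes `deepspeed` and `transformers` are installed and GPUs are available;
# argument names follow ~2022-era DeepSpeed releases and may have changed since.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works in principle
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with the DeepSpeed inference engine:
#  - mp_size shards the model across visible GPUs (tensor parallelism)
#  - replace_with_kernel_inject swaps in the fused inference kernels
engine = deepspeed.init_inference(
    model,
    mp_size=torch.cuda.device_count(),
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed Inference reduces latency by", return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = engine.module.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

In practice such a script is started with the deepspeed launcher (for example, deepspeed --num_gpus 2 script.py) so that each rank holds one tensor-parallel shard of the model.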

Key numerical results highlighted in the paper include:

  • Up to 7.3× reduction in latency over the state of the art for latency-oriented scenarios.
  • Over 1.5× improvement in throughput for throughput-oriented scenarios.
  • Ability to perform trillion parameter scale inference under real-time latency constraints by utilizing hundreds of GPUs.

Technical Approaches

The paper details several technical strategies integrated into DeepSpeed Inference:

  • Deep-Fusion: This technique addresses the latency issues associated with small batch sizes by fusing multiple operations into a single kernel, minimizing data transfer between GPU cores and global memory so that transformer kernels approach peak memory-bandwidth utilization (a conceptual sketch of this idea follows the list).
  • Custom GeMM Kernels: For small batch sizes, custom General Matrix Multiplication (GeMM) kernels are employed to improve memory-bandwidth utilization and to seamlessly integrate within fused kernels.
  • Inference Optimized Pipeline Parallelism: This approach reduces pipeline bubble overhead and improves scheduling efficiency, specifically for generative transformer models where execution time can vary between prompt processing and token generation stages.
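Deep-Fusion and the custom GeMM kernels are hand-written CUDA, but the underlying motivation, keeping intermediate results on-chip instead of round-tripping through global memory, can be illustrated at the framework level. The sketch below is only a conceptual stand-in that uses TorchScript to fuse an elementwise bias + GeLU + residual chain; it is not the paper's kernels and makes no claims about their internals.

```python
# Conceptual illustration of operator fusion for a transformer sub-block.
# This is NOT DeepSpeed's Deep-Fusion CUDA kernel; it only shows why fusing
# elementwise operations reduces global-memory traffic at small batch sizes.
import torch

def unfused_bias_gelu_residual(x, bias, residual):
    # Each step materializes an intermediate tensor in global memory.
    y = x + bias
    y = torch.nn.functional.gelu(y)
    return y + residual

@torch.jit.script
def fused_bias_gelu_residual(x: torch.Tensor, bias: torch.Tensor,
                             residual: torch.Tensor) -> torch.Tensor:
    # TorchScript's fuser can combine this elementwise chain into one kernel,
    # so intermediates stay on-chip instead of being written back to memory.
    return torch.nn.functional.gelu(x + bias) + residual

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 4096, device=device)
bias = torch.randn(4096, device=device)
residual = torch.randn_like(x)
assert torch.allclose(unfused_bias_gelu_residual(x, bias, residual),
                      fused_bias_gelu_residual(x, bias, residual), atol=1e-6)
```

The same principle, applied far more aggressively at the CUDA level, is what lets the paper's fused kernels approach peak memory bandwidth in the memory-bound small-batch regime.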

Implications and Future Work

The proposed system, "OURS", not only meets existing requirements for efficient large-scale inference of massive models, but it also sets a foundation for future developments in AI. Transformer models will likely continue their growth in scale and complexity. Systems like "OURS" expand the accessibility of such models by supporting fewer resources and democratizing large model inference. Future work will undoubtedly focus on further improving the system's adaptivity, possibly through enhanced hardware-aware optimizations and more sophisticated network communication strategies.

In conclusion, "OURS" provides an expressive inference framework capable of dynamically managing and optimizing computational resources in line with varied and evolving transformer model requirements, marking significant progress in the domain of scalable AI model deployment.

Authors (11)
  1. Reza Yazdani Aminabadi
  2. Samyam Rajbhandari
  3. Minjia Zhang
  4. Ammar Ahmad Awan
  5. Cheng Li
  6. Du Li
  7. Elton Zheng
  8. Jeff Rasley
  9. Shaden Smith
  10. Olatunji Ruwase
  11. Yuxiong He
Citations (247)