FastMoE: A Fast Mixture-of-Expert Training System (2103.13262v1)

Published 24 Mar 2021 in cs.LG, cs.CL, and cs.DC

Abstract: Mixture-of-Expert (MoE) presents a strong potential in enlarging the size of LLM to trillions of parameters. However, training trillion-scale MoE requires algorithm and system co-design for a well-tuned high performance distributed training system. Unfortunately, the only existing platform that meets the requirements strongly depends on Google's hardware (TPU) and software (Mesh Tensorflow) stack, and is not open and available to the public, especially GPU and PyTorch communities. In this paper, we present FastMoE, a distributed MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and easy adaption to different applications, such as Transformer-XL and Megatron-LM. Different from direct implementation of MoE models using PyTorch, the training speed is highly optimized in FastMoE by sophisticated high-performance acceleration skills. The system supports placing different experts on multiple GPUs across multiple nodes, enabling enlarging the number of experts linearly against the number of GPUs. The source of FastMoE is available at https://github.com/laekov/fastmoe under Apache-2 license.

Citations (79)

Summary

  • The paper introduces FastMoE, a distributed Mixture-of-Experts training system for PyTorch that simplifies large-scale language model training and accelerates it with specialized CUDA kernels.
  • FastMoE trains faster than existing PyTorch-based baselines on a single GPU and scales robustly across multiple GPUs and nodes.
  • In end-to-end training of a GPT model, the MoE-augmented model reaches lower loss than its non-MoE counterpart, indicating improved training efficiency.

Introduction

The advent of large-scale language models has unlocked impressive advances in natural language processing. Building on this potential, the Mixture-of-Expert (MoE) architecture has emerged as a promising way to scale such models to trillions of parameters. Training models of this size, however, is far from straightforward: it requires co-design of the algorithms and a high-performance distributed training system. The one existing platform built for this scale is tied to specific hardware (TPUs) and software, and is not publicly available, particularly to the PyTorch community, which predominantly trains on GPUs.

FastMoE: System Overview

To address this gap, the authors developed FastMoE, a distributed MoE training system compatible with PyTorch and common accelerators. FastMoE offers a hierarchical interface that simplifies both model design and adaptation to different applications. With dedicated CUDA kernels and other targeted optimizations, it accelerates training and supports placing experts on multiple GPUs across multiple nodes, which is crucial for growing model size. It also integrates with existing codebases such as Megatron-LM and Transformer-XL, giving developers additional flexibility. As an open-source system, FastMoE makes training very large language models accessible and efficient on widely available hardware.
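
To make concrete what FastMoE optimizes, the sketch below shows a naive top-1 gated MoE feed-forward layer written directly in PyTorch, the kind of straightforward implementation the paper contrasts FastMoE against. It is illustrative only: the class and parameter names are hypothetical, and it does not use FastMoE's actual interface or CUDA kernels.

```python
# Illustrative sketch: a naive, single-process top-1 MoE feed-forward layer.
# This is NOT FastMoE's API; it shows the computation that FastMoE's
# specialized kernels and expert-parallel dispatch are meant to accelerate.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveMoEFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # One independent feed-forward "expert" per slot.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )
        # The gate scores each token and routes it to a single expert (top-1).
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)   # (num_tokens, num_experts)
        top1 = scores.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(x)
        # Loop over experts; each processes only the tokens routed to it.
        for eid, expert in enumerate(self.experts):
            mask = top1 == eid
            if mask.any():
                out[mask] = expert(x[mask]) * scores[mask, eid].unsqueeze(-1)
        return out


if __name__ == "__main__":
    layer = NaiveMoEFFN(d_model=1024, d_hidden=4096, num_experts=4)
    tokens = torch.randn(8, 1024)
    print(layer(tokens).shape)  # torch.Size([8, 1024])
```

A per-expert Python loop like this processes small, uneven token batches one expert at a time, which is exactly the kind of overhead the paper's kernel-level optimizations are designed to avoid.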

Empirical Performance

Empirical results show that FastMoE outperforms existing PyTorch-based MoE implementations. It trains faster on a single GPU and scales robustly across multiple GPUs and nodes, indicating efficient use of hardware resources. It also delivers tangible end-to-end benefits: when used to train a real GPT model, the MoE model reaches lower loss than its non-MoE counterpart, reflecting higher training efficiency.
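
The multi-node scaling rests on expert parallelism: each worker hosts a subset of the experts, and tokens are exchanged so that every token reaches the worker owning its selected expert. The sketch below illustrates this general idea with torch.distributed's all_to_all_single under simplifying assumptions (one expert per rank, a hypothetical function name, an already-initialized process group); it is not FastMoE's actual communication code.

```python
# Simplified sketch of expert-parallel token dispatch (one expert per rank).
# Assumes torch.distributed.init_process_group() has been called; with the
# NCCL backend all tensors below must live on the GPU.
import torch
import torch.distributed as dist


def dispatch_to_experts(tokens: torch.Tensor, expert_ids: torch.Tensor,
                        world_size: int):
    # tokens: (num_tokens, d_model); expert_ids: (num_tokens,) in [0, world_size)
    # Group tokens so those destined for the same rank are contiguous.
    order = torch.argsort(expert_ids)
    send = tokens[order]
    send_counts = torch.bincount(expert_ids, minlength=world_size)
    # Exchange per-rank token counts first, since routing is uneven.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    # Exchange the token features themselves with variable split sizes.
    recv = send.new_empty(int(recv_counts.sum()), tokens.shape[1])
    dist.all_to_all_single(
        recv, send,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    # recv holds every token routed to this rank's expert; `order` and the
    # counts are needed later to send results back and restore token order.
    return recv, order, send_counts, recv_counts
```

Because each GPU only ever stores its own experts, adding GPUs adds expert capacity, which is the mechanism behind the linear growth in expert count described in the abstract.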

Future Outlook

While FastMoE marks a significant step for large-scale MoE training, its development continues. Planned improvements include better load balancing among experts, more convenient model-handling utilities such as loading and saving MoE models, and further performance gains across multiple GPUs. The project's open-source nature invites community contributions, setting the stage for collective progress in large-model training on GPUs. Overall, FastMoE advances the goal of large-scale model training that is both open and efficient on widely available hardware.
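
For context on the load-balancing item, one widely used approach in the MoE literature is an auxiliary loss that penalizes skewed routing (the Switch Transformer formulation). The sketch below is a generic illustration of that concept, not a mechanism the paper specifies for FastMoE.

```python
# Generic illustration of an MoE load-balancing auxiliary loss; not part of
# FastMoE as described in the paper.
import torch


def load_balance_loss(gate_probs: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
    # gate_probs: (num_tokens, num_experts) softmax outputs of the gate
    # expert_ids: (num_tokens,) top-1 expert chosen for each token
    num_experts = gate_probs.shape[1]
    # f_e: fraction of tokens actually routed to each expert.
    token_fraction = torch.bincount(expert_ids, minlength=num_experts).float()
    token_fraction = token_fraction / expert_ids.numel()
    # p_e: mean gate probability assigned to each expert.
    prob_fraction = gate_probs.mean(dim=0)
    # Minimized when both routing and gate probabilities are uniform.
    return num_experts * torch.sum(token_fraction * prob_fraction)
```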
