ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs (2502.21231v1)

Published 28 Feb 2025 in cs.DC, cs.AI, and cs.LG

Abstract: Scaling long-context ability is essential for LLMs. To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal, and establish static communication groups to organize the devices as a static mesh (e.g., a 2D mesh). However, the sequences for LLM training typically vary in lengths, no matter for texts, multi-modalities or reinforcement learning. The mismatch between data heterogeneity and static mesh causes redundant communication and imbalanced computation, degrading the training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies the inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences by data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences by selective offloading. Besides, we also develop a balance scheduler to mitigate the imbalanced computation by parallelism-aware data assignment. We evaluate ByteScale with the model sizes ranging from 7B to 141B, context lengths from 256K to 2048K, on a production cluster with more than 12,000 GPUs. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.

The paper "ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs" introduces , a novel framework designed to enhance the efficiency, flexibility, and scalability of training LLMs with extended context lengths. addresses critical challenges related to data heterogeneity and the limitations of static parallelism strategies in existing training frameworks. The core innovation is the Hybrid Data Parallelism (HDP) strategy, which unifies inter-data (Data Parallelism) and intra-data partitioning (Context Parallelism) with a dynamic mesh design. The framework incorporates a communication optimizer that eliminates redundant communication for short sequences through data-aware sharding and dynamic communication, and it compresses communication costs for long sequences via selective offloading. A balance scheduler is also introduced to mitigate imbalanced computation through parallelism-aware data assignment.

The key contributions of the paper are:

  • Hybrid Data Parallelism (HDP): A novel parallelism strategy unifying inter- and intra-data partitioning to evenly distribute tokens across devices, processing variable-length sequences with a number of devices in the range [1, DP×CP].
  • Communication Optimizations: Implementation of data-aware sharding, which automatically constructs dynamic communication groups to process each sequence with a minimal number of devices. It also includes selective offloading to compress communication costs for long sequences.
  • Balance Strategy: A heuristic algorithm that reorganizes data assignment based on data and pipeline parallelism characteristics, assigning more micro-batches to devices with shorter execution times.

Background and Motivation

The paper begins by highlighting the increasing demand for LLMs with long-context capabilities, driven by applications such as document summarization, video understanding, agent interaction, and code completion. Scaling to long contexts presents challenges, primarily the quadratic scaling of memory and computation for self-attention mechanisms. Existing frameworks typically treat Data Parallelism (DP) and Context Parallelism (CP) as orthogonal techniques, establishing static communication meshes that are ill-suited for handling the variable sequence lengths common in LLM training data. This mismatch leads to redundant communication and imbalanced computation, degrading training efficiency.

The paper identifies two main challenges:

  1. Redundant Communication: Current practice packs shorter sequences up to the context length to prevent OOM errors, which forces all sequences, regardless of length, to undergo the same partitioning and communication as long sequences. This is particularly wasteful for shorter sequences, where the O(S^2) computation is too small to overlap (hide) the O(S) communication.
  2. Imbalanced Computation: Even token distribution across devices does not guarantee balanced execution times, because the O(S^2) computational cost attributed to each token depends on the length of its sequence. This imbalance leaves some devices idle during synchronization (the toy calculation below illustrates this).
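
To see why equal token counts do not imply equal work, consider a toy calculation (all numbers invented for illustration) that takes per-sequence attention cost as proportional to S^2:

```python
# Illustrative only: why an even token split does not balance attention compute.
# Assumes per-sequence attention cost proportional to S^2 (S = sequence length);
# the numbers are made up for the example, not taken from the paper.

def attention_cost(seq_lens):
    """Toy cost model: sum of S^2 over the sequences a device holds."""
    return sum(s * s for s in seq_lens)

# Two devices, each holding exactly 128K tokens.
device_a = [128_000]             # one long sequence
device_b = [16_000] * 8          # eight short sequences, same total token count

print(attention_cost(device_a))  # 16,384,000,000
print(attention_cost(device_b))  # 2,048,000,000 -> ~8x less work, so this device idles
```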

Proposed Solution

To address these challenges, the paper proposes ByteScale, an efficient and scalable training framework. ByteScale comprises three main components:

  1. Profiler: Characterizes the environment, model configuration, and data distribution to build cost models.
  2. Communication Optimizer: Enhances communication efficiency for both short and long sequences through data-aware sharding, dynamic communication, and selective offloading.
  3. Balance Scheduler: Mitigates imbalanced computation by using parallelism-aware data assignment.
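
As a rough illustration of how profiled statistics can feed the cost models, the sketch below uses a hypothetical interface; the field names and the linear-plus-quadratic cost form are assumptions for illustration, not the paper's actual profiler output:

```python
from dataclasses import dataclass

# Hypothetical sketch of profiler output feeding a per-sequence cost model.
# Field names, the cost form, and all constants are assumptions, not ByteScale's API.

@dataclass
class ProfiledCosts:
    flops_per_token_linear: float   # MLP/projection FLOPs per token
    flops_per_token_pair: float     # attention FLOPs per token pair
    achievable_tflops: float        # measured per-device throughput (TFLOPs/s)

    def step_time(self, seq_len: int) -> float:
        """Estimated compute time (seconds) for one sequence on one device."""
        flops = (self.flops_per_token_linear * seq_len
                 + self.flops_per_token_pair * seq_len ** 2)
        return flops / (self.achievable_tflops * 1e12)

costs = ProfiledCosts(flops_per_token_linear=7e10,
                      flops_per_token_pair=3e5,
                      achievable_tflops=300.0)
print(costs.step_time(256_000))  # estimate consumed by the optimizer and scheduler
```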

Communication Optimizer

The communication optimizer reduces overhead through:

  • Hybrid Data Parallelism (HDP): A parallelism strategy that unifies inter-data (DP) and intra-data partitioning (CP), distributing tokens evenly across devices and using devices in the range of [1, DP×CP] to flexibly process variable-length sequences.
  • NCCL Buffer Optimization: Creating a global communication group across all HDP ranks to enable P2P communication, reducing the overhead of establishing communication groups.
  • Optimizer States Sharding: Using ZeRO-1 across all HDP ranks to shard optimizer states, minimizing memory usage.
  • Data-Aware Selective Offloading: Activation offloading to CPU memory, which reduces the required number of HDP ranks. The method selectively offloads tokens based on FLOPs, using an offload ratio r that minimizes the number of HDP ranks D(s_i) required for a sequence of length s_i:

    \arg\min_{r} D(s_i)

    • D(s_i): Number of HDP ranks required for sequence s_i
    • s_i: Length of sequence i

    subject to constraints on computation time, activation size, and offload ratio [Eq. (1)].
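
The following is a minimal sketch of the intuition behind data-aware sharding combined with selective offloading, under simplified assumptions: a fixed per-rank token budget, a small set of candidate offload ratios, and a purely linear effect of offloading on that budget. It is not the paper's exact optimization, which also constrains computation time:

```python
import math

# Simplified sketch: pick, per sequence, the fewest HDP ranks that fit its
# activations, optionally offloading a fraction of them to CPU memory.
# TOKENS_PER_RANK, OFFLOAD_RATIOS, and the linear budget model are assumptions.

TOKENS_PER_RANK = 64_000           # hypothetical activation budget per HDP rank
OFFLOAD_RATIOS = [0.0, 0.25, 0.5]  # candidate fractions of activations to offload

def ranks_needed(seq_len, offload_ratio):
    """Ranks required if a fraction `offload_ratio` of activations lives on CPU."""
    effective_budget = TOKENS_PER_RANK / (1.0 - offload_ratio)
    return math.ceil(seq_len / effective_budget)

def min_ranks(seq_len):
    """argmin over r of D(s_i): the offload ratio that needs the fewest ranks."""
    return min(ranks_needed(seq_len, r) for r in OFFLOAD_RATIOS)

print(min_ranks(8_000))      # 1 rank: a short sequence needs no cross-rank communication
print(min_ranks(2_048_000))  # a long sequence spans many ranks, fewer with offloading
```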

Balance Scheduler

The balance scheduler mitigates imbalances in DP and PP.

  • It redefines micro-batch processing, allowing different HDP ranks to process varying numbers of micro-batches to balance computational load.
  • It addresses PP imbalance by assigning sequences of similar length levels to separate pipelines, reducing PP bubbles.
  • It tackles DP imbalance by ensuring load balance at each time step when pipeline parallelism is not used.

The balance strategy involves sorting sequences by length, dividing them into buckets with equal FLOPs, and assigning sequences to target ranks based on execution times, using either a DP-Balance or PP-Balance approach, described in Algorithm 2.
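
To convey the flavor of such a heuristic, below is a simplified greedy sketch in the spirit of DP-Balance: sequences are handed out longest-first to the currently least-loaded rank. It omits the FLOP-equal bucketing and the PP-Balance case of the paper's Algorithm 2, and the cost model is an assumption:

```python
import heapq

# Greedy load-balancing sketch (longest-processing-time style), not Algorithm 2.
# The linear + quadratic cost coefficients are placeholders.

def estimated_cost(seq_len, linear_coeff=1.0, quad_coeff=1e-4):
    """Hypothetical per-sequence cost: linear (MLP) plus quadratic (attention) terms."""
    return linear_coeff * seq_len + quad_coeff * seq_len * seq_len

def dp_balance(seq_lens, num_ranks):
    """Assign sequences to ranks so their estimated execution times stay close."""
    heap = [(0.0, rank) for rank in range(num_ranks)]   # (current load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}
    for s in sorted(seq_lens, reverse=True):            # longest first
        load, rank = heapq.heappop(heap)
        assignment[rank].append(s)
        heapq.heappush(heap, (load + estimated_cost(s), rank))
    return assignment

print(dp_balance([256_000, 64_000, 64_000, 32_000, 16_000, 16_000], num_ranks=2))
```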

Implementation Details

The implementation incorporates several optimizations:

  • Group Query Attention (GQA) to reduce communication volume.
  • Optimized dist-attn for packed sequences to avoid heterogeneous computation and communication within CP groups.
  • A remote dataloader using Ray to provide real-time scheduling and planning capabilities.
  • Fused SoftmaxCrossEntropy, which fuses operations into a single kernel, takes BF16 inputs, and performs online computations in FP32 precision, saving both time and memory.
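
The numerical pattern of the fused loss (BF16 inputs, FP32 math) can be illustrated with an unfused PyTorch reference; this is only a sketch of the precision behavior, not the fused kernel, which additionally merges the softmax and loss into a single online pass to avoid materializing FP32 logits:

```python
import torch
import torch.nn.functional as F

# Unfused reference for the precision pattern: BF16 logits, FP32 loss math.
# Shapes and values are arbitrary; this is not ByteScale's fused kernel.

batch, vocab = 4, 32_000
logits_bf16 = torch.randn(batch, vocab, dtype=torch.bfloat16)
targets = torch.randint(vocab, (batch,))

# Upcast once, then compute softmax cross-entropy in FP32.
loss = F.cross_entropy(logits_bf16.float(), targets)
print(loss)
```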

Experimental Results

The paper evaluates ByteScale on a production cluster with over 12,000 GPUs, using model sizes from 7B to 141B and context lengths from 256K to 2048K. Results show that ByteScale achieves speedups of up to 7.89× compared to existing training approaches.

  • The HDP naive solution reduces communication overhead but leaves some ranks idle due to imbalance. The HDP balance solution eliminates these bubble times.
  • The HDP balance solution achieves a higher speedup compared to the PP-Balance.
  • The HDP naive solution reduces peak RDMA traffic and increases tensor core utilization.
  • Selective offloading complements the HDP naive solution.
  • The balance strategy stabilizes RDMA traffic and maintains consistent tensor core utilization.

The paper concludes by emphasizing that ByteScale effectively addresses the challenges of training LLMs with long context windows, offering significant improvements in training efficiency and scalability.

Authors (9)
  1. Hao Ge
  2. Junda Feng
  3. Qi Huang
  4. Fangcheng Fu
  5. Xiaonan Nie
  6. Lei Zuo
  7. Haibin Lin
  8. Bin Cui
  9. Xin Liu