- The paper introduces Galvatron, an automatic distributed system that optimizes hybrid parallelism strategies for efficient training of large-scale foundation models.
- It employs a three-component architecture—profiling, dynamic programming-based search, and a versatile runtime—to customize parallelism at a layer level.
- Empirical benchmarks demonstrate throughput improvements of up to 1.47x over manually tuned systems, while avoiding out-of-memory (OOM) errors and reducing tuning complexity.
Training large-scale foundation models, such as LLMs, is computationally intensive and requires distributed systems. A major challenge in achieving efficient distributed training is selecting the optimal combination of parallelization strategies (data, tensor, pipeline, and sharded data parallelism, among others); this selection is highly complex and time-consuming and often relies on expert knowledge and extensive manual tuning. Galvatron is introduced as an automatic distributed system designed to overcome this challenge by intelligently identifying and implementing efficient hybrid parallelism strategies.
Galvatron achieves its goal through a three-component architecture:
- Profiler: This module analyzes both the hardware environment and the specific model architecture. It measures critical metrics such as inter-device communication bandwidth and single-device computational throughput. For the model, it profiles the computational patterns and memory requirements (including model states and activations) of individual layers. This detailed profiling provides the foundational data for cost models.
- Search Engine: Using the data from the profiler, the search engine is the core optimization component. It explores the vast configuration space of hybrid parallel strategies: it models this space, discards infeasible configurations (e.g., those exceeding memory limits), and constructs accurate cost models that estimate the time and memory consumption of different strategies for each model layer. By employing dynamic programming, it identifies the most efficient combination of parallel strategies on a layer-by-layer basis, balancing memory usage against computation and communication costs under the given hardware constraints (a simplified sketch of this layer-wise search follows this component list). The system includes a visualization plugin to help users understand the cost model.
- Runtime: This module executes the distributed training according to the optimal strategy determined by the search engine. It supports a comprehensive set of parallel techniques, including data parallelism, tensor parallelism, pipeline parallelism, sharded data parallelism (e.g., ZeRO/FSDP), sequence/context parallelism, and recomputation, and encapsulates these individual methods into efficient hybrid parallel models. The runtime is designed for ease of use, allowing users to integrate Galvatron with minimal code changes. A key part of the user interface involves functions such as `get_hybrid_parallel_configs`, which retrieves the determined strategy, and `construct_hybrid_parallel_model`, which applies it to the user's original model definition.
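The search engine's dynamic programming can be illustrated with a deliberately simplified knapsack-style sketch: given per-layer candidate strategies already priced by a profiler-backed cost model, pick one strategy per layer so that total estimated time is minimized under a device memory budget. All names below (`Strategy`, `search_layerwise_plan`, the strategy labels) are hypothetical, and the sketch omits much of what Galvatron actually models, such as communication overlap, pipeline stage partitioning, and recomputation trade-offs.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Strategy:
    """One candidate parallelization for a single layer, as priced by a cost model."""
    name: str           # e.g. "dp8", "tp4_dp2", "tp2_pp2_dp2"
    time_cost: float    # estimated per-layer step time (ms)
    memory_cost: float  # estimated per-layer memory footprint (GB)

def search_layerwise_plan(
    layer_costs: List[List[Strategy]],  # candidate strategies for each layer
    memory_budget_gb: float,
    mem_step_gb: float = 0.5,           # memory discretization granularity
) -> Tuple[float, List[str]]:
    """Choose one strategy per layer, minimizing total time under a memory budget."""
    n_buckets = int(memory_budget_gb / mem_step_gb)
    INF = float("inf")
    # dp[b] = (min total time, chosen strategies) with total memory <= b * mem_step_gb
    dp: List[Tuple[float, List[str]]] = [(0.0, []) for _ in range(n_buckets + 1)]
    for candidates in layer_costs:
        new_dp = [(INF, []) for _ in range(n_buckets + 1)]
        for b in range(n_buckets + 1):
            for s in candidates:
                need = math.ceil(s.memory_cost / mem_step_gb)
                if need <= b and dp[b - need][0] + s.time_cost < new_dp[b][0]:
                    new_dp[b] = (dp[b - need][0] + s.time_cost,
                                 dp[b - need][1] + [s.name])
        dp = new_dp
    best_time, plan = min(dp, key=lambda entry: entry[0])
    if best_time == INF:
        raise RuntimeError("no feasible layer-wise plan fits the memory budget")
    return best_time, plan

# Example: four identical Transformer layers, two candidate strategies each.
candidates = [
    Strategy("tp4_dp2", time_cost=3.1, memory_cost=2.0),  # slower, memory-frugal
    Strategy("dp8",     time_cost=2.4, memory_cost=4.5),  # faster, memory-hungry
]
best_time, plan = search_layerwise_plan([candidates] * 4, memory_budget_gb=14.0)
print(best_time, plan)  # the faster strategy is picked only where memory allows
```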
The workflow typically involves: hardware profiling, model profiling, strategy searching using the profiled data to build a cost model and find the optimal configuration, and finally, executing the training using the runtime with the selected strategy. This automated process simplifies the user's role, requiring only specification of the hardware environment and model details.
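To make the profiling phase concrete, the following sketch times a single Transformer layer and records its peak memory with standard PyTorch utilities. It only illustrates the kind of per-layer measurements such a profiler gathers; it is not Galvatron's profiler, and it assumes a CUDA device is available.

```python
import torch
import torch.nn as nn

def profile_layer(layer: nn.Module, sample: torch.Tensor, warmup: int = 3, iters: int = 10):
    """Return average forward time (ms) and peak memory (MB) for a single layer."""
    layer, sample = layer.cuda(), sample.cuda()
    for _ in range(warmup):                    # warm up CUDA kernels and the allocator
        layer(sample)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        layer(sample)
    end.record()
    torch.cuda.synchronize()
    avg_ms = start.elapsed_time(end) / iters
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return avg_ms, peak_mb

# Example: one Transformer encoder layer at a given batch size and sequence length.
encoder_layer = nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True)
tokens = torch.randn(4, 1024, 4096)            # (batch, sequence length, hidden size)
print(profile_layer(encoder_layer, tokens))
```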
Galvatron offers fine-grained, layer-level customization of parallelism, meaning different layers within a Transformer model can adopt distinct strategies for maximum efficiency. It is also designed to be versatile, supporting various model architectures beyond LLMs (e.g., vision models) and compatible with diverse hardware platforms like NVIDIA GPUs (H100, A100, 4090), Ascend NPUs [ascendAI2023], and Hygon DCUs.
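As a rough illustration of what layer-level customization means in practice, a hybrid plan can assign different parallel settings to different groups of layers. The structure below is purely illustrative and is not Galvatron's actual configuration format.

```python
# Illustrative layer-level plan: each group of layers carries its own degrees of
# tensor/data parallelism, sharding level, and recomputation setting.
layerwise_plan = [
    {"layers": range(0, 8),   "tp": 4, "dp": 2, "sharding": "zero3", "recompute": True},
    {"layers": range(8, 24),  "tp": 2, "dp": 4, "sharding": "zero2", "recompute": False},
    {"layers": range(24, 32), "tp": 1, "dp": 8, "sharding": "zero1", "recompute": False},
]
```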
Empirical evaluation demonstrates Galvatron's effectiveness. Benchmarks on various GPU clusters show that Galvatron achieves up to 1.26x–1.47x higher throughput than state-of-the-art frameworks such as Megatron [DBLP:conf/sc/NarayananSCLPKV21] and DeepSpeed [DBLP:conf/kdd/RasleyRRH20] that rely on manual tuning. Galvatron's automatic adjustment provides consistent efficiency and can prevent Out-of-Memory (OOM) errors that might occur with suboptimal manual configurations.
The system is open-source and has been adopted in both academic research [DBLP:conf/asplos/WangWZFLXLLW025, DBLP:conf/asplos/WangZFMZZHLC25] and industrial applications at companies including ByteDance, Huawei, ZTE, and BAAI, highlighting its practical utility for efficient large-scale foundation model training.
For practical implementation, users would typically integrate Galvatron into their existing training script by replacing the standard model construction with Galvatron's API calls after defining the model and hardware configuration:
```python
import torch
import galvatron  # assumed import path; consult the Galvatron documentation for the exact module layout

# Describe the target cluster and the model to be trained (values are examples only).
hardware_config = {
    "n_gpu": 8,           # example
    "gpu_type": "A100",   # example
    # ... other cluster specifics
}
model_config = {
    "num_layers": 32,     # example
    "hidden_size": 4096,  # example
    # ... other model specifics
}

# Search for an efficient layer-level hybrid parallel strategy for this model on this hardware.
parallel_configs = galvatron.get_hybrid_parallel_configs(
    model_class=original_model_definition,  # the user's original model definition
    model_config=model_config,
    hardware_config=hardware_config,
)

# Wrap the original model so that each layer uses its selected parallelization strategy.
hybrid_model = galvatron.construct_hybrid_parallel_model(
    model_class=original_model_definition,
    model_config=model_config,
    parallel_configs=parallel_configs,
)

optimizer = torch.optim.Adam(hybrid_model.parameters(), lr=1e-4)  # example learning rate
```
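Once the hybrid model and optimizer are built, training can proceed largely as usual. The loop below is a minimal sketch that assumes `hybrid_model` exposes an ordinary forward/backward interface and returns a loss; real Galvatron runs (particularly with pipeline parallelism) may instead use the runtime's own training-step utilities, and data loading and distributed launch are omitted.

```python
# Minimal training-loop sketch under the assumptions stated above.
hybrid_model.train()
for step, (input_ids, labels) in enumerate(train_loader):  # train_loader: user-provided DataLoader
    optimizer.zero_grad()
    loss = hybrid_model(input_ids, labels=labels)          # assumed loss-returning call signature
    loss.backward()
    optimizer.step()
```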
The complexity of selecting and coordinating diverse parallelism strategies is abstracted away, allowing practitioners to focus on model development and training rather than intricate system tuning. The system is available as open-source software at https://github.com/PKU-DAIR/Hetu-Galvatron, with detailed documentation available online.