
Galvatron: An Automatic Distributed System for Efficient Foundation Model Training (2504.21411v1)

Published 30 Apr 2025 in cs.DC, cs.AI, and cs.LG

Abstract: Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system's architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron's superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at https://github.com/PKU-DAIR/Hetu-Galvatron.

Summary

  • The paper introduces Galvatron, an automatic distributed system that optimizes hybrid parallelism strategies for efficient training of large-scale foundation models.
  • It employs a three-component architecture—profiling, dynamic programming-based search, and a versatile runtime—to customize parallelism at a layer level.
  • Empirical benchmarks demonstrate throughput improvements of up to 1.47x over manually tuned systems, while avoiding out-of-memory (OOM) errors and reducing tuning complexity.

Training large-scale Foundation Models, such as LLMs, is computationally intensive and requires distributed systems. A major challenge in achieving efficient distributed training is selecting the optimal combination of parallelization strategies (like data, tensor, pipeline, and sharded data parallelism), which can be highly complex and time-consuming, often relying on expert knowledge and extensive manual tuning. Galvatron is introduced as an automatic distributed system designed to overcome this challenge by intelligently identifying and implementing efficient hybrid parallelism strategies.

Galvatron achieves its goal through a three-component architecture:

  1. Profiler: This module analyzes both the hardware environment and the specific model architecture. It measures critical metrics such as inter-device communication bandwidth and single-device computational throughput. For the model, it profiles the computational patterns and memory requirements (including model states and activations) of individual layers. This detailed profiling provides the foundational data for cost models.
  2. Search Engine: Using the data from the profiler, the search engine is the core optimization component. It explores the vast configuration space of hybrid parallel strategies, discards infeasible configurations (e.g., those exceeding memory limits), and constructs cost models estimating the time and memory consumption of different strategies for each model layer. By employing dynamic programming, it identifies the most efficient combination of parallel strategies on a layer-by-layer basis, balancing memory usage against computation and communication costs under the hardware constraints (a simplified sketch of this layer-wise search follows the list). The system includes a visualization plugin to help users understand the cost model.
  3. Runtime: This module is responsible for executing the distributed training based on the optimal strategy determined by the search engine. It supports a comprehensive set of parallel techniques, including data parallelism, tensor parallelism, pipeline parallelism, sharded data parallelism (like ZeRO/FSDP), sequence/context parallelism, and recomputation. It encapsulates these individual methods into efficient hybrid parallel models. The runtime is designed for ease of use, allowing users to integrate Galvatron with minimal code changes. A key part of the user interface involves functions like get_hybrid_parallel_configs to retrieve the determined strategy and construct_hybrid_parallel_model to apply it to the user's original model definition.
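
As an intuition for the layer-wise search, the following sketch shows how a knapsack-style dynamic program could pick one strategy per layer so that estimated step time is minimized while the summed per-layer memory stays within a device budget. The candidate strategy names, their time and memory numbers, and the memory granularity are hypothetical placeholders, not Galvatron's actual cost models or search implementation.

# Illustrative sketch of layer-wise strategy selection via dynamic programming.
# Candidate strategies and their per-layer (time, memory) costs are made up;
# in Galvatron these estimates come from the profiler-derived cost models.
CANDIDATES = {
    # name: (estimated step time in ms, estimated memory in GB) per layer
    "dp8":          (12.0, 6.0),   # pure data parallelism
    "tp2_dp4":      (10.5, 4.0),   # tensor parallel x2, data parallel x4
    "tp2_dp4_ckpt": (13.0, 2.5),   # same, plus activation recomputation
}

def search_layer_strategies(num_layers, memory_budget_gb, granularity=0.5):
    """Knapsack-style DP: minimize total estimated time over all layers,
    subject to the summed per-layer memory staying under the budget."""
    buckets = int(memory_budget_gb / granularity)
    INF = float("inf")
    # best[b] = (total time, chosen strategies) with b memory buckets still free
    best = [(INF, [])] * (buckets + 1)
    best[buckets] = (0.0, [])
    for _ in range(num_layers):
        nxt = [(INF, [])] * (buckets + 1)
        for b, (t, chosen) in enumerate(best):
            if t == INF:
                continue
            for name, (layer_time, layer_mem) in CANDIDATES.items():
                nb = b - int(round(layer_mem / granularity))
                if nb >= 0 and t + layer_time < nxt[nb][0]:
                    nxt[nb] = (t + layer_time, chosen + [name])
        best = nxt
    feasible = [entry for entry in best if entry[0] != INF]
    if not feasible:
        raise RuntimeError("No strategy combination fits the memory budget")
    return min(feasible)  # (total estimated time, per-layer strategy list)

time_ms, plan = search_layer_strategies(num_layers=4, memory_budget_gb=16.0)
print(f"estimated time {time_ms:.1f} ms, plan: {plan}")

With a 16 GB budget this toy search keeps the faster tensor-parallel candidate for every layer; tightening the budget forces the lower-memory recomputation candidate onto some layers, which is the kind of memory-versus-time trade-off the real search engine automates.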

The workflow typically involves: hardware profiling, model profiling, strategy searching using the profiled data to build a cost model and find the optimal configuration, and finally, executing the training using the runtime with the selected strategy. This automated process simplifies the user's role, requiring only specification of the hardware environment and model details.
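
To make the profiling and cost-model steps concrete, the sketch below shows one simple way profiled measurements (an alpha-beta model of inter-device communication plus measured per-layer compute time and memory footprints) could be turned into per-layer time and memory estimates for a candidate strategy. All dictionary keys, constants, and formulas here are illustrative assumptions rather than Galvatron's actual cost model.

# Illustrative cost-model sketch built from (hypothetical) profiler outputs.
PROFILED = {
    "link_latency_s": 5e-6,                      # alpha: per-message latency
    "link_bandwidth_gb_per_s": 300.0,            # beta term: bandwidth in GB/s
    "layer_fwd_bwd_time_s": 0.011,               # measured compute time per layer
    "layer_param_bytes": 12 * 4096 ** 2 * 2,     # ~12*h^2 fp16 params per layer
    "layer_activation_bytes": 2 * 1024 ** 3,     # activations per layer (2 GiB)
}

def comm_time(bytes_moved, p):
    """Alpha-beta estimate of a ring all-reduce over p devices."""
    alpha = PROFILED["link_latency_s"]
    beta = 1.0 / (PROFILED["link_bandwidth_gb_per_s"] * 1e9)  # seconds per byte
    # a ring all-reduce moves roughly 2*(p-1)/p of the data per device
    return alpha * (p - 1) + 2 * (p - 1) / p * bytes_moved * beta

def estimate_layer(strategy):
    """Estimated (time_s, memory_bytes) of one layer under a candidate strategy."""
    dp, tp, recompute = strategy["dp"], strategy["tp"], strategy["recompute"]
    params = PROFILED["layer_param_bytes"] / tp      # parameters sharded by TP
    acts = PROFILED["layer_activation_bytes"] / tp
    time = PROFILED["layer_fwd_bwd_time_s"] / tp     # compute split across TP ranks
    time += comm_time(params, dp)                    # gradient all-reduce across DP ranks
    if recompute:
        time *= 1.33                                 # roughly one extra forward pass
        acts *= 0.1                                  # keep only checkpointed activations
    memory = 8 * params + acts   # ~16 bytes/param for mixed-precision Adam states
    return time, memory

t, m = estimate_layer({"dp": 4, "tp": 2, "recompute": True})
print(f"~{t * 1e3:.2f} ms/layer, ~{m / 1e9:.2f} GB/layer")

Per-layer estimates like these, computed for every candidate strategy, are what a dynamic-programming search such as the sketch above would consume when choosing the final configuration.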

Galvatron offers fine-grained, layer-level customization of parallelism, meaning different layers within a Transformer model can adopt distinct strategies for maximum efficiency. It is also designed to be versatile, supporting model architectures beyond LLMs (e.g., vision models) and diverse hardware platforms such as NVIDIA GPUs (H100, A100, 4090), Ascend NPUs [ascendAI2023], and Hygon DCUs.
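
One way to picture such a layer-level plan is as a list with one entry per layer, where each entry records that layer's parallel degrees, pipeline stage, and recomputation choice. The structure below is a purely hypothetical illustration for a 16-GPU, two-stage pipeline; it is not the representation actually returned by Galvatron's get_hybrid_parallel_configs.

# Hypothetical per-layer plan: each layer carries its own combination of
# parallel degrees and recomputation flag (2 pipeline stages x 8 GPUs per stage).
layer_plan = [
    {"layer": 0, "dp": 2, "tp": 4, "pp_stage": 0, "sharded_dp": True,  "recompute": True},
    {"layer": 1, "dp": 2, "tp": 4, "pp_stage": 0, "sharded_dp": True,  "recompute": True},
    {"layer": 2, "dp": 4, "tp": 2, "pp_stage": 1, "sharded_dp": False, "recompute": False},
    {"layer": 3, "dp": 4, "tp": 2, "pp_stage": 1, "sharded_dp": False, "recompute": False},
]

for cfg in layer_plan:
    print(f"layer {cfg['layer']}: dp={cfg['dp']} tp={cfg['tp']} "
          f"stage={cfg['pp_stage']} zero={cfg['sharded_dp']} ckpt={cfg['recompute']}")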

Empirical evaluation demonstrates Galvatron's effectiveness. Benchmarks on various GPU clusters show that Galvatron achieves up to 1.26x–1.47x higher throughput than state-of-the-art frameworks such as Megatron [DBLP:conf/sc/NarayananSCLPKV21] and DeepSpeed [DBLP:conf/kdd/RasleyRRH20], which rely on manual tuning. Galvatron's automatic strategy selection provides consistent efficiency and can prevent out-of-memory (OOM) errors that suboptimal manual configurations may cause.

The system is open source and has been adopted both in academic research [DBLP:conf/asplos/WangWZFLXLLW025, DBLP:conf/asplos/WangZFMZZHLC25] and in industrial applications at organizations including ByteDance, Huawei, ZTE, and BAAI, highlighting its practical utility for efficient large-scale foundation model training.

For practical implementation, users would typically integrate Galvatron into their existing training script by replacing the standard model construction with Galvatron's API calls after defining the model and hardware configuration, as in the illustrative snippet below:

import torch
import galvatron  # illustrative import; consult the Galvatron docs for the exact entry points

# Cluster description (values are examples)
hardware_config = {
    "n_gpu": 8,
    "gpu_type": "A100",
    # ... other cluster specifics
}

# Model description (values are examples)
model_config = {
    "num_layers": 32,
    "hidden_size": 4096,
    # ... other model specifics
}

# original_model_definition: the user's existing model class (e.g., a Transformer),
# defined elsewhere in the training script.

# Ask the search engine for the optimal hybrid parallel strategy
parallel_configs = galvatron.get_hybrid_parallel_configs(
    model_class=original_model_definition,
    model_config=model_config,
    hardware_config=hardware_config
)

# Wrap the original model according to the selected strategy
hybrid_model = galvatron.construct_hybrid_parallel_model(
    model_class=original_model_definition,
    model_config=model_config,
    parallel_configs=parallel_configs
)

optimizer = torch.optim.Adam(hybrid_model.parameters(), lr=1e-4)  # example learning rate

The complexity of selecting and coordinating diverse parallelism strategies is abstracted away, allowing practitioners to focus on model development and training rather than intricate system tuning. The system is available as open-source software at https://github.com/PKU-DAIR/Hetu-Galvatron, with detailed documentation online.
