Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models (2302.02599v2)

Published 6 Feb 2023 in cs.LG, cs.AI, and cs.DC

Abstract: In recent years, large-scale models have demonstrated state-of-the-art performance across various domains. However, training such models requires various techniques to address the problem of limited computing power and memory on devices such as GPUs. Some commonly used techniques include pipeline parallelism, tensor parallelism, and activation checkpointing. While existing works have focused on finding efficient distributed execution plans (Zheng et al. 2022) and activation checkpoint scheduling (Herrmann et al. 2019, Beaumont et al. 2021), there has been no method proposed to optimize these two plans jointly. Moreover, ahead-of-time compilation relies heavily on accurate memory and computing overhead estimation, which is often time-consuming and misleading. Existing training systems and machine learning pipelines either physically execute each operator or estimate memory usage with a scaled input tensor. To address these challenges, we introduce a system that can jointly optimize distributed execution and gradient checkpointing plans. Additionally, we provide an easy-to-use symbolic profiler that generates memory and computing statistics for any PyTorch model with a minimal time cost. Our approach allows users to parallelize their model training on the given hardware with minimal code changes. The source code is publicly available at https://github.com/hpcaitech/ColossalAI

Overview of "Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models"

The paper presents Colossal-Auto, an automated system designed to optimize the parallelization and activation checkpointing of large-scale machine learning models. The system addresses significant challenges in training these models efficiently by jointly optimizing distributed execution plans and gradient checkpointing strategies, minimizing human intervention and reducing computational overhead.

Key Contributions

  1. Joint Optimization Framework: The authors propose a system that concurrently optimizes intra-op parallelism and activation checkpointing, so the checkpointing schedule is chosen in light of the parallel execution plan's memory footprint rather than tuned separately afterwards. This joint search adapts to the given hardware configuration.
  2. Symbolic Profiler: The paper introduces a symbolic profiler that generates computation and memory-usage statistics for PyTorch models at minimal time cost. By avoiding full runtime execution, it supplies the accurate cost estimates that ahead-of-time execution planning depends on.
  3. Hierarchical and Heuristic Methods: The system incorporates a hierarchical optimization method and a heuristic algorithm for tensor layout conversion, supporting efficient sharding and layout conversion across arbitrary device-mesh dimensions; a toy sketch of such a conversion search follows this list. This significantly reduces the search-space complexity of the planning problem.
  4. Automatic Distributed Execution Code Generation: Colossal-Auto transforms serial model code into parallel execution code that integrates the chosen intra-op parallelism and activation-checkpointing plans, and it remains compatible with distributed training frameworks such as ZeRO-Offload and PatrickStar.
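
To make the layout-conversion idea in item 3 concrete, here is a toy sketch of a one-collective-at-a-time conversion search. The sharding-spec encoding (tensor dim -> device-mesh axis, with None meaning replicated), the move set, and the breadth-first search are illustrative assumptions, not Colossal-Auto's actual data structures or algorithm; the paper's heuristic would also weigh communication cost, which is ignored here.

    # Toy sketch (not Colossal-Auto's API). A sharding spec maps tensor
    # dims -> device-mesh axes; None means the dim is replicated.
    def one_step_moves(spec, mesh_axes):
        """Yield (new_spec, label) pairs reachable with a single collective."""
        used = {a for a in spec.values() if a is not None}
        for dim, axis in spec.items():
            if axis is not None:
                new = dict(spec); new[dim] = None   # all-gather: un-shard this dim
                yield new, f"all-gather dim {dim} over mesh axis {axis}"
            else:
                for free in mesh_axes - used:       # split: shard along a free axis
                    new = dict(spec); new[dim] = free
                    yield new, f"split dim {dim} over mesh axis {free}"

    def convert(src, dst, mesh_axes, max_steps=4):
        """Breadth-first search for a short collective sequence from src to dst."""
        frontier, seen = [(src, [])], {tuple(sorted(src.items()))}
        for _ in range(max_steps):
            nxt = []
            for spec, path in frontier:
                if spec == dst:
                    return path
                for new, label in one_step_moves(spec, mesh_axes):
                    key = tuple(sorted(new.items()))
                    if key not in seen:
                        seen.add(key)
                        nxt.append((new, path + [label]))
            frontier = nxt
        return None  # no conversion found within the step budget

    # Example: row-sharded -> column-sharded weight on a one-axis mesh {0}:
    print(convert({0: 0, 1: None}, {0: None, 1: 0}, mesh_axes={0}))
    # ['all-gather dim 0 over mesh axis 0', 'split dim 1 over mesh axis 0']

A production planner would additionally consider one-shot collectives such as all-to-all and rank candidate paths by modeled communication volume rather than step count.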

System Design

Colossal-Auto operates with three primary components: the analyzer, the two-stage solver, and the generator.

  • Analyzer:

Utilizes the PyTorch FX module to create a static computation graph for profiling. It gathers performance data regarding both computation and communication, providing necessary inputs for optimization.
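
For a rough feel of this step (a sketch using stock PyTorch FX, not the paper's profiler API), the snippet below traces a small model into a static graph and annotates every node with shape and dtype metadata. Note that FX's built-in ShapeProp obtains this metadata by actually executing the operators; the point of Colossal-Auto's symbolic profiler is to produce comparable per-node statistics without that real execution cost.

    import torch
    import torch.fx
    from torch.fx.passes.shape_prop import ShapeProp

    class MLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(1024, 4096)
            self.fc2 = torch.nn.Linear(4096, 1024)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    gm = torch.fx.symbolic_trace(MLP())            # static computation graph
    ShapeProp(gm).propagate(torch.randn(8, 1024))  # annotate nodes with shapes

    for node in gm.graph.nodes:
        meta = node.meta.get('tensor_meta')
        shape = tuple(meta.shape) if meta is not None else None
        print(f"{node.op:>14}  {str(node.target):<24}  {shape}")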

  • Two-Stage Solver:

Combines intra-op parallelism and activation checkpointing into a cohesive solution. This solver leverages an ILP formulation to identify optimal strategies for minimizing execution time under memory constraints.
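
For intuition, a strategy-selection ILP of this flavor (an illustrative formulation with assumed symbols, not the paper's exact one) can be written with binary variables s[v,i] choosing one parallelization strategy i per graph node v, and auxiliary variables e[u,v,i,j] linearizing the pairwise communication term, all costs assumed nonnegative:

    minimize    Σ_v Σ_i c[v,i]·s[v,i]  +  Σ_(u,v) Σ_(i,j) r[u,v,i,j]·e[u,v,i,j]
    subject to  Σ_i s[v,i] = 1                      for every node v
                e[u,v,i,j] ≥ s[u,i] + s[v,j] − 1    for every edge (u,v)
                Σ_v Σ_i m[v,i]·s[v,i] ≤ M_budget
                s[v,i], e[u,v,i,j] ∈ {0,1}

Here c[v,i] is the compute cost of strategy i at node v, r[u,v,i,j] the resharding cost between adjacent nodes' strategies, and m[v,i] a memory contribution (a simplification; a real planner tracks liveness and peak memory). The second stage then selects activation-checkpointing sites, trading recomputation time against freed memory under the same budget.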

  • Generator:

Utilizes compilation passes and code generation to translate optimized execution plans back into PyTorch-compatible code, thereby automating the model transformation process.
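
Mechanically, "compilation passes plus code generation" can be pictured with stock FX (again a sketch, not Colossal-Auto's own passes): rewrite the traced graph, e.g. to wrap a submodule call in torch.utils.checkpoint, then ask FX to regenerate Python source. The snippet reuses the MLP class from the analyzer sketch above.

    import torch
    import torch.fx
    from torch.utils.checkpoint import checkpoint

    def checkpoint_submodule(gm: torch.fx.GraphModule, name: str):
        """Rewrite one call_module node to run under activation checkpointing."""
        for node in list(gm.graph.nodes):
            if node.op == 'call_module' and node.target == name:
                with gm.graph.inserting_before(node):
                    submod = gm.graph.get_attr(name)   # reference to the submodule
                    wrapped = gm.graph.call_function(
                        checkpoint, args=(submod, *node.args),
                        kwargs={'use_reentrant': False})
                node.replace_all_uses_with(wrapped)
                gm.graph.erase_node(node)
        gm.graph.lint()
        gm.recompile()   # regenerate gm.code (Python source) from the graph
        return gm

    gm = checkpoint_submodule(torch.fx.symbolic_trace(MLP()), 'fc1')
    print(gm.code)       # the emitted forward() now calls checkpoint(...)

In Colossal-Auto, analogous passes materialize the solver's decisions in the graph before code generation emits the parallel, checkpointed PyTorch code.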

Implications and Future Prospects

The work on Colossal-Auto offers substantial implications for the field of AI, particularly in large-scale model training. By automating the optimization of distributed training plans, it reduces the dependency on expert knowledge and enables more researchers to train extensive models effectively.

The paper's contributions have significant potential to influence the development of AI systems by introducing automation in areas that typically require substantial manual engineering effort. This progress paves the way for further research and enhancement of automatic parallelization techniques, potentially expanding to cover both intra-op and inter-op parallelisms comprehensively.

In future developments, extending Colossal-Auto to apply across a broader range of models and integrating it more deeply with emerging AI infrastructure could further increase its utility. Additionally, exploring its behavior across varied hardware configurations, including new accelerator technologies, could unlock further capabilities in large-scale AI deployment.

In conclusion, Colossal-Auto represents a significant advancement in the automation of training large models, embodying a sophisticated blend of theoretical rigor and practical applicability in distributed AI systems.

Authors (6)
  1. Yuliang Liu
  2. Shenggui Li
  3. Jiarui Fang
  4. Yanjun Shao
  5. Boyuan Yao
  6. Yang You