Dive into Big Model Training (2207.11912v1)

Published 25 Jul 2022 in cs.LG and cs.DC

Abstract: The increasing scale of model size and continuous improvement of performance herald the arrival of the Big Model era. In this report, we explore what and how the big model training works by diving into training objectives and training methodologies. Specifically, training objectives describe how to leverage web-scale data to develop extremely capable and incredibly large models based on self-supervised learning, and training methodologies which are based on distributed training describe how to make big model training a reality. We summarize the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design. Training parallelism can be categorized into data, pipeline, and tensor parallelism according to the dimension of parallelism that takes place. Memory-saving technologies are orthogonal and complementary to training parallelism. And model sparsity design further scales up the model size with a constant computational cost. A continuously updated paper list of big model training is provided at https://github.com/qhliu26/BM-Training.

Summary

  • The paper introduces comprehensive strategies for big model training by leveraging self-supervised learning and categorizing parallelism methods.
  • The study details methodologies such as data, pipeline, and tensor parallelism alongside memory-saving techniques like activation checkpointing and ZeRO.
  • The paper highlights the widening gap between model resource demands and hardware capability and proposes future co-design across algorithms, software, and hardware to address it.

Overview of "Dive into Big Model Training"

The paper "Dive into Big Model Training" by Qinghua Liu and Yuxiang Jiang offers a comprehensive examination of the methodologies and objectives involved in training large-scale machine learning models, commonly referred to as Big Models. Through a systematic exploration, the authors categorize the prevalent techniques and highlight the challenges and considerations for this burgeoning area in deep learning.

Training Objectives and Methodologies

The paper begins by discussing the advent of the Big Model era, which the authors attribute to the confluence of ever-increasing model sizes and advances in self-supervised learning (SSL). SSL allows models to leverage vast amounts of unannotated data to learn representations that transfer to numerous downstream tasks, sidestepping the labeling bottleneck that limits the scalability of traditional supervised learning. The authors note the efficacy of SSL across multiple domains, reflecting its transformative impact on model training and architecture design.

Crucially, the authors categorize big model training strategies into three primary areas: training parallelism, memory-saving technologies, and model sparsity design. Each of these facets plays a pivotal role in overcoming the computational and memory bottlenecks inherent in big model training.

Categories of Training Parallelism

Training parallelism is subdivided into data, pipeline, and tensor parallelism:

  1. Data Parallelism (DP) replicates the model and distributes the data across multiple devices or nodes; each replica computes gradients on its own subset of the data before the updates are synchronized (see the data-parallel sketch after this list).
  2. Pipeline Parallelism (PP) splits the model itself across devices, with each device responsible for a contiguous stage of the sequential computation. By overlapping the forward and backward passes of multiple micro-batches, pipeline parallelism reduces the idle "bubbles" in which devices wait on one another.
  3. Tensor Parallelism (TP) partitions individual weight matrices or tensors across devices, enabling finer-grained parallel computation within a single layer; it is commonly applied to the attention and feed-forward blocks of Transformer models.
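To make the first of these concrete, the following is a minimal data-parallel training sketch in PyTorch. It is an illustration rather than code from the paper: the linear model, random dataset, and hyperparameters are placeholders, and it assumes a multi-GPU node launched with `torchrun` so that `DistributedDataParallel` can all-reduce gradients across processes.

```python
# Minimal data-parallel sketch (illustrative; not from the paper).
# Assumes launch via: torchrun --nproc_per_node=<num_gpus> dp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")              # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])            # set by torchrun
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = DDP(torch.nn.Linear(1024, 10).to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Placeholder dataset; DistributedSampler gives each rank a disjoint shard.
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                           # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()                                # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Pipeline and tensor parallelism require splitting the model itself and are typically provided by frameworks rather than written by hand, so they are not sketched here.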

Memory-Saving Technologies

Given the constrained memory capacity of conventional GPUs, memory-saving techniques become indispensable. The authors detail methods including:

  • Activation Checkpointing, which trades extra computation for reduced memory by storing only a subset of intermediate activations during the forward pass and recomputing the rest during the backward pass (a minimal sketch follows this list);
  • Mixed Precision Training, which reduces memory usage and speeds up computation by performing most operations in lower-precision formats such as FP16 while keeping numerically sensitive values in full precision;
  • Zero Redundancy Optimizer (ZeRO), which partitions model states (optimizer states, gradients, and parameters) across data-parallel devices so that each device holds only a slice, eliminating redundant copies.
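A minimal sketch of the first two techniques is given below, using PyTorch's built-in checkpointing and automatic mixed precision utilities. The eight-layer model and tensor shapes are illustrative placeholders, a CUDA device is assumed, and ZeRO-style partitioning (typically supplied by a library such as DeepSpeed) is not shown.

```python
# Activation checkpointing + mixed precision sketch (illustrative; assumes a GPU).
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)]).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # scales the loss so FP16 gradients do not underflow

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

with torch.cuda.amp.autocast():               # run eligible ops in reduced precision
    # Split the model into 4 segments; only segment boundaries keep their activations,
    # and the rest are recomputed during the backward pass.
    out = checkpoint_sequential(model, 4, x, use_reentrant=False)
    loss = torch.nn.functional.mse_loss(out, target)

scaler.scale(loss).backward()                 # backward recomputes checkpointed activations
scaler.step(optimizer)
scaler.update()
```

The trade-off is explicit: roughly one extra forward pass of computation per step in exchange for storing only the activations at segment boundaries.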

Model Sparsity Design

The concept of model sparsity is exemplified by Mixture-of-Experts (MoE) architectures, which route each input to a small subset of sub-models, or "experts." Because only the selected experts are activated per input, the total parameter count can grow dramatically, up to trillions of parameters, while the computation per input stays roughly constant.
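The routing idea can be illustrated with a toy top-1 MoE layer in PyTorch. This is a simplified sketch rather than the design of any particular system discussed in the paper: the expert architecture and gating are placeholders, and load-balancing losses and expert-parallel communication are omitted.

```python
# Toy top-1 Mixture-of-Experts layer (illustrative simplification).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # learned gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weight, expert_idx = gate.max(dim=-1)           # top-1 routing decision
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                              # only routed tokens reach expert i
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

layer = Top1MoE(d_model=64, num_experts=4)
print(layer(torch.randn(16, 64)).shape)                 # torch.Size([16, 64])
```

Because each token activates a single expert, adding experts multiplies the parameter count without increasing the FLOPs spent per token, which is the property that lets MoE models reach trillion-parameter scale at roughly constant training cost.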

Challenges and Future Directions

The paper concludes by acknowledging the systemic challenges associated with Big Model training, particularly the disproportionate growth of model resource demands relative to hardware advancements. The authors suggest that future advancements will likely necessitate co-design efforts across algorithms, software, and hardware. Additionally, they reflect on the societal implications of Big Models, particularly concerning biases that might be encoded or amplified in models trained on uncontrolled and unfiltered datasets.

In summary, this paper presents a meticulous and systematic overview of Big Model training, addressing both the methodological underpinnings and emerging challenges. The insights provided are poised to guide future research and practical implementations in the field of large-scale model training, pushing the boundaries of what is feasible in machine learning.
