Automatic Graph Partitioning for Very Large-scale Deep Learning (2103.16063v1)

Published 30 Mar 2021 in cs.LG and cs.DC

Abstract: This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such models do not fit into the memory of accelerator devices, they need to be partitioned by model parallelism techniques. Moreover, to accelerate training for huge training data, we need a combination of model and data parallelisms, i.e., hybrid parallelism. Given a model description for PyTorch without any specification for model parallelism, RaNNC automatically partitions the model into a set of subcomponents so that (1) each subcomponent fits a device memory and (2) a high training throughput for pipeline parallelism is achieved by balancing the computation times of the subcomponents. In our experiments, we compared RaNNC with two popular frameworks, Megatron-LM (hybrid parallelism) and GPipe (originally proposed for model parallelism, but a version allowing hybrid parallelism also exists), for training models with increasingly greater numbers of parameters. In the pre-training of enlarged BERT models, RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models. RaNNC also achieved better training throughputs than GPipe on both the enlarged BERT model pre-training (GPipe with hybrid parallelism) and the enlarged ResNet models (GPipe with model parallelism) in all of the settings we tried. These results are remarkable, since RaNNC automatically partitions models without any modification to their descriptions; Megatron-LM and GPipe require users to manually rewrite the models' descriptions.

Authors (4)
  1. Masahiro Tanaka (39 papers)
  2. Kenjiro Taura (9 papers)
  3. Toshihiro Hanawa (3 papers)
  4. Kentaro Torisawa (1 paper)
Citations (17)

Summary

Automatic Graph Partitioning for Very Large-scale Deep Learning

The paper, Automatic Graph Partitioning for Very Large-scale Deep Learning, introduces RaNNC (Rapid Neural Network Connector), a middleware solution that automates model partitioning for large neural networks. As models such as T5 and GPT-3 continue to grow, partitioning is required both to fit them within the memory of individual accelerator devices and to sustain efficient training throughput. RaNNC provides automatic hybrid parallelism, combining model and data parallelism. Unlike prevailing approaches, which require users to manually modify their model descriptions, RaNNC partitions models automatically, eliminating that manual effort.
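
The intended workflow can be illustrated with a short sketch: the model below is ordinary PyTorch code with no parallelism annotations, and the middleware is asked to partition it across devices. The wrapper name `pyrannc.RaNNCModule` follows RaNNC's public Python package, but the exact signature and call pattern shown here are assumptions for illustration, not the paper's verified API.

```python
# Hedged illustration: a plain PyTorch model handed to RaNNC for automatic
# partitioning. `pyrannc.RaNNCModule` and its signature are assumptions.
import torch
import torch.nn as nn
import pyrannc  # assumption: RaNNC's Python binding

# An ordinary PyTorch model description -- nothing here mentions partitioning.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# RaNNC traces the model, partitions it into subcomponents that fit device
# memory, and sets up hybrid (pipeline + data) parallelism automatically.
model = pyrannc.RaNNCModule(model, optimizer)  # assumed API

x = torch.randn(64, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```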

Partitioning Approach

RaNNC's partitioning strategy proceeds in three phases: atomic-level, block-level, and stage-level partitioning. First, the task graph of a given model is broken down into atomic subcomponents using heuristic rules; these atomic subcomponents serve as fine-grained building blocks for partitioning. Next, block-level partitioning coalesces atomic subcomponents into larger blocks, balancing computation times and reducing communication costs; this coalescing uses a k-way multilevel partitioning algorithm adapted from approaches used in load balancing. Finally, stage-level partitioning applies a dynamic programming algorithm that determines combinations of blocks to form stages, balancing the workload across the available devices (a simplified sketch of this stage-balancing step follows below).
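
As a rough illustration of the stage-level balancing idea (not the paper's actual implementation), the sketch below splits an ordered sequence of block computation times into a fixed number of contiguous pipeline stages so that the slowest stage is as fast as possible. The block times and stage count are illustrative inputs; RaNNC's real algorithm additionally accounts for device memory limits and communication costs.

```python
# Minimal sketch: balance contiguous blocks across pipeline stages by
# minimizing the maximum per-stage computation time (dynamic programming).
from functools import lru_cache


def balance_stages(block_times: list[float], num_stages: int) -> list[list[float]]:
    """Split `block_times` into `num_stages` contiguous groups, minimizing the
    maximum group sum (a proxy for the pipeline's bottleneck stage time)."""
    n = len(block_times)
    prefix = [0.0] * (n + 1)
    for i, t in enumerate(block_times):
        prefix[i + 1] = prefix[i] + t

    @lru_cache(maxsize=None)
    def best(i: int, k: int) -> tuple[float, int]:
        # Minimum bottleneck for blocks i..n-1 split into k stages, plus the
        # end index (exclusive) of the first stage achieving it.
        if k == 1:
            return prefix[n] - prefix[i], n
        best_cost, best_cut = float("inf"), i + 1
        for j in range(i + 1, n - k + 2):  # first stage = blocks i..j-1
            cost = max(prefix[j] - prefix[i], best(j, k - 1)[0])
            if cost < best_cost:
                best_cost, best_cut = cost, j
        return best_cost, best_cut

    stages, i = [], 0
    for k in range(num_stages, 0, -1):
        _, j = best(i, k)
        stages.append(block_times[i:j])
        i = j
    return stages


print(balance_stages([4.0, 1.0, 3.0, 2.0, 2.0], num_stages=3))
# e.g. [[4.0], [1.0, 3.0], [2.0, 2.0]] -> bottleneck stage time of 4.0
```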

Key Findings

RaNNC was experimentally compared with the Megatron-LM, GPipe, and PipeDream-2BW frameworks using neural network models scaled up from the BERT and ResNet architectures. The experiments highlight RaNNC's ability to partition and train models significantly larger than those Megatron-LM can handle: it successfully trained BERT models up to five times larger than Megatron-LM could, while maintaining comparable training throughput. RaNNC also outperformed GPipe in throughput across the tested configurations for both the BERT and ResNet models. Notably, RaNNC achieves this without requiring users to rewrite their model descriptions for partitioning, unlike Megatron-LM and GPipe.

Implications and Future Directions

The paper underscores the importance of automatic model partitioning as neural networks continue to grow in complexity and parameter count. RaNNC demonstrates this by streamlining the partitioning process without manual effort while optimizing for computation and memory efficiency. Furthermore, avoiding manual partitioning eliminates an error-prone and labor-intensive process, paving the way for broader applicability across diverse model architectures and tasks.

The results imply that RaNNC can democratize the usage of ultra-large neural networks by making them accessible to researchers and practitioners without specialized partitioning expertise. Practically, this could lead to a proliferation of large-scale model deployment in domains ranging from natural language processing to complex computer vision tasks.

In future developments, RaNNC may be extended to accommodate emerging neural architectures and to integrate more sophisticated heuristics and algorithms that further optimize performance. Such developments will be crucial as the demand for larger model capacities meets the constraints of existing hardware, calling for innovative solutions in model partitioning and parallel computation.

In summary, RaNNC is positioned as a vital middleware solution that simplifies model scaling through effective partitioning strategies, facilitating more extensive and efficient neural network training. Its automatic approach and performance outcomes set a precedent for future research and practical applications in large-scale deep learning.
