Automatic Graph Partitioning for Very Large-scale Deep Learning
The paper, Automatic Graph Partitioning for Very Large-scale Deep Learning, introduces RaNNC (Rapid Neural Network Connector), a middleware solution designed to facilitate model partitioning for large neural networks. As models such as T5 and GPT-3 continue to grow, partitioning is essential both to fit them within the memory limits of GPU devices and to achieve efficient training throughput. RaNNC proposes automatic hybrid parallelism that combines model and data parallelism. Unlike prevailing methods that require users to manually modify the model description, RaNNC partitions the model automatically, removing the need for manual intervention.
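As an illustration of what "no manual modification of the model description" means in practice, the sketch below shows an unmodified PyTorch model being handed to such middleware. The wrapper class name pyrannc.RaNNCModule follows the publicly released RaNNC implementation, but the exact signature and runtime setup (e.g., launching one process per GPU under MPI) should be treated as assumptions here, not as a verified recipe.

```python
import torch
import pyrannc  # RaNNC's Python package (assumed installed and configured)

# An ordinary PyTorch model: no partitioning annotations or device placement.
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.ReLU(),
            torch.nn.Linear(4096, 1024),
        )

    def forward(self, x):
        return self.layers(x)

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrapping the model is the only change to the training script: the middleware
# analyzes the computation graph, partitions it across the available GPUs, and
# combines model and data parallelism automatically.
# (Exact wrapper signature is an assumption; see the RaNNC documentation.)
model = pyrannc.RaNNCModule(model, optimizer)

# Training loop looks like plain PyTorch.
x = torch.randn(64, 1024)
loss = model(x).sum()
loss.backward()
optimizer.step()
```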
Partitioning Approach
RaNNC's partitioning strategy proceeds in three phases: atomic-level, block-level, and stage-level partitioning. First, the computation graph of a given model is decomposed into atomic subcomponents using heuristic rules; these atomic subcomponents serve as fine-grained building blocks for partitioning. Next, block-level partitioning coalesces atomic subcomponents into larger blocks, balancing computation times while reducing communication costs. This coalescing uses a k-way multilevel partitioning algorithm adapted from approaches used in load-balancing tasks. Finally, stage-level partitioning applies a dynamic programming algorithm that determines the best combination of blocks to form stages, ensuring a balanced workload across the available devices; a simplified sketch of such a dynamic program follows.
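To make the stage-level step concrete, here is a minimal sketch (not the paper's actual algorithm) of the kind of dynamic program involved: it splits an ordered sequence of blocks, each with an estimated computation time, into a fixed number of contiguous stages so that the slowest stage is as fast as possible. The real stage-level partitioning additionally weighs memory limits, communication costs, and data-parallel replication of stages; the pure time-balancing objective and the function names here are simplifying assumptions.

```python
from functools import lru_cache

def partition_blocks(block_times, num_stages):
    """Split an ordered list of block computation times into `num_stages`
    contiguous stages, minimizing the time of the slowest stage.

    Simplified sketch: memory limits, communication costs, and
    data-parallel replication of stages are ignored.
    """
    n = len(block_times)

    # prefix[i] = total time of blocks[0:i], so any contiguous stage time
    # can be read off in O(1).
    prefix = [0.0]
    for t in block_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Minimal achievable bottleneck time when blocks[i:] form k stages.
        if k == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        # Try every cut point j: blocks[i:j] become the current stage.
        for j in range(i + 1, n - k + 2):
            stage_time = prefix[j] - prefix[i]
            result = min(result, max(stage_time, best(j, k - 1)))
        return result

    return best(0, num_stages)

# Example: 6 blocks assigned to 3 pipeline stages.
# Optimal split is [2,3] | [1,4] | [2,2], so the bottleneck stage takes 5.0.
print(partition_blocks([2.0, 3.0, 1.0, 4.0, 2.0, 2.0], 3))  # -> 5.0
```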
Key Findings
RaNNC was experimentally compared with the Megatron-LM, GPipe, and PipeDream-2BW frameworks using neural network models scaled up from the BERT and ResNet architectures. The experiments highlight RaNNC's ability to partition and train models significantly larger than those Megatron-LM can handle: it successfully handled BERT models up to five times larger than those supported by Megatron-LM while maintaining comparable training throughput. RaNNC also outperformed GPipe in throughput across different configurations for both the enlarged BERT and ResNet models. Notably, RaNNC achieves this without requiring users to modify model descriptions for partitioning, unlike Megatron-LM and GPipe.
Implications and Future Directions
The paper underscores the importance of automatic model partitioning as neural networks continue to grow in complexity and parameter count. RaNNC addresses this need by streamlining the partitioning process without manual effort while optimizing for computation and memory efficiency. By avoiding manual partitioning, it also eliminates an error-prone and labor-intensive step, paving the way for broader applicability across diverse model architectures and tasks.
The results imply that RaNNC can democratize the use of ultra-large neural networks by making them accessible to researchers and practitioners without specialized partitioning expertise. In practice, this could lead to broader deployment of large-scale models in domains ranging from natural language processing to complex computer vision tasks.
In future work, RaNNC may be extended to accommodate emerging neural architectures and to integrate more sophisticated heuristics and algorithms that further optimize performance. Such developments will be crucial as the demand for higher model capacity meets the constraints of existing hardware, requiring innovative solutions for model partitioning and parallel computation.
In summary, RaNNC is positioned as a vital middleware solution that simplifies model scaling through effective partitioning strategies, facilitating more extensive and efficient neural network training. Its automatic approach and performance outcomes set a precedent for future research and practical applications in large-scale deep learning.