- The paper proposes LAWA (LAtest Weight Averaging), a weight averaging method that accelerates convergence by averaging the k most recent checkpoints during training.
- LAWA reduced training time by roughly 68 GPU hours for ResNet50 on ImageNet and roughly 30 GPU hours for RoBERTa-Base pretraining.
- The method slots into existing training loops with minimal changes, offering a practical way to train deep learning models faster without sacrificing performance.
An Analysis of "Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging"
The paper "Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging" presents LAtest Weight Averaging (LAWA), a method designed to accelerate convergence while remaining easy to integrate into existing training pipelines. By applying weight averaging strategically during the middle phase of training, the work demonstrates substantial time savings when training ResNet50 on ImageNet and RoBERTa-Base on WikiText-103.
Key Contributions and Methodology
The core contribution of this paper is a revisiting of weight averaging with a focus on convergence speed rather than solely on generalization. This diverges from traditional practice, where weight averaging is applied predominantly at the end of training, or after convergence, to improve generalization. Instead, LAWA averages the weights of the k most recent checkpoints over the course of training, rather than maintaining a cumulative moving average over an extended window.
This design is grounded in the observation that large parameter updates occur early in training, while updates become smaller and more consistent in later phases. By averaging only the latest weights during this stable middle phase, LAWA shortens training time without requiring significant changes to existing training loops beyond the addition of a checkpoint queue, as illustrated by the paper's pseudocode.
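The paper's pseudocode is not reproduced here, but the checkpoint-queue mechanism can be sketched in a few lines of PyTorch. The following is a minimal illustration under assumptions made for clarity: the function name `train_with_lawa`, the per-epoch checkpointing cadence, and the yielded evaluation model are choices of this sketch, not details taken from the paper.

```python
from collections import deque
import copy

import torch


def train_with_lawa(model, optimizer, train_loader, loss_fn, num_epochs, k=6):
    """Train normally, but keep the k most recent epoch-end checkpoints in a
    fixed-length queue and yield a model built from their uniform average."""
    checkpoints = deque(maxlen=k)  # the oldest checkpoint is dropped automatically

    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()

        # Snapshot the raw (non-averaged) weights at the end of the epoch.
        checkpoints.append(copy.deepcopy(model.state_dict()))

        if len(checkpoints) == k:
            # Uniformly average the latest k checkpoints, parameter by parameter.
            avg_state = {}
            for name, tensor in checkpoints[-1].items():
                if tensor.is_floating_point():
                    avg_state[name] = torch.stack(
                        [ckpt[name] for ckpt in checkpoints]
                    ).mean(dim=0)
                else:
                    # Integer buffers (e.g. BatchNorm step counters) are copied,
                    # not averaged.
                    avg_state[name] = tensor.clone()

            averaged_model = copy.deepcopy(model)
            averaged_model.load_state_dict(avg_state)
            # The averaged model is used only for evaluation; `model` itself
            # keeps training on its raw weights.
            yield epoch, averaged_model
```

Because the averaged weights never feed back into the optimizer state, the only extra cost in this sketch is storing k state dicts, which is consistent with the paper's claim that the method integrates cheaply.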
Experimental Findings
The empirical evaluation covers ImageNet 1000-class classification with ResNet50 and masked language modeling with RoBERTa-Base. In the ImageNet experiments, LAWA reached a given validation accuracy in substantially fewer epochs, translating to savings of approximately 68 GPU hours. Similarly, RoBERTa-Base training required roughly 30 fewer GPU hours while achieving comparable or better validation loss than the baseline.
Ablations show that LAWA is relatively robust to the choice of k, with k = 6 working well across the reported tasks. However, averaging over too many checkpoints (i.e., a large k) degrades performance, so this hyperparameter still warrants some care.
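If one wanted to probe this sensitivity directly, a sweep over k could reuse the `train_with_lawa` sketch above; the factory and evaluation callables here are placeholders for illustration, not functions from the paper or any library.

```python
def sweep_k(make_model_and_optimizer, train_loader, loss_fn, evaluate,
            num_epochs, ks=(2, 4, 6, 12)):
    """Train once per candidate k and record the best score reached by the
    averaged model, to compare sensitivity to the queue length."""
    results = {}
    for k in ks:
        model, optimizer = make_model_and_optimizer()  # fresh run for each k
        best_score = float("-inf")
        for _epoch, averaged_model in train_with_lawa(
            model, optimizer, train_loader, loss_fn, num_epochs, k=k
        ):
            best_score = max(best_score, evaluate(averaged_model))
        results[k] = best_score
    return results
```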
Implications and Future Directions
LAWA offers an accessible and efficient way to speed up deep learning training, which could help democratize research by easing computational resource constraints. Beyond the practical gains, it invites further exploration of dynamic checkpoint scheduling, integration with optimizer strategies such as Sharpness-Aware Minimization (SAM), and hyperparameter tuning in varied training settings.
The possibility of resuming training from a LAWA-averaged model state is intriguing, although it raises challenges such as recalibrating the learning rate schedule after averaging. Future work might refine such procedures or explore schedules for k itself to extend LAWA's adaptivity.
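As a purely hypothetical illustration of that direction (nothing in the paper prescribes this), resuming from an averaged state could pair the loaded weights with a brief learning-rate re-warmup; the optimizer choice, warmup length, and factors below are arbitrary assumptions of this sketch.

```python
import torch


def resume_from_average(model, avg_state_dict, base_lr=0.1, warmup_steps=500):
    """Load averaged weights and re-warm the learning rate before continuing.

    The linear ramp from 10% of base_lr back to base_lr is meant to avoid a
    large first step immediately after switching to the averaged weights."""
    model.load_state_dict(avg_state_dict)
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps
    )
    return optimizer, scheduler
```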
Equally important is characterizing the conditions under which LAWA falls short, which would clarify how well it transfers to other architectures and domains.
Conclusion
In conclusion, LAWA is a simple and effective method for accelerating neural network training, improving the trade-off between accuracy and training time for established architectures. By applying weight averaging in this targeted way, the deep learning community can train models more efficiently without compromising performance, supporting rapid experimentation and agile research.