Benchmarking Neural Network Training Algorithms (2306.07179v2)

Published 12 Jun 2023 in cs.LG and stat.ML

Abstract: Training algorithms, broadly construed, are an essential part of every deep learning pipeline. Training algorithm improvements that speed up training across a wide variety of workloads (e.g., better update rules, tuning protocols, learning rate schedules, or data selection schemes) could save time, save computational resources, and lead to better, more accurate, models. Unfortunately, as a community, we are currently unable to reliably identify training algorithm improvements, or even determine the state-of-the-art training algorithm. In this work, using concrete experiments, we argue that real progress in speeding up training requires new benchmarks that resolve three basic challenges faced by empirical comparisons of training algorithms: (1) how to decide when training is complete and precisely measure training time, (2) how to handle the sensitivity of measurements to exact workload details, and (3) how to fairly compare algorithms that require hyperparameter tuning. In order to address these challenges, we introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware, the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of workload variants that make it possible to detect benchmark submissions that are more robust to workload changes than current widely-used methods. Finally, we evaluate baseline submissions constructed using various optimizers that represent current practice, as well as other optimizers that have recently received attention in the literature. These baseline results collectively demonstrate the feasibility of our benchmark, show that non-trivial gaps between methods exist, and set a provisional state-of-the-art for future benchmark submissions to try and surpass.

Summary

  • The paper proposes a time-to-result benchmark that objectively evaluates neural network training algorithms across diverse workloads.
  • It presents extensive empirical results using eight baseline algorithms, highlighting the role of hyperparameter tuning and robust scoring methods.
  • The study emphasizes standardized hardware setups and rigorous methodologies to enhance reproducibility and guide efficient algorithm selection.

Benchmarking Neural Network Training Algorithms

The paper, "Benchmarking Neural Network Training Algorithms," authored by George E. Dahl et al., addresses a critical gap in the deep learning research community: a standardized, competitive benchmark for neural network training algorithms. Training algorithms are integral to the efficacy and efficiency of deep learning models, and yet, the lack of a uniform benchmark has hampered the ability to perform rigorous, reproducible comparisons. This work introduces a time-to-result benchmark that aims to objectively evaluate and compare training algorithms on a suite of diverse workloads, using fixed hardware.

Overview of Contributions

The paper makes several substantive contributions:

  1. Challenges of benchmarking training algorithms: The paper identifies three major challenges: deciding when training is complete and measuring training time precisely, the sensitivity of measurements to exact workload details, and the difficulty of fairly comparing algorithms with different hyperparameter tuning needs. Addressing these challenges, the authors argue, requires a standardized benchmark.
  2. New benchmark introduction: The authors propose a new benchmark that includes multiple, diverse workloads to reflect various deep learning applications. Each workload provides a specific model, dataset, and loss function. Workloads are divided into fixed workloads and randomized variants, with the latter designed to detect robustness in algorithm performance.
  3. Performance profiles and scoring methodology: The benchmark employs performance profiles to compare training speed across workloads (see the sketch following this list). Submissions are scored based on the median of several trials, focusing on both validation and test set performance to ensure practical relevance. The scoring system is carefully crafted to balance robustness and speed.
  4. Extensive baseline results: The paper presents detailed empirical results for eight baseline training algorithms, underscoring the importance of hyperparameter tuning and search spaces. The results provide a preliminary state of the art and showcase non-trivial performance gaps between algorithms, demonstrating the need for the proposed benchmark.
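
To make the performance-profile idea concrete, the sketch below computes profiles in the Dolan and Moré sense from a matrix of per-workload time-to-target results. It is a minimal illustration rather than the benchmark's actual scoring code: the submission and workload counts are invented, and runs that never reach the target are simply recorded as infinite time.

```python
import numpy as np

def performance_profiles(times, taus):
    """Compute performance-profile curves from a (submission x workload)
    matrix of time-to-target results.

    times[s, w] is the wall-clock time submission s needed to reach the
    target on workload w; runs that never hit the target are np.inf.
    Returns profiles[s, i] = fraction of workloads on which submission s
    is within a factor taus[i] of the fastest submission.
    """
    times = np.asarray(times, dtype=float)
    best = times.min(axis=0)            # fastest time per workload
    ratios = times / best               # performance ratios r_{s, w}
    return np.stack([(ratios <= tau).mean(axis=1) for tau in taus], axis=1)

# Illustrative numbers: 3 hypothetical submissions on 4 workloads (seconds).
times = [
    [100.0, 200.0, 150.0, np.inf],      # submission A misses one target
    [120.0, 180.0, 140.0, 300.0],
    [ 90.0, 260.0, 200.0, 310.0],
]
taus = np.linspace(1.0, 4.0, 7)
print(performance_profiles(times, taus))
```

Reading a profile at tau = 1 gives the fraction of workloads on which a submission is the fastest; larger tau values reward robustness across the workload suite.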

Methodological Rigor and Considerations

The methodology is rigorous and well-documented. Specifically:

  • Target-setting: The authors articulate a systematic procedure for setting validation and test targets based on the best achievable performance within a designated runtime. Multiple hyperparameters were tuned using quasirandom search to identify competitive baselines (a sampling sketch follows this list).
  • Handling sensitive details: Workload variants were carefully designed to be representative of natural changes that might occur in practice. These variants help deter overfitting to specific workloads.
  • Standardizing hardware: To ensure fair comparisons, the benchmark uses a standardized hardware configuration (8 GPUs with 16 GB of VRAM each), sidestepping variability in system performance.
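
As a rough illustration of quasirandom tuning, the sketch below draws hyperparameter points from a scrambled Halton sequence (via SciPy's qmc module) and maps them onto log-uniform ranges. The search space, sequence choice, and parameter names are assumptions for illustration; the paper's baselines define their own search spaces and tooling.

```python
import numpy as np
from scipy.stats import qmc  # low-discrepancy (quasirandom) sampling

# Hypothetical search space: log-uniform ranges for three Adam-style
# hyperparameters; the actual spaces used for the baselines differ.
space = {
    "learning_rate":   (1e-4, 1e-1),
    "one_minus_beta1": (1e-2, 0.5),
    "weight_decay":    (1e-4, 1e-1),
}

def quasirandom_trials(space, num_trials, seed=0):
    """Draw num_trials points from a scrambled Halton sequence and map
    each unit-cube coordinate onto a log-uniform hyperparameter range."""
    sampler = qmc.Halton(d=len(space), scramble=True, seed=seed)
    unit = sampler.random(num_trials)                  # shape (trials, dims)
    lows = np.log([lo for lo, _ in space.values()])
    highs = np.log([hi for _, hi in space.values()])
    points = np.exp(lows + unit * (highs - lows))      # log-uniform mapping
    return [dict(zip(space.keys(), row)) for row in points]

for trial in quasirandom_trials(space, num_trials=5):
    print(trial)
```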

The paper further discusses important considerations, including the necessity of explicit hyperparameter tuning protocols and the inherent workload sensitivity of different algorithms. The authors argue that tuning recommendations often seen in the literature are insufficient and that training-algorithm developers should provide guidance for a range of tuning-budget scenarios (a simulation sketch follows below).
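
The sketch below simulates one form such a budget-aware protocol can take: each independent tuning study draws a fixed number of trials, keeps the fastest trial to reach the target, and the per-workload result is the median across studies. The simulated trial pool, the 20-trial budget, and the five studies are illustrative assumptions, not the paper's exact rules.

```python
import numpy as np

def workload_score(times_to_target, trials_per_study, num_studies, rng):
    """Simulate a budget-aware tuning protocol for one workload: each study
    samples `trials_per_study` tuning trials, keeps the fastest one to reach
    the target, and the workload score is the median across studies."""
    per_study_best = []
    for _ in range(num_studies):
        study = rng.choice(times_to_target, size=trials_per_study, replace=False)
        per_study_best.append(study.min())   # best trial found within budget
    return float(np.median(per_study_best))  # robust summary across studies

rng = np.random.default_rng(0)
# Simulated pool of 200 tuned trials: ~30% reach the target, the rest do not.
pool = np.where(rng.random(200) < 0.3, rng.uniform(100.0, 400.0, 200), np.inf)
print(workload_score(pool, trials_per_study=20, num_studies=5, rng=rng))
```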

Implications and Future Developments

Practical Implications:

  • The introduction of this benchmark enables the community to make more informed decisions about which training algorithms may be most effective for specific applications.
  • By standardizing performance evaluation, the benchmark removes a significant barrier to reproducible research.
  • The benchmark can help practitioners save computational resources by highlighting which training algorithms are more efficient.

Theoretical Implications:

  • This work paves the way for more principled studies on the interaction between model architectures and optimizers.
  • It could stimulate interest in understanding the implicit regularization effects of different training algorithms.
  • The benchmark facilitates a better understanding of the trade-offs between training speed and final model performance.

Future Developments:

  • Including new workloads that reflect emerging application domains (e.g., LLMs, video understanding).
  • Introducing support for different hardware configurations to expand the benchmark’s applicability.
  • Extending the benchmark to incorporate self-supervised and unsupervised learning tasks could offer further insights.

In summary, this paper presents a significant step towards formalizing and standardizing the evaluation of neural network training algorithms. By addressing fundamental challenges, proposing a new benchmark framework, and demonstrating its utility through extensive empirical results, the authors provide a vital resource for accelerating progress in the field of deep learning.
