Principled Architecture-aware Scaling of Hyperparameters (2402.17440v1)
Abstract: Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize hyperparameters or to design principles for choosing them, so that they generalize to diverse unseen scenarios. However, most designs or optimization methods are agnostic to the choice of network structure, and thus largely ignore the impact of neural architectures on hyperparameters. In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture, including the network depth, width, convolutional kernel size, and connectivity patterns. By requiring every parameter to be maximally updated, i.e., to induce the same mean squared change in pre-activations, we can generalize our initialization and learning rates across MLPs (multi-layer perceptrons) and CNNs (convolutional neural networks) with sophisticated graph topologies. We verify our principles with comprehensive experiments. More importantly, our strategy further sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings, yet we demonstrate that these rankings can easily change when the networks in benchmarks are trained better with our architecture-aware learning rates and initializations.
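The core idea of the abstract, scaling each layer's initialization and learning rate with the architecture so that every parameter's update induces a comparable mean squared change in pre-activations, can be illustrated with a minimal sketch. The sketch below is not the paper's derived rules: the `build_mlp` helper, the `base_lr` value, and the `1/fan_in` variance and learning-rate factors are illustrative placeholder choices in the spirit of maximal-update-style, width-aware scaling for a plain MLP.

```python
# Illustrative sketch (NOT the paper's exact scaling rules): per-layer,
# architecture-aware initialization and learning rates for an MLP.
import math
import torch
import torch.nn as nn

def build_mlp(widths, base_lr=0.1):
    """Build an MLP and per-layer optimizer parameter groups (placeholder rules)."""
    layers, param_groups = [], []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        linear = nn.Linear(fan_in, fan_out)
        # Width-aware init: variance 1/fan_in keeps pre-activation scale comparable
        # across layers of different widths (assumed rule for illustration).
        nn.init.normal_(linear.weight, mean=0.0, std=1.0 / math.sqrt(fan_in))
        nn.init.zeros_(linear.bias)
        # Width-aware learning rate: wider layers take smaller steps so each update
        # induces a similar mean-squared change in pre-activations (assumed rule).
        param_groups.append({"params": linear.parameters(), "lr": base_lr / fan_in})
        layers += [linear, nn.ReLU()]
    model = nn.Sequential(*layers[:-1])  # drop the trailing ReLU after the output layer
    return model, param_groups

model, groups = build_mlp([784, 512, 256, 10])
# Each parameter group carries its own learning rate, overriding the default here.
optimizer = torch.optim.SGD(groups, lr=0.1)
```

The same pattern of per-layer parameter groups would extend to CNNs by folding the kernel size into the fan-in, though the exact depth- and topology-dependent factors are what the paper derives.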