Automatic Gradient Descent: Deep Learning without Hyperparameters (2304.05187v1)

Published 11 Apr 2023 in cs.LG, cs.AI, cs.NA, cs.NE, math.NA, and stat.ML

Abstract: The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology. Existing optimisation frameworks neglect this information in favour of implicit architectural information (e.g. second-order methods) or architecture-agnostic distance functions (e.g. mirror descent). Meanwhile, the most popular optimiser in practice, Adam, is based on heuristics. This paper builds a new framework for deriving optimisation algorithms that explicitly leverage neural architecture. The theory extends mirror descent to non-convex composite objective functions: the idea is to transform a Bregman divergence to account for the non-linear structure of neural architecture. Working through the details for deep fully-connected networks yields automatic gradient descent: a first-order optimiser without any hyperparameters. Automatic gradient descent trains both fully-connected and convolutional networks out-of-the-box and at ImageNet scale. A PyTorch implementation is available at https://github.com/jxbz/agd and also in Appendix B. Overall, the paper supplies a rigorous theoretical foundation for a next-generation of architecture-dependent optimisers that work automatically and without hyperparameters.

Citations (17)

Summary

  • The paper introduces AGD as a novel optimizer that leverages neural architecture to eliminate hyperparameter tuning.
  • It extends mirror descent using a Bregman divergence framework to handle non-convex composite objectives in neural networks.
  • AGD demonstrates robust, scalable performance by training ResNet-50 on ImageNet to a 65.5% top-1 accuracy without conventional tuning.

Insights into Architecture-Aware Optimization: Automatic Gradient Descent

The paper "Automatic Gradient Descent: Deep Learning without Hyperparameters" presents a novel approach to optimization in deep learning that eschews traditional hyperparameter tuning by directly leveraging the architecture of neural networks. This work introduces a framework for deriving optimization algorithms that explicitly consider neural architecture, thereby addressing the limitations of existing methods that rely on implicit or architecture-agnostic principles.

Theoretical Foundation

The authors extend the concept of mirror descent to accommodate non-convex composite objective functions specific to neural networks. By transforming a Bregman divergence to account for the nonlinear structure within these networks, the authors propose "automatic gradient descent" (AGD), a first-order optimizer that operates without hyperparameters. This offers a distinctive advantage over widely used optimizers, such as Adam, which are heuristic-based and often require extensive hyperparameter tuning.
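For orientation, classical mirror descent replaces the Euclidean proximity term of gradient descent with a Bregman divergence induced by a convex potential; AGD's departure is to build the divergence from the network architecture itself rather than from an architecture-agnostic potential. The display below records only the standard mirror descent update as a point of reference, not the paper's own derivation.

```latex
% Classical mirror descent: step size \eta, convex potential \phi.
\[
  w_{t+1} \;=\; \arg\min_{w}\;
    \langle \nabla L(w_t),\, w - w_t \rangle
    \;+\; \tfrac{1}{\eta}\, D_\phi(w, w_t),
  \qquad
  D_\phi(w, w_t) \;=\; \phi(w) - \phi(w_t) - \langle \nabla\phi(w_t),\, w - w_t \rangle .
\]
```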

Key Contributions

One of the paper’s central contributions is the derivation of AGD through a systematic application of the majorise-minimise meta-algorithm. The innovation here lies in the integration of two key tools:

  1. Bregman Divergence: Characterizes the interaction between the neural network and the loss function, enabling a functional expansion of the composite objective.
  2. Deep Relative Trust: A perturbation bound that captures the non-linear interaction between weight perturbations and network outputs, underpinning an architecture-dependent majorization.

The theoretical framework emphasizes architectural perturbation bounds and functional majorization, offering a pathway to a truly architecture-aware optimizer.
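For readers unfamiliar with the meta-algorithm, the generic majorise-minimise recipe is recorded below; the paper's specific majorant, built from the Bregman-divergence expansion of the loss and the deep-relative-trust perturbation bounds, is not reproduced here.

```latex
% Majorise-minimise: build an upper bound M_t that touches L at the current
% iterate w_t, then minimise the bound instead of the objective.
\[
  M_t(w) \,\ge\, L(w) \ \ \forall w,
  \qquad
  M_t(w_t) \,=\, L(w_t),
  \qquad
  w_{t+1} \,=\, \arg\min_{w} M_t(w).
\]
% Monotone descent follows: L(w_{t+1}) \le M_t(w_{t+1}) \le M_t(w_t) = L(w_t).
```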

Numerical Results

The numerical results highlight AGD's robustness across different network architectures and scales. For instance, AGD successfully trains networks where the default settings of Adam and SGD fail. Furthermore, it scales efficiently to larger datasets such as ImageNet and shows competitive performance without the need for hyperparameter tuning. Notably, AGD trains ResNet-50 on ImageNet to a top-1 test accuracy of 65.5%, showcasing its efficacy at scale.
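To make the "out-of-the-box" flavor concrete, the following is a minimal PyTorch sketch of an update whose per-layer scale is derived from layer shapes, network depth, and gradient norms rather than from a tuned learning rate. It is an assumption-laden illustration, not the paper's exact algorithm nor the API of the published repository: the function name architecture_aware_step and the specific scaling rule are hypothetical stand-ins, and the reference implementation is the one at https://github.com/jxbz/agd (and Appendix B of the paper).

```python
# Illustrative sketch only: a layerwise update whose scale comes from the
# architecture (layer dimensions, depth) and the gradient itself, so no
# learning-rate hyperparameter is exposed. The scaling below is hypothetical.
import math
import torch
import torch.nn as nn

def architecture_aware_step(model: nn.Sequential) -> None:
    """Apply one hyperparameter-free update to every Linear layer's weight."""
    layers = [m for m in model if isinstance(m, nn.Linear)]
    depth = len(layers)
    # Summarize gradient size across layers, weighted by layer shape.
    grad_summary = sum(
        math.sqrt(m.out_features / m.in_features) * m.weight.grad.norm().item()
        for m in layers
    ) / depth
    # Automatic step size: grows sublinearly with the gradient summary
    # (a hypothetical choice standing in for the paper's rule).
    eta = math.log(0.5 * (1.0 + math.sqrt(1.0 + 4.0 * grad_summary)))
    with torch.no_grad():
        for m in layers:
            g = m.weight.grad
            scale = math.sqrt(m.out_features / m.in_features)
            # Normalized, dimension-scaled update shared equally across depth.
            m.weight -= (eta / depth) * scale * g / (g.norm() + 1e-12)

# Usage: forward, backward, then one architecture-aware step.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
architecture_aware_step(model)
```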

Implications and Future Prospects

Practically, AGD offers potential savings in the computational resources and time often spent on exhaustive hyperparameter search. This positions AGD as a valuable tool for enterprises with limited computational budgets, where efficiency is paramount. Theoretically, it challenges the community to rethink optimization strategies in deep learning through the lens of network architecture.

The implications of this research are profound, with potential applications extending beyond current experiments. It may influence future development in AI by promoting optimization that inherently adapts to varying network configurations. Moreover, this architecture-awareness could be a stepping stone toward more generalized automation in AI workflows.

Speculative Insights

Looking forward, extending AGD to broader neural architectures such as transformers could further solidify its standing. Integrating regularization techniques without introducing new hyperparameters presents an enticing challenge. Additionally, expanding upon operator perturbation theory could refine our understanding of how perturbations propagate through compound non-linear operators.

This paper is a significant step towards realizing optimizers that are inherently aligned with the structural nuances of deep learning models. By diminishing the reliance on hyperparameters, it not only simplifies the model training process but also contributes to the broader understanding of optimization dynamics in neural networks.