Understanding and correcting pathologies in the training of learned optimizers (1810.10180v5)

Published 24 Oct 2018 in cs.NE and stat.ML

Abstract: Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks. Analogously, this suggests that learned optimizers may similarly outperform current hand-designed optimizers, especially for specific problems. However, learned optimizers are notoriously difficult to train and have yet to demonstrate wall-clock speedups over hand-designed optimizers, and thus are rarely used in practice. Typically, learned optimizers are trained by truncated backpropagation through an unrolled optimization process resulting in gradients that are either strongly biased (for short truncations) or have exploding norm (for long truncations). In this work we propose a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance, allowing us to train neural networks to perform optimization of a specific task faster than tuned first-order methods. We demonstrate these results on problems where our learned optimizer trains convolutional networks faster in wall-clock time compared to tuned first-order methods and with an improvement in test loss.

Citations (143)

Summary

  • The paper identifies significant training issues, including biases from truncated backpropagation and exploding gradients during long unrolls.
  • The paper proposes a dynamic weighting scheme that combines reparameterization and evolutionary strategies to deliver unbiased and stable gradient estimates.
  • The paper demonstrates that learned optimizers can outperform tuned first-order methods on convolutional networks in both training speed and test loss.

Overview of "Understanding and correcting pathologies in the training of learned optimizers"

The paper "Understanding and correcting pathologies in the training of learned optimizers," authored by Luke Metz et al., from Google Brain, explores the complexities associated with training learned optimizers. The paper reinforces the idea that learned optimization functions can potentially surpass hand-designed optimizers by demonstrating task-specific enhancements. However, the inherent difficulties in training these optimizers mean that they seldom provide practical advantages over contemporary first-order methods, such as SGD and ADAM, in wall-clock time scenarios.

Key Contributions

  • Training Challenges: The authors identify two significant hurdles in training learned optimizers: strong bias from truncated backpropagation and exploding gradient norms from long unrolls. These issues have hindered the practical adoption of learned optimizers (a minimal sketch of the unrolled setup follows this list).
  • Proposed Solution: They introduce a dynamic weighting scheme for two unbiased gradient estimators tailored for a variational loss on optimizer performance. This solution mitigates both aforementioned issues, facilitating stable and efficient training of learned optimizers.
  • Empirical Results: Using convolutional networks as a test bed, the learned optimizer trains faster in wall-clock time than tuned first-order methods and reaches a lower test loss.
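
To make the bi-level, unrolled setup concrete, here is a minimal sketch in JAX of backpropagating an outer loss through an unrolled inner optimization. The toy quadratic inner task and the single log step size standing in for the learned optimizer are illustrative assumptions, not the paper's code.

```python
# Minimal sketch (assumed toy setup, not the paper's implementation):
# backpropagating an outer loss through K unrolled inner-optimization steps.
import jax
import jax.numpy as jnp

def inner_loss(w):
    # Toy inner task: a simple quadratic in the task parameters w.
    return 0.5 * jnp.sum(w ** 2)

def outer_loss(theta, w0, num_steps):
    # theta parameterizes the "learned" update rule; here it is just a
    # log step size, standing in for a small neural network.
    lr = jnp.exp(theta)
    w = w0
    for _ in range(num_steps):                  # unrolled inner loop
        w = w - lr * jax.grad(inner_loss)(w)    # "learned" update rule
    return inner_loss(w)                        # outer (meta-training) objective

theta0 = jnp.array(0.0)
w0 = jnp.ones(10)

# Gradient of the outer loss with respect to the optimizer parameter,
# obtained by backpropagating through the entire unroll.
meta_grad = jax.grad(outer_loss)(theta0, w0, 20)
print(meta_grad)
```

Backpropagating through only a few inner steps yields a biased, short-horizon training signal, while backpropagating through many steps lets the gradient norm grow rapidly; this is the trade-off the proposed estimator is designed to avoid.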

Technical Approach

The researchers address the training impediments through several noteworthy strategies:

  1. Unrolled Optimization: Training the learned optimizer is framed as a bi-level optimization problem with distinct inner and outer loops. The inner loop iteratively updates the target task's weights using the learned update rule, and the outer loop updates the optimizer's own parameters by backpropagating through the unrolled inner optimization (as in the sketch above).
  2. Gradient Estimation: Two unbiased estimators of the outer gradient are used: a reparameterization-based (pathwise) gradient and an evolutionary-strategies (ES) estimator. The ES estimator needs only function evaluations of the unrolled loss, so it remains usable even when backpropagation through long unrolls produces exploding gradients.
  3. Variational Optimization: A variational framework smooths the outer-objective landscape. Both estimators target the gradient of this smoothed objective, and they are merged via inverse-variance weighting so that the combined estimate remains unbiased while its variance is kept low (a minimal sketch follows this list).
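
The sketch below illustrates how such a merged estimator can be formed, taking the variational distribution to be an isotropic Gaussian over the optimizer parameters. The antithetic ES sampling, the diagonal empirical variance estimates, and the stand-in outer_loss are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch (assumed form) of a merged gradient estimator for the
# Gaussian-smoothed outer objective
#   L_sigma(theta) = E_{eps ~ N(0, sigma^2 I)}[ L(theta + eps) ].
# Both estimators below are unbiased for grad L_sigma; they are combined
# with inverse-variance weights estimated from the per-sample gradients.
import jax
import jax.numpy as jnp

def outer_loss(theta):
    # Stand-in outer objective; in the paper this is the loss of the inner
    # task after an unrolled run of the learned optimizer.
    return jnp.sum(jnp.cos(theta) + 0.1 * theta ** 2)

def reparam_grads(key, theta, sigma, n):
    # Reparameterization ("pathwise") estimator: backpropagate through
    # theta + sigma * eps for each sampled perturbation.
    eps = jax.random.normal(key, (n,) + theta.shape)
    return jax.vmap(lambda e: jax.grad(outer_loss)(theta + sigma * e))(eps)

def es_grads(key, theta, sigma, n):
    # Evolution-strategies estimator with antithetic sampling; it uses only
    # function evaluations, so it never backpropagates through the unroll.
    eps = jax.random.normal(key, (n,) + theta.shape)
    f_plus = jax.vmap(lambda e: outer_loss(theta + sigma * e))(eps)
    f_minus = jax.vmap(lambda e: outer_loss(theta - sigma * e))(eps)
    return ((f_plus - f_minus) / (2.0 * sigma))[:, None] * eps

def merged_grad(key, theta, sigma=0.01, n=64):
    k1, k2 = jax.random.split(key)
    g_rp = reparam_grads(k1, theta, sigma, n)
    g_es = es_grads(k2, theta, sigma, n)
    # Empirical (diagonal) variances of each estimator's sample mean.
    var_rp = jnp.var(g_rp, axis=0) / n
    var_es = jnp.var(g_es, axis=0) / n
    w = var_es / (var_rp + var_es + 1e-12)   # inverse-variance weighting
    return w * g_rp.mean(axis=0) + (1.0 - w) * g_es.mean(axis=0)

theta = jnp.zeros(5)
print(merged_grad(jax.random.PRNGKey(0), theta))
```

When backpropagation through the unroll is well behaved, the reparameterization gradient has low variance and dominates the weighting; when it explodes, its empirical variance grows and the weight shifts automatically toward the ES estimate.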

Experimental Setup

The experiments train convolutional neural networks on subsets of the ImageNet dataset. The learned optimizer itself is a feed-forward network, chosen for simplicity and computational efficiency. Meta-training follows a curriculum that varies the unroll length over the course of training, which improves effectiveness under varying task conditions (one illustrative schedule is sketched below).
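
One illustrative way to vary the unroll length is a schedule whose upper bound grows with meta-training progress; the linear growth, the function name, and the specific lengths below are assumptions for the sketch, not the paper's exact curriculum.

```python
# Illustrative sketch (assumed schedule, not the paper's exact curriculum):
# sample progressively longer truncated unrolls as outer training proceeds.
import jax
import jax.numpy as jnp

def sample_unroll_length(key, outer_step, total_outer_steps,
                         min_len=10, max_len=200):
    # The upper bound on the truncation length grows linearly with progress.
    progress = outer_step / total_outer_steps
    cap = min_len + (max_len - min_len) * progress
    # Sample a length uniformly between the minimum and the current cap.
    u = jax.random.uniform(key)
    return jnp.round(min_len + u * (cap - min_len)).astype(jnp.int32)

key = jax.random.PRNGKey(0)
for step in (0, 2500, 5000, 7500, 9999):
    key, sub = jax.random.split(key)
    print(step, int(sample_unroll_length(sub, step, 10_000)))
```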

Implications and Future Outlook

The paper paves the way for robust trainable optimizers, crucial for domain-specific applications in machine learning. By overcoming fundamental training challenges, the learned optimizers hold potential for significant advancements in AI, where optimization efficiency is paramount.

Future developments could examine broader generalization across diverse tasks and inner-model architectures. There is also interest in exploring how well learned optimizers transfer to different problem domains, potentially yielding new insights into problem-specific structure that existing hand-designed optimizers do not exploit.

In conclusion, the insights presented in the paper align with ongoing efforts to make machine learning training more efficient, highlighting the potential of learned optimizers to improve outcomes on specific tasks.
