
Learned Optimizers that Scale and Generalize (1703.04813v4)

Published 14 Mar 2017 in cs.LG, cs.NE, and stat.ML

Abstract: Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on. We release an open source implementation of the meta-training algorithm.

Authors (7)
  1. Olga Wichrowska (1 paper)
  2. Niru Maheswaranathan (19 papers)
  3. Matthew W. Hoffman (14 papers)
  4. Sergio Gomez Colmenarejo (24 papers)
  5. Misha Denil (36 papers)
  6. Nando de Freitas (98 papers)
  7. Jascha Sohl-Dickstein (88 papers)
Citations (277)

Summary

  • The paper introduces a hierarchical RNN architecture that reduces computational overhead and effectively captures inter-parameter dependencies for scalable optimization.
  • The paper incorporates dynamic features such as momentum and multi-timescale schemes to mitigate noise and promote robust convergence.
  • The paper demonstrates that the learned optimizer generalizes to unseen tasks, matching or outperforming RMSProp and ADAM on small convolutional networks and scaling to train Inception V3 and ResNet V2 on ImageNet.

Learned Optimizers that Scale and Generalize: An In-depth Analysis

This paper introduces a novel approach to learned optimizers, a critical element in advancing artificial intelligence via the learning-to-learn paradigm. The proposed learned gradient descent optimizer aims to overcome two primary limitations of existing approaches: scalability to large problems and generalization to new tasks. The authors present a hierarchical RNN architecture designed to reduce memory and computational overhead while still generalizing to tasks far removed from those encountered during meta-training.

Key Contributions

  1. Hierarchical RNN Architecture: The proposed optimizer uses a multi-level RNN structure consisting of Parameter, Tensor, and Global RNNs. This design minimizes per-parameter overhead and captures inter-parameter dependencies, which are crucial for handling the curvature of loss surfaces. The architecture keeps computational cost low while maintaining communication across parameters and layers; a toy sketch of the hierarchy follows this list.
  2. Task-Specific Features: Inspired by the optimization literature, the authors build several features into the RNN: a cross between Nesterov momentum and an attention mechanism, dynamic input scaling similar to that used in RMSProp and ADAM, and momentum at multiple timescales to manage high-frequency oscillations and noise in stochastic gradients. The second code sketch after this list illustrates the scaling and multi-timescale momentum features.
  3. Comprehensive Meta-Training Ensemble: To ensure broad applicability, the authors curated an ensemble of small, diverse optimization tasks, capturing common loss landscape attributes and scenarios encountered in practical applications. This meta-training set played a critical role in training the RNN optimizer to generalize beyond typical problems seen during training.
  4. Improved Meta-Training Procedure: The authors also refine the meta-optimization process. They propose a meta-objective based on the average log loss that encourages precise convergence and dynamically adjusts the learning rate, and they sample the number of unrolled training steps from a heavy-tailed distribution to improve generalization to long training runs. A hedged formulation of this meta-objective is sketched after the list.
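To make the hierarchy in contribution 1 concrete, the following is a minimal sketch, assuming tiny tanh RNN cells, arbitrary hidden sizes, a raw-gradient input feature, and a toy quadratic task. It shows how a per-parameter RNN, a per-tensor RNN, and a global RNN could be wired together, with each level reading the mean hidden state of the level below and feeding a bias back down. It is not the authors' released implementation.

```python
"""Minimal sketch (not the authors' released code) of the hierarchical
optimizer: a tiny per-parameter RNN proposes updates, a per-tensor RNN reads
the mean state of that tensor's parameters, and one global RNN reads the mean
state of all tensor RNNs and biases them in turn. Cell type, hidden sizes,
the raw-gradient input feature, and the toy quadratic task are assumptions."""
import numpy as np

rng = np.random.default_rng(0)
P_HID, T_HID, G_HID = 4, 6, 8   # smallest state per parameter keeps memory overhead low
OUT_SCALE = 0.1                 # keeps the (untrained) optimizer's steps small

def cell(x, h, W):
    """Plain tanh RNN cell: h' = tanh(W @ [x; h; 1])."""
    return np.tanh(W @ np.concatenate([x, h, [1.0]]))

# Randomly initialized weights; meta-training would learn these.
W_p = rng.normal(0, 0.3, (P_HID, 1 + T_HID + P_HID + 1))      # [gradient, tensor bias] -> parameter state
W_t = rng.normal(0, 0.3, (T_HID, P_HID + G_HID + T_HID + 1))  # [mean param state, global bias] -> tensor state
W_g = rng.normal(0, 0.3, (G_HID, T_HID + G_HID + 1))          # mean tensor state -> global state
w_out = rng.normal(0, 0.3, P_HID)                             # parameter state -> scalar update

def init_state(tensors):
    return ([np.zeros((t.size, P_HID)) for t in tensors],  # one small state per scalar parameter
            [np.zeros(T_HID) for _ in tensors],             # one state per weight tensor
            np.zeros(G_HID))                                # one global state

def step(tensors, grads, state):
    p_states, t_states, g_state = state
    g_state = cell(np.mean(t_states, axis=0), g_state, W_g)        # global level
    new_tensors = []
    for i, (theta, g) in enumerate(zip(tensors, grads)):
        t_states[i] = cell(np.concatenate([p_states[i].mean(axis=0), g_state]),
                           t_states[i], W_t)                        # tensor level
        update = np.empty(theta.size)
        for j, gj in enumerate(g.ravel()):                          # parameter level
            p_states[i][j] = cell(np.concatenate([[gj], t_states[i]]),
                                  p_states[i][j], W_p)
            update[j] = OUT_SCALE * (w_out @ p_states[i][j])
        new_tensors.append(theta - update.reshape(theta.shape))
    return new_tensors, (p_states, t_states, g_state)

# Toy usage: five steps on f(x) = ||x||^2 / 2, whose gradient is x itself.
tensors = [rng.normal(size=(3, 2)), rng.normal(size=(4,))]
state = init_state(tensors)
for _ in range(5):
    tensors, state = step(tensors, [t.copy() for t in tensors], state)
print(sum(float(np.sum(t ** 2)) for t in tensors))
```

The ordering P_HID < T_HID < G_HID mirrors the design rationale described in the paper: the state kept per scalar parameter is the smallest, which is what keeps the per-parameter memory overhead low at scale.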
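The per-parameter input features named in contribution 2 can likewise be sketched as a small preprocessing module. The decay rates, epsilon, and the choice to scale before accumulating momentum are illustrative assumptions, not the paper's exact preprocessing.

```python
"""Sketch of RMSProp/Adam-style dynamic input scaling plus multi-timescale
momentum features; constants are illustrative assumptions."""
import numpy as np

class GradientFeatures:
    def __init__(self, shape, betas=(0.5, 0.9, 0.99, 0.999), eps=1e-8):
        self.betas = betas                        # momentum timescales, fast to slow
        self.eps = eps
        self.v = np.zeros(shape)                  # running average of squared gradients
        self.m = [np.zeros(shape) for _ in betas]

    def __call__(self, grad):
        # Dynamic input scaling: divide by a running RMS so the features are
        # roughly invariant to the overall scale of the loss.
        self.v = 0.999 * self.v + 0.001 * grad ** 2
        scaled = grad / np.sqrt(self.v + self.eps)
        # Multi-timescale momentum: slow averages smooth stochastic-gradient
        # noise, fast ones keep high-frequency information.
        feats = [scaled]
        for k, beta in enumerate(self.betas):
            self.m[k] = beta * self.m[k] + (1 - beta) * scaled
            feats.append(self.m[k])
        return np.stack(feats, axis=-1)           # one feature vector per parameter

# Example: features for a (3, 2) weight tensor.
features = GradientFeatures((3, 2))
print(features(np.random.randn(3, 2)).shape)      # (3, 2, 5)
```

Feeding several timescales at once lets the learned RNN decide how much smoothing to apply, rather than committing to a single hand-picked decay rate.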
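One way to write a meta-objective "based on the average log loss" (contribution 4) is as the trajectory-averaged log function value below; the expectation structure and normalization are a hedged reading of the paper's description rather than its exact formula.

```latex
% phi: the optimizer's meta-parameters; f: a task drawn from the meta-training
% ensemble; theta_t(phi): the parameters produced after t steps of the learned
% optimizer; T: the number of unrolled steps, itself sampled from a
% heavy-tailed distribution.
\mathcal{L}(\phi) \;=\;
  \mathbb{E}_{f,\,T}\!\left[\frac{1}{T}\sum_{t=1}^{T}\log f\bigl(\theta_t(\phi)\bigr)\right]
```

Averaging the log of the loss keeps late multiplicative improvements visible in the objective, which pushes the learned optimizer toward precise convergence, while heavy-tailed sampling of T occasionally exposes meta-training to long unrolls.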

Experimental Insights

The experiments show that the learned optimizer outperforms existing optimizers such as ADAM and RMSProp on a suite of tasks from the meta-training corpus, which includes artificially designed functions that simulate common optimization pathologies. More notably, it performs comparably or better on tasks not seen during meta-training, such as small convolutional networks, and it generalizes to training complex architectures such as Inception V3 and ResNet V2 on large-scale datasets like ImageNet.

The learned optimizer is also notably less sensitive to the initial learning rate, a significant advantage over hand-tuned methods. Although the RNN adds wall-clock overhead per step, this overhead becomes negligible relative to the cost of the model itself as mini-batch sizes increase.

Theoretical and Practical Implications

The introduction of a learned optimizer capable of generalizing across problem scales and tasks is a substantial theoretical contribution to meta-learning. Practically, such optimizers can significantly reduce the overhead of heuristic hyperparameter tuning, particularly of learning rates, in deep learning models. The architecture is efficient, captures features known to matter for optimization, and can potentially adapt to a broad range of applications beyond standard neural network training.

Future Directions

This work opens multiple avenues for further research. Future explorations can address late-stage training stagnation observed in large-scale models, refine the communication mechanisms in hierarchical RNNs, and explore the integration of additional elements from optimization literature. Furthermore, investigating alternative architectures or training paradigms that further enhance the scalability and applicability of learned optimizers remains an intriguing prospect.

In summary, this paper articulates a refined approach to creating learned optimizers, emphasizing scalability and generalization, thus contributing significantly to the field of AI through an insightful application of meta-learning principles.