
Learning by Turning: Neural Architecture Aware Optimisation (2102.07227v2)

Published 14 Feb 2021 in cs.NE and cs.LG

Abstract: Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is ~ square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.

Citations (25)

Summary

  • The paper introduces Nero, a neural optimisation algorithm that integrates geometric constraints via neuron-specific updates and projected gradient descent.
  • The paper shows that Nero's memory footprint is roughly the square root of Adam's and that it trains reliably without momentum or weight decay.
  • The paper demonstrates consistent performance across image classification, generation, NLP, and reinforcement learning, highlighting training stability and efficiency.

Overview of "Learning by Turning: Neural Architecture Aware Optimisation"

The paper "Learning by Turning: Neural Architecture Aware Optimisation" presents Nero, a novel neural network optimisation algorithm, designed with an intrinsic understanding of neural architecture. Unlike typical optimisation methods, which rely heavily on hyperparameter tuning and are often dependent on the neural architecture employed, Nero offers a more robust alternative by integrating architectural insights directly into the optimisation process.

Contributions and Findings

  1. Nero Optimiser: The paper introduces Nero, the "neuronal rotator", whose memory footprint is roughly the square root of that of optimisers like Adam or LAMB (a rough back-of-the-envelope count follows this list). Unlike these popular methods, Nero does not use momentum or weight decay and trains effectively with little to no learning rate tuning.
  2. Experimental Validation: The experiments span image classification, image generation, natural language processing, and reinforcement learning tasks. Across these benchmarks, Nero either outperforms or competes strongly with carefully tuned baseline optimisers, even when used at its default settings.
  3. Theoretical Insights: By connecting geometrical aspects of neural architecture to optimisation, the paper suggests potential enhancements in theories of generalisation. This geometrical approach, particularly the focus on neuron-specific updates and constraints—such as projecting onto a balanced network—underpins Nero's design and successes.
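
The memory claim in item 1 has a simple reading: Adam keeps statistics for every parameter, while a neuron-level optimiser can keep them per neuron, and for a roughly square weight matrix the number of neurons is about the square root of the number of parameters. The snippet below is a hypothetical back-of-the-envelope count with made-up layer sizes, not a figure from the paper.

```python
# Hypothetical count: optimiser state for one 1024 x 1024 linear layer.
n_out, n_in = 1024, 1024
n_params = n_out * n_in        # 1,048,576 weights

adam_state = 2 * n_params      # Adam: first + second moment per weight
per_neuron_state = n_out       # one running statistic per output neuron

print(f"Adam state:       {adam_state:,} floats")
print(f"Per-neuron state: {per_neuron_state:,} floats (~ sqrt of {n_params:,})")
```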

Technical Approach and Methodology

Nero's design combines two core ideas, sketched in code after the list below:

  • Projected Gradient Descent: Optimisation is carried out over the space of balanced networks: after each gradient step, every neuron is projected back onto its per-neuron constraint set. This keeps training stable and removes the need for weight decay or elaborate initialisation schemes.
  • Neuron-Specific Updates: The step size sets the angle through which each neuron's hyperplane turns, so every neuron receives an update scaled to its own geometry, which makes training markedly less brittle.
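
A minimal sketch of how these two ideas might combine for a single fully connected layer is given below. It assumes the balanced-network constraint means zero-mean, unit-norm incoming weights per neuron and keeps one running second-moment scalar per neuron; the function names, hyperparameters, and layer sizes are illustrative, not the paper's reference implementation.

```python
import torch

def project_balanced(w, eps=1e-8):
    """Project each neuron's incoming weights (rows of w) onto the
    assumed 'balanced' constraint set: zero mean and unit norm."""
    w = w - w.mean(dim=1, keepdim=True)
    return w / (w.norm(dim=1, keepdim=True) + eps)

def nero_like_step(w, grad, state, lr=0.01, beta=0.999, eps=1e-8):
    """One hypothetical Nero-style update for a 2D weight matrix.

    Keeps a single running second-moment scalar per output neuron (per
    row of w), so optimiser state is O(#neurons), not O(#parameters)
    as in Adam."""
    # Per-neuron running average of the squared gradient.
    grad_sq = grad.pow(2).mean(dim=1)
    state["v"] = beta * state["v"] + (1 - beta) * grad_sq

    # Per-neuron normalised step; with rows kept at unit norm, lr
    # roughly controls the angle through which each row turns.
    w = w - lr * grad / (state["v"].sqrt().unsqueeze(1) + eps)

    # Projected gradient descent: return to the constraint set.
    return project_balanced(w)

# Toy usage: 4 neurons with 8 inputs each.
torch.manual_seed(0)
w = project_balanced(torch.randn(4, 8))
state = {"v": torch.zeros(4)}
grad = torch.randn_like(w)              # stand-in for a real gradient
w = nero_like_step(w, grad, state)
print(w.mean(dim=1), w.norm(dim=1))     # rows stay ~zero-mean, unit-norm
```

Because the rows are re-normalised after every step, the scaled gradient effectively rotates each neuron's hyperplane, which is where the name "neuronal rotator" comes from.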

Results and Implications

Numerical results in the paper show consistent performance advantages. Nero proved robust across a range of applications, from GAN training, where it achieved stronger (lower) Fréchet Inception Distance scores, to classical benchmarks such as CIFAR-10 image classification. Its consistently near-optimal performance with minimal tuning highlights its potential to simplify model deployment across diverse datasets and neural architectures.

Implications for Future AI Developments

The implications of this work are multifaceted:

  • Training Stability: By diminishing dependency on hyperparameter tuning through architecture-aware design, Nero provides a paradigm for more stable and reliable training processes.
  • Resource Efficiency: With its reduced memory footprint and its ability to train larger models, including deep networks that lack conventional aids such as batch norm, Nero offers promising scalability for resource-intensive applications.
  • Theoretical Exploration: The paper’s geometric insights may further influence the development of generalisation theories, offering clearer understandings of model behaviour in high-dimensional weight spaces.

Future Directions

The paper opens pathways for future research, particularly in expanding architectural awareness in optimisation. Investigating more nuanced components of neural architectures could lead to even more tailored and effective optimisation strategies. Moreover, extending Nero’s framework to incorporate regularisation techniques could further align training objectives with desired generalisation properties.

Overall, the integration of architectural considerations into optimisation, as showcased by Nero, underscores a promising shift towards more informed and efficient neural network training methodologies.
