Understanding the Generalization Benefits of Late Learning Rate Decay (2401.11600v1)

Published 21 Jan 2024 in cs.LG and stat.ML

Abstract: Why do neural networks trained with large learning rates for a longer time often lead to better generalization? In this paper, we delve into this question by examining the relation between training and testing loss in neural networks. Through visualization of these losses, we note that the training trajectory with a large learning rate navigates through the minima manifold of the training loss, finally nearing the neighborhood of the testing loss minimum. Motivated by these findings, we introduce a nonlinear model whose loss landscapes mirror those observed for real neural networks. Upon investigating the training process using SGD on our model, we demonstrate that an extended phase with a large learning rate steers our model towards the minimum norm solution of the training loss, which may achieve near-optimal generalization, thereby affirming the empirically observed benefits of late learning rate decay.

Authors (3)
  1. Yinuo Ren (14 papers)
  2. Chao Ma (187 papers)
  3. Lexing Ying (159 papers)
Citations (4)

Summary

Understanding the Generalization Benefits of Late Learning Rate Decay

The paper "Understanding the Generalization Benefits of Late Learning Rate Decay" authored by Yinuo Ren, Chao Ma, and Lexing Ying, provides a comprehensive analysis of the empirical observation that maintaining a high learning rate for an extended period during stochastic gradient descent (SGD) training can enhance the generalization performance of neural networks. The paper aims to bridge the gap between training and testing losses by proposing a novel nonlinear model that replicates the loss landscapes observed in practical neural networks.

Main Findings and Methodology

The authors focus on the mismatch between training and testing loss landscapes, particularly in overparameterized models, where minimizing the training loss does not by itself guarantee good generalization. Through experimental visualizations, the paper shows that training trajectories with a large learning rate first navigate a broad manifold of minima in the training loss landscape and only later approach the neighborhood of the testing loss minimum. This observation motivates the subsequent analysis of late learning rate decay.
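
A minimal sketch of this kind of visualization, using an overparameterized linear regression problem as a stand-in for a neural network (the problem sizes, learning rate, and PCA-based projection are illustrative assumptions, not the paper's setup):

```python
# Sketch: project SGD parameter checkpoints onto their top-2 PCA directions and
# evaluate the training and testing losses on the spanned plane.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 20, 200, 100               # overparameterized: d >> n_train
w_star = rng.normal(size=d) / np.sqrt(d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train, y_test = X_train @ w_star, X_test @ w_star

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

# SGD with a constant (relatively large) learning rate; keep checkpoints along the way.
w = rng.normal(size=d)
lr, n_steps, batch = 0.02, 2000, 4
checkpoints = [w.copy()]
for _ in range(n_steps):
    idx = rng.choice(n_train, size=batch, replace=False)
    grad = X_train[idx].T @ (X_train[idx] @ w - y_train[idx]) / batch
    w -= lr * grad
    checkpoints.append(w.copy())

# PCA of the centered checkpoints gives a 2D plane containing most of the trajectory.
W = np.array(checkpoints)
mean_w = W.mean(axis=0)
_, _, Vt = np.linalg.svd(W - mean_w, full_matrices=False)
plane = Vt[:2]                                  # two principal directions in parameter space
coords = (W - mean_w) @ plane.T                 # trajectory in plane coordinates

# Evaluate both losses on a grid in that plane; the training loss shows a wide, flat
# valley of near-minima, while the testing loss has a much more localized basin.
grid = np.linspace(coords.min() - 1.0, coords.max() + 1.0, 50)
train_surface = np.array([[loss(mean_w + a * plane[0] + b * plane[1], X_train, y_train)
                           for a in grid] for b in grid])
test_surface = np.array([[loss(mean_w + a * plane[0] + b * plane[1], X_test, y_test)
                          for a in grid] for b in grid])
print(train_surface.min(), test_surface.min())
```

Plotting `coords` over contour plots of `train_surface` and `test_surface` reproduces the qualitative picture described above: a wide, flat valley of training minima containing the trajectory, and a narrower testing basin whose minimum the trajectory only approaches late in training.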

To explain these behaviors, the paper introduces a nonlinear reparametrization model inspired by the layered structure of neural networks. The model captures the essential characteristics of real loss landscapes: flat, extended manifolds of training loss minima versus the isolated minimum of the testing loss. The construction also highlights how network depth shapes the curvature of the loss surface, connecting the flatness explored by the training trajectory to near-optimal testing performance.
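
As a rough illustration of what such a reparametrization buys (a common toy construction, not necessarily the paper's exact model), consider a linear predictor whose weights are factored elementwise as theta = u * v. Every factorization of an interpolating theta lies on a zero-training-loss manifold, but the local curvature depends on where on that manifold the iterates sit:

```python
# Toy reparametrized model: prediction X @ (u * v). The set of zero-training-loss
# points {(u, v): u * v interpolates the data} forms a manifold, and its local
# sharpness varies along the manifold.
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50
X = rng.normal(size=(n, d))
theta_star = np.zeros(d)
theta_star[:5] = 1.0                              # sparse ground truth
y = X @ theta_star

def train_loss(u, v):
    return 0.5 * np.mean((X @ (u * v) - y) ** 2)

def sharpness(u, v, n_probe=20, eps=1e-3):
    """Crude curvature probe: average second difference along random directions."""
    base = train_loss(u, v)
    vals = []
    for _ in range(n_probe):
        du, dv = rng.normal(size=d), rng.normal(size=d)
        scale = eps / np.sqrt(np.sum(du ** 2) + np.sum(dv ** 2))
        du, dv = du * scale, dv * scale
        vals.append((train_loss(u + du, v + dv) + train_loss(u - du, v - dv) - 2 * base) / eps ** 2)
    return np.mean(vals)

# Two points on the same zero-loss manifold: identical predictor theta, different
# factorizations, different local curvature.
theta_fit = X.T @ np.linalg.solve(X @ X.T, y)     # one exactly interpolating predictor
u_balanced = np.sign(theta_fit) * np.sqrt(np.abs(theta_fit))
v_balanced = np.sqrt(np.abs(theta_fit))
u_skewed, v_skewed = 10.0 * u_balanced, 0.1 * v_balanced

print("train loss (balanced):", train_loss(u_balanced, v_balanced))
print("train loss (skewed):  ", train_loss(u_skewed, v_skewed))
print("sharpness (balanced): ", sharpness(u_balanced, v_balanced))
print("sharpness (skewed):   ", sharpness(u_skewed, v_skewed))
```

Deeper factorizations (for example, a depth-L elementwise product) exaggerate this curvature variation, which is the sense in which depth shapes the training loss surface in such toy models.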

Analysis of Training Phases

The authors decompose the training process under a large learning rate into three phases, with a schematic learning rate schedule sketched after the list:

  1. Phase I: Initial Large Learning Rate – The trajectory closely follows the gradient flow, since the gradient dominates the noise, and is driven toward the vicinity of the manifold of training loss minima.
  2. Phase II: Extended Large Learning Rate – The trajectory drifts along the minima manifold. The implicit regularization induced by the structure of the SGD noise (often modeled as label noise) steers the iterates toward the minimum norm solution of the training loss.
  3. Phase III: Decayed Learning Rate – Once the learning rate is decayed, the trajectory again follows the gradient flow and converges rapidly to a nearby point on the manifold, locking in the generalization benefit of late learning rate decay.
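
A minimal sketch of such a schedule in PyTorch (the step counts, rates, and decay factor below are illustrative assumptions; the paper's analysis is carried out in continuous time and does not prescribe them):

```python
# Three-phase picture as a concrete schedule: keep the learning rate large for most of
# training (Phases I-II), then decay it near the end (Phase III).
import torch

params = [torch.zeros(10, requires_grad=True)]              # placeholder parameters
optimizer = torch.optim.SGD(params, lr=0.4)                  # large base learning rate

total_steps, decay_step = 10_000, 9_000                      # "late" decay: large lr for 90% of steps

def late_decay(step):
    return 1.0 if step < decay_step else 0.01                # Phase III multiplier

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=late_decay)

for step in range(total_steps):
    # ... forward pass, loss.backward(), and real data would go here ...
    optimizer.step()                                         # no-op here (no gradients), kept for shape
    scheduler.step()

print(optimizer.param_groups[0]["lr"])                       # 0.4 * 0.01 after the late decay
```

In practice the scheduler would be attached to a real model and data loader; the only point of the sketch is that the decay happens late, after the large-learning-rate phases have had time to traverse the minima manifold.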

A significant contribution is the demonstration that, during Phase II, the parameter trajectory approaches the minimum L2-norm solution of the training loss, a result established through a continuous-time analysis of the SGD dynamics. This implicit regularization helps explain the generalization advantages observed empirically across tasks.
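
To make concrete why the minimum L2-norm solution is the desirable one, here is a small worked example in the simplest overparameterized setting, noiseless linear regression (the sizes are illustrative and this is not the paper's model): among all parameter vectors that fit the training data exactly, the minimum-norm one generalizes best.

```python
# Compare the minimum-norm interpolator with another exact interpolator.
import numpy as np

rng = np.random.default_rng(2)
n, d, n_test = 20, 200, 1000
w_star = rng.normal(size=d) / np.sqrt(d)
X, X_test = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y, y_test = X @ w_star, X_test @ w_star

# Minimum L2-norm solution of X w = y.
w_min = np.linalg.pinv(X) @ y

# Another exact interpolator: the min-norm solution plus a vector in the null space of X.
null_dir = rng.normal(size=d)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)     # remove the row-space component
w_other = w_min + 0.5 * null_dir

for name, w in [("min-norm", w_min), ("other interpolator", w_other)]:
    train_err = np.mean((X @ w - y) ** 2)
    test_err = np.mean((X_test @ w - y_test) ** 2)
    print(f"{name:>18}: train {train_err:.2e}, test {test_err:.3f}, ||w|| {np.linalg.norm(w):.2f}")
```

Both vectors drive the training error to essentially zero, but the minimum-norm solution attains a much smaller testing error. In the paper's nonlinear model the same preference emerges dynamically: the noise-driven drift of Phase II moves the iterates along the manifold of interpolating solutions toward the one with the smallest norm.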

Implications and Future Directions

The implications of this research are both theoretical and practical. The model elucidates how SGD-driven training moves across loss landscapes, offering insight into the interplay between training accuracy and generalization. Furthermore, it provides an analytical lens for studying late learning rate decay, paving the way for better-designed learning rate schedules in complex architectures.

Future work could extend the analysis to the discrete-time setting encountered in real-world SGD implementations. Further investigations might also explore how this training regime interacts with other optimization techniques, such as momentum and adaptive learning rates, broadening the understanding of these phenomena and their impact on large-scale deep learning applications.

In conclusion, this paper significantly contributes to understanding why late learning rate decay aids neural networks in achieving superior generalization, leveraging a novel theoretical model to elucidate practical training behaviors observed in deep learning architectures.
