Understanding the Role of Optimization in Double Descent (2312.03951v1)

Published 6 Dec 2023 in cs.LG and stat.ML

Abstract: The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice (Belkin et al., 2018). Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.

Introduction to Model-wise Double Descent

Model-wise double descent refers to the counterintuitive behavior where a model's test error first decreases, then rises as the model size approaches the interpolation threshold, and finally decreases again as the model grows larger still. This challenges classical generalization theory and has garnered significant interest.
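As a concrete illustration (not taken from the paper's experiments; the data, widths, and noise level below are assumptions), the following Python sketch reproduces the characteristic model-wise double descent curve with a random feature model fit by minimum-norm least squares. The test error typically peaks when the number of random features is close to the number of training samples and falls again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20

# Linear teacher with label noise on the training set.
w_star = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_star

for width in [10, 50, 90, 100, 110, 200, 1000]:
    # Random feature model: fixed random first layer, trained linear readout.
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    phi_tr = np.maximum(X_tr @ W, 0)          # ReLU random features
    phi_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(phi_tr) @ y_tr      # minimum-norm least-squares readout
    test_mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"width={width:5d}  test MSE={test_mse:8.3f}")
```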

Optimization's Impact on Double Descent

This research examines the phenomenon from the perspective of optimization. It argues that factors usually treated as separate contributors (model initialization, learning rate, feature scaling, normalization, batch size, and the choice of optimization algorithm) are unified through their effect on optimization: model-wise double descent appears if and only if the optimizer can find a sufficiently low-loss minimum. Each of these factors directly or indirectly influences the condition number of the optimization problem or of the optimizer, i.e., the ratio of the largest to the smallest singular value of the feature matrix. The condition number governs how easily the optimizer reaches a low-loss minimum and therefore how pronounced the double descent peak becomes.
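The sketch below (an illustrative setup, not the paper's; the feature dimensions and the per-feature standardization step are assumptions) computes the condition number of a random feature matrix as the ratio of its largest to smallest nonzero singular value, and shows how a simple normalization choice can change it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 200, 20, 400

X = rng.normal(size=(n, d))
W = rng.normal(size=(d, width))
phi = np.maximum(X @ W, 0)                 # ReLU random feature matrix

def condition_number(features):
    # Ratio of the largest to the smallest nonzero singular value.
    s = np.linalg.svd(features, compute_uv=False)
    return s.max() / s[s > 1e-10 * s.max()].min()

# Per-feature standardization: one example of a normalization choice
# that alters the conditioning of the least-squares problem.
phi_std = (phi - phi.mean(axis=0)) / (phi.std(axis=0) + 1e-12)

print("raw features:          cond =", condition_number(phi))
print("standardized features: cond =", condition_number(phi_std))
```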

Empirical Observations and Implications for Real-World Application

The paper's controlled experiments on random feature models and two-layer neural networks under various optimization settings show that double descent does not always manifest and is unlikely to be a problem in practical applications. Real-world models are typically tuned with validation sets, and common regularization techniques further suppress the peak. Moreover, a strong double descent peak usually requires far more training iterations than practitioners use once a model has converged.

Exploring the Underlying Causes and Solutions

Further investigation shows that when a training setup does not initially display double descent, training for much longer can make the peak emerge. This indicates that training duration is a simple yet significant factor in whether double descent occurs in a given setting. Taken together, the analysis underscores the importance of optimization details in understanding double descent and opens the door to new theoretical explanations of the phenomenon.
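To illustrate this point, here is a hedged sketch (the widths, step budgets, and learning rate are assumptions, not the paper's settings): near the interpolation threshold, a small budget of full-batch gradient steps leaves the training loss relatively high, while a much larger budget drives the readout toward the minimum-norm interpolator, which typically comes with a higher test error, i.e., the peak emerging only with longer training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, width = 100, 1000, 20, 110   # width just past n_train

w_star = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_star

W = rng.normal(size=(d, width)) / np.sqrt(d)
phi_tr = np.maximum(X_tr @ W, 0)                  # ReLU random features
phi_te = np.maximum(X_te @ W, 0)

# Full-batch gradient descent on the squared loss of the linear readout.
A = phi_tr.T @ phi_tr                             # precomputed Gram matrix
b = phi_tr.T @ y_tr
lr = 1.0 / np.linalg.norm(phi_tr, 2) ** 2         # stable step size (1 / Lipschitz constant)

beta = np.zeros(width)
steps_done = 0
for budget in [100, 10_000, 200_000]:
    for _ in range(budget - steps_done):          # continue training up to the next budget
        beta -= lr * (A @ beta - b)
    steps_done = budget
    train_mse = np.mean((phi_tr @ beta - y_tr) ** 2)
    test_mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"steps={budget:>7}  train MSE={train_mse:.4f}  test MSE={test_mse:.3f}")
```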

Authors (2)
  1. Chris Yuhao Liu (9 papers)
  2. Jeffrey Flanigan (18 papers)