Escaping Saddle Points with Adaptive Gradient Methods (1901.09149v2)

Published 26 Jan 2019 in cs.LG, math.OC, and stat.ML

Abstract: Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.
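To make the "preconditioned SGD" view concrete, here is a minimal sketch (not the paper's exact algorithm or notation) of an RMSProp-style update written explicitly as SGD with a diagonal preconditioner that is estimated online from squared gradients; the names `grad_fn`, `beta`, and `eps` are illustrative choices rather than quantities from the paper.

```python
import numpy as np

def rmsprop_as_preconditioned_sgd(x0, grad_fn, lr=1e-3, beta=0.9,
                                  eps=1e-8, num_steps=1000):
    """Sketch of an adaptive method viewed as preconditioned SGD:
    x <- x - lr * G^{-1/2} g, with diagonal G estimated online."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                    # online estimate of E[g^2] (diagonal preconditioner)
    for _ in range(num_steps):
        g = grad_fn(x)                      # stochastic gradient
        v = beta * v + (1 - beta) * g**2    # exponential moving average of squared gradients
        precond = 1.0 / (np.sqrt(v) + eps)  # G^{-1/2}: rescales gradient noise toward isotropic
        x = x - lr * precond * g            # preconditioned SGD step
    return x

# Usage example: an ill-conditioned quadratic with noisy gradients (purely illustrative).
rng = np.random.default_rng(0)
A = np.diag([100.0, 1.0])
noisy_grad = lambda x: A @ x + 0.1 * rng.standard_normal(2)
x_final = rmsprop_as_preconditioned_sgd(np.array([1.0, 1.0]), noisy_grad)
```

In this reading, the adaptive part of the method is entirely in the online estimate `v`; the step itself is ordinary SGD after rescaling by the estimated preconditioner, which is the decomposition the abstract describes.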

Authors (5)
  1. Matthew Staib (6 papers)
  2. Sashank J. Reddi (43 papers)
  3. Satyen Kale (50 papers)
  4. Sanjiv Kumar (123 papers)
  5. Suvrit Sra (124 papers)
Citations (72)
