How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization (2310.01769v3)

Published 3 Oct 2023 in cs.LG, math.OC, and stat.ML

Abstract: This paper rigorously shows how over-parameterization changes the convergence behavior of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting, where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$ and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound for randomly initialized GD in the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$. Next, we study the asymmetric setting, where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1, n_2\}$ and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$, where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact-convergence result for randomly initialized GD in the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact-convergence result for the over-parameterization case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterized case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega(1/T^2)$ to linear convergence. On the other hand, we propose a novel method that modifies only one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate of the exact-parameterization case.
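To make the two parameterizations concrete, here is a minimal NumPy sketch of the matrix sensing setup the abstract describes: GD on the symmetric factorization $XX^\top$ versus GD on the asymmetric factorization $FG^\top$, both over-parameterized ($k > r$) with small random initialization. This is not the authors' code; all problem sizes and hyperparameters (`n`, `r`, `k`, `m`, `alpha`, `eta`, `T`) are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

# Minimal sketch of matrix sensing with a factorized parameterization.
# Sizes and hyperparameters are illustrative assumptions only.
rng = np.random.default_rng(0)
n, r, k, m = 20, 2, 5, 2000   # k > r: over-parameterized factor width

# PSD ground truth M* = U U^T of rank r.
U = rng.standard_normal((n, r))
M_star = U @ U.T

# Near-isotropic linear measurements y_i = <A_i, M*> with symmetrized
# Gaussian sensing matrices A_i.
A = rng.standard_normal((m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('mij,ij->m', A, M_star)

def sense(M):
    """Measurement operator: vector of inner products <A_i, M>."""
    return np.einsum('mij,ij->m', A, M)

def backproject(res):
    """Adjoint-style map S = (1/m) * sum_i res_i * A_i."""
    return np.einsum('m,mij->ij', res, A) / m

def gd_symmetric(alpha=1e-3, eta=2e-3, T=2000):
    """GD on f(X) = (1/2m) sum_i (<A_i, X X^T> - y_i)^2 with k > r."""
    X = alpha * rng.standard_normal((n, k))
    for _ in range(T):
        S = backproject(sense(X @ X.T) - y)
        X = X - eta * 2 * S @ X   # grad = (2/m) sum_i res_i A_i X (A_i symmetric)
    return np.linalg.norm(X @ X.T - M_star, 'fro')

def gd_asymmetric(alpha=1e-3, eta=2e-3, T=2000):
    """GD on f(F, G) = (1/2m) sum_i (<A_i, F G^T> - y_i)^2, applying the
    asymmetric parameterization to the symmetric ground truth."""
    F = alpha * rng.standard_normal((n, k))
    G = alpha * rng.standard_normal((n, k))
    for _ in range(T):
        S = backproject(sense(F @ G.T) - y)
        F, G = F - eta * S @ G, G - eta * S.T @ F   # simultaneous update
    return np.linalg.norm(F @ G.T - M_star, 'fro')

print('over-parameterized symmetric  XX^T error:', gd_symmetric())
print('over-parameterized asymmetric FG^T error:', gd_asymmetric())
```

Under a small initialization scale $\alpha$, the abstract's results predict qualitatively different behavior for the two runs: the symmetric parameterization is subject to the $\Omega(1/T^2)$ lower bound, while the asymmetric parameterization converges linearly at a rate scaling with $\alpha^2$; the paper's proposed one-step modification of GD further removes that dependence on $\alpha$.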

Authors (3)
  1. Nuoya Xiong (8 papers)
  2. Lijun Ding (28 papers)
  3. Simon S. Du (120 papers)
Citations (6)
