Implicit Bias of Gradient Descent

Updated 23 November 2025
  • Implicit bias of gradient descent is the phenomenon where optimization dynamics select specific global minimizers in overparameterized models without explicit regularization.
  • In linear models, gradient descent drives the iterates toward the hard-margin SVM solution, the prototypical instance of its margin-maximizing behavior; analogous dynamics induce low-rank structure in factorized models.
  • Recent advances extend this bias to deep and non-homogeneous networks, matrix factorizations, and adversarial setups, highlighting its impact on generalization and robustness.

The implicit bias of gradient descent refers to the phenomenon whereby gradient-based optimization selects specific solutions among the many global minimizers of non-regularized, highly overparameterized models. This bias arises exclusively from the optimization dynamics, in the absence of explicit regularization. In modern machine learning, especially in deep and overparameterized networks, the implicit bias of gradient descent plays a decisive role in generalization, solution structure, and inductive properties of neural networks. Recent theoretical advances precisely characterize these biases across various architectures, problem settings, and loss functions, revealing the mechanisms by which gradient descent preferentially selects solutions with distinct margin, norm, rank, or structural properties.

1. Implicit Bias in Linear Models and Classical Results

The prototypical setting illustrating implicit bias is unregularized logistic or exponential-loss regression on linearly separable data. In this context, gradient descent (GD) does not converge to a finite minimizer; instead, the norm of the parameter vector diverges, while its direction converges to the hard-margin (ℓ₂) support vector machine (SVM) solution. The precise result is that for data $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \{\pm 1\}$, under mild conditions, the GD iterates satisfy

$$w_t = \hat w \log t + \rho(t), \qquad \|\rho(t)\| = O(1), \qquad \hat w = \arg\min_w \|w\|_2^2 \ \text{ s.t. } \ y_i w^\top x_i \ge 1 \ \forall i,$$

with the iterates aligning directionally:

$$\lim_{t\to\infty} \frac{w_t}{\|w_t\|} = \frac{\hat w}{\|\hat w\|}.$$

This result is robust across a wide class of monotonic, strictly decreasing loss functions with an exponential tail, is agnostic to initialization, and holds for sufficiently small step sizes (Soudry et al., 2017). A primal-dual mirror-descent analysis further provides an exact characterization of the limiting direction and establishes tight convergence rates to the maximal margin direction (Ji et al., 2019).
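As a concrete illustration, the following minimal NumPy sketch runs plain gradient descent on the average logistic loss over a toy separable dataset constructed so that the hard-margin direction is known in advance; the parameter norm keeps growing while the normalized iterate drifts toward that direction. The dataset, step size, and iteration count are illustrative choices, not taken from the cited papers.

```python
import numpy as np

# Toy separable dataset built so the hard-margin SVM direction is e1 = (1, 0):
# the four points at (+-1, +-1) are the support vectors.
X = np.array([[ 1.0,  1.0], [ 1.0, -1.0], [ 3.0,  0.5],
              [-1.0,  1.0], [-1.0, -1.0], [-3.0, -0.5]])
y = np.array([ 1.0,  1.0,  1.0, -1.0, -1.0, -1.0])

def grad_logistic(w):
    # gradient of (1/n) * sum_i log(1 + exp(-y_i w^T x_i))
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))
    return (coef[:, None] * X).mean(axis=0)

w, eta = np.zeros(2), 0.1
for t in range(100_000):
    w -= eta * grad_logistic(w)

print("parameter norm (grows like log t):", np.linalg.norm(w))
print("normalized iterate (slowly approaches (1, 0)):", w / np.linalg.norm(w))
```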

In the multiclass case, for permutation-equivariant, relative-margin (PERM) losses with an exponential tail (including softmax cross-entropy), a similar max-margin bias holds: GD selects the unique solution to the multiclass hard-margin SVM, with the direction of the parameter matrix $W$ given by the unique minimizer of

$$\min_W \sum_{k=1}^K \|w_k\|^2 \quad \text{subject to} \quad m_i(W) \ge 1 \ \forall i$$

for appropriate multiclass margin $m_i(W)$ (Ravi et al., 2 Nov 2024). This framework precisely bridges the binary and multiclass settings.
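The exact multiclass margin $m_i(W)$ is not reproduced above; the sketch below assumes the standard relative-margin definition (true-class score minus the best competing score), the usual choice associated with softmax cross-entropy, and is only meant to make the constraint $m_i(W) \ge 1$ concrete.

```python
import numpy as np

def multiclass_margin(W, x, y):
    """Relative margin of sample (x, y): true-class score minus the best competing
    class score. This specific definition is an assumption for illustration."""
    scores = W @ x                    # W: (K, d) class-weight matrix, x: (d,) input
    competing = np.delete(scores, y)  # scores of all classes except the label y
    return scores[y] - competing.max()
```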

2. Extension to Deep, Homogeneous, and Non-Homogeneous Networks

For deep networks, the implicit bias is strongly influenced by both the parameterization and the network architecture. In exactly homogeneous networks, such as fully connected or convolutional linear nets of depth $L$, gradient descent or gradient flow on exponential-type losses has been shown to select maximal margin solutions under non-Euclidean or "quasi-norm" penalties imposed by the network parametrization (Yun et al., 2020, Gunasekar et al., 2018). Specifically:

  • In fully connected linear networks of depth $L$, gradient flow selects the maximal ℓ₂-margin solution regardless of depth: the implicit regularization is depth-invariant.
  • In depth-$L$ (full-width) linear convolutional networks, gradient descent with an exponential-tailed loss yields a predictor whose Fourier coefficients minimize an $\ell_{2/L}$ "bridge" quasinorm subject to margin constraints:

$$\hat w = \arg\min_{\hat w} \|\hat w\|_{2/L}^{2/L} \quad \text{s.t. } \ y_i\, \Re\langle \hat w, \hat x_i \rangle \ge 1$$

with deep networks ($L > 2$) biasing towards sparser solutions in the frequency domain (Gunasekar et al., 2018, Yun et al., 2020); a small numerical sketch of this quasinorm follows below.
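The sketch below evaluates this quantity on the DFT coefficients of two toy predictors, one with a dense spectrum and one with a sparse spectrum, showing that a larger depth $L$ increasingly favors spectrally sparse predictors; the signals are illustrative and not drawn from the cited experiments.

```python
import numpy as np

def bridge_quasinorm(w, L):
    """||w_hat||_{2/L}^{2/L} of the DFT coefficients of a linear predictor w, the
    quantity identified above as implicitly minimized by depth-L linear conv nets."""
    w_hat = np.fft.fft(w)
    return float(np.sum(np.abs(w_hat) ** (2.0 / L)))

dense  = np.zeros(16); dense[0] = 1.0              # delta filter: flat, dense spectrum
sparse = np.cos(2 * np.pi * np.arange(16) / 16)    # single frequency: two nonzero DFT bins
for L in (2, 4, 8):
    print(L, bridge_quasinorm(dense, L), bridge_quasinorm(sparse, L))
# As L grows, the sparse-spectrum predictor receives a much smaller value than the dense one.
```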

For generalized gated linear networks (GLNs), the implicit bias is dictated by the group-lasso (GLN-norm) structure induced by the architecture, and is captured by a convex optimization over the squared norm of the (shared) parameters subject to margin constraints per context (Lippl et al., 2022). This rigorously bridges architectural constraints and optimization bias.

Recent results extend the theory to non-homogeneous deep networks, incorporating architectures with residual connections or non-homogeneous activation functions. For such models exhibiting "near-homogeneity" (controlled deviation from exact scaling), gradient descent on the exponential loss drives the model parameters to infinite norm, yet the normalized parameters converge directionally to the unique solution of a margin-maximization problem involving the "homogenized" network. Three structural properties are established: nearly monotonic increase in normalized margin, divergence of parameter norm with convergent direction, and limiting satisfaction of KKT conditions for a margin maximization problem with respect to the homogenized network (Cai et al., 22 Feb 2025). This resolves the open problem of implicit bias for wide classes of non-homogeneous architectures.

3. Implicit Bias in Wide Two-Layer and ReLU Networks

Wide two-layer neural networks with homogeneous activations and infinite width, trained to zero loss on exponentially-tailed losses (e.g., logistic), exhibit a novel implicit bias: gradient flow converges to the maximal margin solution in the non-Hilbertian variation-norm function space $F_1$:

$$F_1 = \left\{ f:\mathbb{R}^d \to \mathbb{R} \;\middle|\; f(x) = \int_{S^{p-1}} \phi(\theta, x)\, d\nu(\theta),\ \|\nu\|_{\mathrm{var}} < \infty \right\}$$

with the implicit F₁-norm

$$\|f\|_{F_1} = \inf\left\{ \nu(S^{p-1}) \;\middle|\; f(x) = \int \phi(\theta, x)\, d\nu(\theta),\ \nu \ge 0 \right\}$$

and limiting classifier given by the F₁-max-margin direction (Chizat et al., 2020). Training only the output layer selects the reproducing kernel Hilbert space (RKHS) max-margin classifier, which lacks data-adaptive generalization. Empirical and theoretical analyses confirm that, in high-dimensional data with low-dimensional structure, the F₁-margin is independent of ambient dimension and leads to stronger generalization bounds than the RKHS solution.

For two-layer ReLU or leaky ReLU networks trained on high-dimensional, nearly-orthogonal data, gradient flow and discrete-time GD with small initialization and sufficient width induce a strong low-rank bias. For leaky ReLU ($\gamma \in (0,1)$), the stable rank of the first-layer weights converges to 1, while for pure ReLU, it is bounded by a constant, reflecting rank collapse to the minimal capacity compatible with margin maximization. All normalized margins across data points equalize in the limit, with the solution satisfying the KKT conditions of the corresponding max-margin problem in function space. This establishes a spectral regularization phenomenon that explains generalization and margin equalization across samples (Kou et al., 2023, Frei et al., 2022).
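The stable rank referred to above is $\|W\|_F^2 / \|W\|_2^2$; the helper below computes it from a first-layer weight matrix and is a diagnostic sketch, not code from the cited works.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2: the quantity whose collapse toward 1
    (leaky ReLU) or toward a small constant (ReLU) is described above."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, largest first
    return float((s ** 2).sum() / s[0] ** 2)
```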

Function-space implicit bias under wide networks and mean squared error (MSE) regression is also fully characterized: the learned solution interpolates the data and minimizes a curvature-penalized functional, where with suitable initialization the solution coincides with the natural cubic or polyharmonic spline interpolant (Jin et al., 2020).
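For reference, the natural cubic spline interpolant itself can be constructed directly with SciPy; the snippet below builds only that reference object and makes no claim about reproducing the cited training setup.

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 0.3, 0.7, 1.0])
y = np.sin(2 * np.pi * x)
spline = CubicSpline(x, y, bc_type='natural')  # natural spline: zero second derivative at both ends
print(spline(0.5))                             # reference interpolant value at a test point
```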

4. Matrix Factorization and Low-Rank Bias

The implicit bias of gradient descent in matrix factorization problems fundamentally departs from standard norm or nuclear-norm minimization. For depth-2 factorizations ($X = UV^\top$), gradient flow with infinitesimal initialization is provably equivalent, phase by phase, to a greedy low-rank learning algorithm that incrementally builds a low-rank solution by adding rank-1 directions aligned with the leading eigenvector of the negative gradient. This yields low-rank solutions that may not coincide with global nuclear norm minimizers, explicitly refuting prior conjectures (Li et al., 2020). In deeper factorizations ($L \ge 3$), the greedy rank-minimization effect is even more pronounced due to weaker dependence on the initialization scale, providing a robust low-rank bias even with practical initializations.
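A minimal NumPy sketch of this incremental low-rank fitting under assumed conditions (fully observed rank-3 target, square factors, small random initialization); it illustrates the qualitative phenomenon rather than the greedy algorithm of the cited analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 3
Q1, _ = np.linalg.qr(rng.standard_normal((d, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((d, r)))
M = Q1 @ np.diag([5.0, 3.0, 1.0]) @ Q2.T       # rank-3 target with well-separated scales

scale, eta = 1e-4, 0.05                        # tiny initialization, modest step size
U = scale * rng.standard_normal((d, d))
V = scale * rng.standard_normal((d, d))
for t in range(601):
    R = U @ V.T - M                            # residual; grad_U = R V, grad_V = R^T U
    U, V = U - eta * R @ V, V - eta * R.T @ U
    if t % 50 == 0:
        s = np.linalg.svd(U @ V.T, compute_uv=False)
        print(t, np.round(s[:4], 3))           # singular values tend to emerge one at a time
```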

Recent advances introduce a factorization model ($X = UDV^\top$) that imposes hard constraints on $U$ and $V$ and grows the diagonal factor $D$, yielding truly (exactly) low-rank solutions regardless of initialization and step size. This turns the implicit low-rank bias into an explicit one in both matrix factorization and neural networks, resulting in architectures that learn efficient, low-rank representations with competitive performance (Hou et al., 27 Jan 2025).

5. Optimization Dynamics, Edge of Stability, and Noise

In typical deep learning regimes, gradient descent often operates at the edge of stability (EoS), where step sizes are large and the loss decreases non-monotonically. For logistic regression on separable data, constant step-size GD remains biased towards the max-margin SVM direction even at EoS: the iterates grow logarithmically along the margin direction, while their component in the orthogonal complement converges to the unique minimizer of a convex potential determined by the support vectors (Wu et al., 2023). For the exponential loss, large step sizes induce catastrophic divergence, marking a fundamental difference between the logistic and exponential losses.

At the edge of stability for general losses, the dynamics are governed by a self-stabilization mechanism: moving in the top-eigenvector direction of the Hessian causes cubic corrections in the local Taylor expansion that return the training trajectory back to stability. As a result, gradient descent implicitly performs projected gradient descent over the manifold where the Hessian sharpness is bounded by $2/\eta$ (Damian et al., 2022). This enforces a stability-constrained implicit bias.
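One way to probe this regime is to estimate the sharpness (top Hessian eigenvalue) along training and compare it with $2/\eta$. The sketch below does so via power iteration on finite-difference Hessian-vector products; `grad_fn` is an assumed, user-supplied gradient function of the training loss over a flat parameter vector, and the routine is a diagnostic tool, not part of the cited analysis.

```python
import numpy as np

def sharpness(grad_fn, w, iters=50, eps=1e-3):
    """Estimate the top Hessian eigenvalue of the loss at w by power iteration on
    finite-difference Hessian-vector products H v ~ (g(w + eps v) - g(w - eps v)) / (2 eps).
    Training at the edge of stability keeps this value near 2/eta."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)
        lam = float(np.vdot(v, hv))              # Rayleigh quotient with the unit vector v
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam
```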

Stochastic optimization further introduces implicit noise-geometry through SGD mini-batch noise. Noisy-SGD and differentially private SGD amplify the bias, particularly in underparameterized and diagonal-linear networks, promoting solution shapes dependent on the stochasticity structure. Pure gradient descent lacks this beneficial geometry, accounting for the generalization gap observed in large-batch or noise-free regimes (Sander et al., 13 Feb 2024).

6. Structural Inductive Bias and Adversarial Vulnerability

Recent results reveal fine-grained structural implicit biases in deep networks, especially under the simplest gradient descent (without architectural or loss modification). For data composed of multiple, mutually orthogonal, discriminative features clustered by label, GD-trained two-layer ReLU networks "average" features across clusters: hidden-layer weights align with the class-average of the cluster centers, not with individual features. This feature averaging leads to provable adversarial non-robustness; adversarial directions aligned with the averaged features can flip the class decision with perturbations far smaller in norm than the data points themselves, leading to poor ℓ₂-robustness (Li et al., 14 Oct 2024). Fine-grained (cluster-level) supervision, in contrast, enables the learned hidden features to decouple and align with individual discriminative directions, providing optimal robustness.

This structural bias is distinct from the "simplicity bias" observed in overparameterized networks trained with SGD: under that bias, network solutions collapse to low-dimensional or sparse representations, sometimes at the expense of rich, Bayes-optimal decision boundaries. Across a range of carefully constructed tasks, adaptive optimizers (e.g., Adam) exhibit a richer bias, recovering higher-accuracy, non-collapsing solutions, whereas vanilla SGD-trained networks lean toward simpler, lower-dimensional classifiers (Vasudeva et al., 29 May 2025).


Key cited works:

(Soudry et al., 2017, Chizat et al., 2020, Li et al., 2020, Hou et al., 27 Jan 2025, Gunasekar et al., 2018, Yun et al., 2020, Lippl et al., 2022, Cai et al., 22 Feb 2025, Kou et al., 2023, Frei et al., 2022, Jin et al., 2020, Wu et al., 2023, Damian et al., 2022, Sander et al., 13 Feb 2024, Ravi et al., 2 Nov 2024, Li et al., 14 Oct 2024, Vasudeva et al., 29 May 2025)
