
The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold (2305.01604v3)

Published 2 May 2023 in cs.LG and cond-mat.dis-nn

Abstract: We develop information-geometric techniques to analyze the trajectories of the predictions of deep networks during training. By examining the underlying high-dimensional probabilistic models, we reveal that the training process explores an effectively low-dimensional manifold. Networks with a wide range of architectures and sizes, trained using different optimization methods, regularization techniques, data augmentation techniques, and weight initializations lie on the same manifold in the prediction space. We study the details of this manifold to find that networks with different architectures follow distinguishable trajectories but other factors have a minimal influence; larger networks train along a similar manifold as that of smaller networks, just faster; and networks initialized at very different parts of the prediction space converge to the solution along a similar manifold.

Citations (13)

Summary

  • The paper reveals that deep network training converges to a shared low-dimensional manifold, with the top three dimensions explaining approximately 76% of the predictive variations.
  • It employs an information-geometric framework to embed high-dimensional prediction spaces into lower dimensions, effectively visualizing training dynamics.
  • The findings suggest that despite varied architectures and initializations, DNNs share common training trajectories, prompting a rethink of the complexity in loss landscape exploration.

Exploring the Low-Dimensional Manifolds of Deep Network Training

Observations on Training Manifolds

Recent findings indicate that deep neural networks (DNNs), regardless of their structural diversity and the wide range of training paradigms applied, traverse remarkably consistent and low-dimensional manifolds during the training process. By analyzing the prediction space, an N(C−1)-dimensional space of probabilistic models for a dataset with N samples and C classes, this paper reveals that the manifold navigated by a variety of networks throughout training is effectively of low dimensionality.
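
To make this concrete, the sketch below (an illustration, not code from the paper) shows how a trained classifier's predictions on a fixed dataset can be collected into a single point in this space: each of the N samples contributes a C-dimensional probability vector lying on a (C−1)-dimensional simplex, and stacking them gives the N(C−1)-dimensional representation. The PyTorch model and data loader are assumed placeholders.

```python
# Minimal sketch (illustrative, not the authors' code): collect a trained
# classifier's per-sample class probabilities into a single point of the
# N(C-1)-dimensional prediction space described above.
import numpy as np
import torch
import torch.nn.functional as F

def prediction_point(model, loader, device="cpu"):
    """Stack per-sample class probabilities into one (N, C) array.

    Each row lies on a (C-1)-dimensional probability simplex, so the stacked
    array is one point on an N(C-1)-dimensional manifold of probabilistic models.
    """
    model.eval()
    probs = []
    with torch.no_grad():
        for x, _ in loader:
            logits = model(x.to(device))
            probs.append(F.softmax(logits, dim=-1).cpu().numpy())
    return np.concatenate(probs, axis=0)  # shape (N, C)
```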

Methodological Approach

This paper employs a novel information-geometric framework to examine the trajectories of DNN predictions over the course of training. It proposes a model where each neural network, irrespective of its architecture or the specifics of its training regimen, represents a point in a high-dimensional prediction space. The trajectories formed by these networks during training are then analyzed by embedding these high-dimensional probabilistic models into a lower-dimensional space, providing a detailed visualization of their evolution.
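
The summary does not spell out the embedding procedure, but the general idea can be sketched as follows: compute a divergence between every pair of prediction points (a Bhattacharyya-type distance between probabilistic models is assumed here as one natural choice), then eigendecompose the double-centered matrix of squared distances, in the spirit of classical multidimensional scaling. The function names and the specific divergence below are illustrative assumptions, not necessarily the authors' exact construction.

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """Average per-sample Bhattacharyya distance between two (N, C) prediction points."""
    affinity = np.sum(np.sqrt(np.clip(p, eps, 1.0) * np.clip(q, eps, 1.0)), axis=-1)
    return -np.log(np.clip(affinity, eps, None)).mean()

def embed_checkpoints(points, k=3):
    """MDS-style embedding of a list of (N, C) prediction points (e.g. checkpoints).

    Returns the top-k coordinates, the full eigenvalue spectrum, and the share of
    the spectrum captured by the leading k dimensions.
    """
    m = len(points)
    D2 = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            d = bhattacharyya_distance(points[i], points[j])
            D2[i, j] = D2[j, i] = d ** 2
    J = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    W = -0.5 * J @ D2 @ J                      # double-centered matrix
    evals, evecs = np.linalg.eigh(W)
    order = np.argsort(np.abs(evals))[::-1]    # rank dimensions by |eigenvalue|
    evals, evecs = evals[order], evecs[:, order]
    coords = evecs[:, :k] * np.sqrt(np.abs(evals[:k]))
    explained = np.abs(evals[:k]).sum() / np.abs(evals).sum()
    return coords, evals, explained
```

In this sketch, the `explained` value plays the role of the "fraction of variation captured by the top dimensions" quoted in the results below.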

Key Results

The low-dimensional embeddings of trajectories for more than 150,000 different models, despite varying architectures, optimization algorithms, and other factors, reveal that a significant majority (~76%) of the differences between these models in prediction space can be explained by the top three dimensions of this embedding. This suggests a form of intrinsic simplicity in how DNNs learn, challenging the prevailing notion of the training process as an inherently high-dimensional wander through weight space.

The paper further explores the structure of the training manifold, revealing that:

  • Different architectures manifest distinguishable trajectories within this shared manifold, implying that architecture has a dominant influence on training dynamics.
  • Larger networks progress through similar trajectories as their smaller counterparts but do so at an accelerated pace, effectively traversing the shared manifold more rapidly.
  • Remarkably, neural networks initialized at vastly different parts of the prediction space still converge onto the shared low-dimensional manifold before reaching the end of their training.

Exploring Predictions on Test Data

When assessing model predictions on unseen test data, a comparable low-dimensional manifold is observed, underscoring a fundamental characteristic of deep learning models: their generalization behavior is also confined to a relatively low-dimensional subspace. Notably, trajectories in the test prediction space separate different architectures even more clearly than those in the training prediction space, suggesting that architecture strongly influences not just how models learn but also how they generalize.

Further Analysis and Discussion

This research prompts a reassessment of the complex optimization landscape presumed to characterize DNN training. The convergence of myriad network configurations onto a single low-dimensional manifold suggests that the effective complexity of training DNNs may be much lower than the dimensionality of weight space would indicate.

Further, the investigation into different ways of combining ensemble predictions, specifically the finding that taking the harmonic mean of member models' probabilities achieves slightly better test accuracy, opens new avenues for improving the generalization of deep learning models.
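
To make the comparison concrete, a minimal sketch is given below, assuming each ensemble member supplies an (N, C) array of class probabilities; `ensemble_probs` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

def ensemble_probs(prob_list, mode="harmonic", eps=1e-12):
    """Combine per-model class probabilities, each of shape (N, C).

    'arithmetic' averages probabilities directly; 'harmonic' averages their
    reciprocals and renormalizes, which down-weights classes that any single
    member considers unlikely.
    """
    P = np.stack(prob_list)  # (num_models, N, C)
    if mode == "arithmetic":
        combined = P.mean(axis=0)
    elif mode == "harmonic":
        combined = 1.0 / np.mean(1.0 / np.clip(P, eps, 1.0), axis=0)
    else:
        raise ValueError(mode)
    return combined / combined.sum(axis=-1, keepdims=True)

# Example: class predictions from a three-member ensemble.
# labels = ensemble_probs([p1, p2, p3], mode="harmonic").argmax(axis=-1)
```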

The consistent convergence of training runs from entirely distinct initializations onto a shared training manifold challenges existing theories around the depth of local minima and the supposed "roughness" of the loss landscape, suggesting instead that successful paths through the prediction space may be far fewer and more universally accessible than previously thought.

Conclusion

This paper's exploration of the low-dimensional manifolds of DNN training offers a fresh perspective on deep learning phenomena, proposing that the process is inherently more structured, and simpler, than commonly assumed. This shift in perspective calls for a reevaluation of prevailing theories in deep learning, particularly those concerning network training and generalization.
