
Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalisation

Published 9 Mar 2023 in cs.LG (arXiv:2303.05161v2)

Abstract: To achieve near-zero training error in a classification problem, the layers of a feed-forward network have to disentangle the manifolds of data points with different labels, to facilitate the discrimination. However, excessive class separation can lead to overfitting, since good generalisation requires learning invariant features, which involve some level of entanglement. We report on numerical experiments showing how the optimisation dynamics finds representations that balance these opposing tendencies with a non-monotonic trend. After a fast segregation phase, a slower rearrangement (conserved across data sets and architectures) increases the class entanglement. The training error at the inversion is stable under subsampling, and across network initialisations and optimisers, which characterises it as a property solely of the data structure and (very weakly) of the architecture. The inversion is the manifestation of tradeoffs elicited by well-defined and maximally stable elements of the training set, coined "stragglers", particularly influential for generalisation.


Summary

  • The paper demonstrates that class manifolds experience a non-monotonic inversion, transitioning from rapid segregation to partial re-entanglement during training.
  • It identifies 'stragglers': hard-to-learn outlier examples that strongly influence generalisation and reflect the intrinsic complexity of the data distribution.
  • Empirical results show that removing stragglers alters manifold geometry and test performance, emphasizing their role as markers of the bias-variance tradeoff.

Inversion Dynamics of Class Manifolds and Generalisation Tradeoffs in Deep Learning

Introduction

This paper, "Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalisation" (2303.05161), investigates the geometric evolution of class manifolds during supervised deep learning and provides a quantitative framework for understanding how neural networks balance class segregation (for discriminative power) against class entanglement (for invariance and generalisation). The authors conduct extensive numerical experiments, mainly with shallow fully-connected networks on MNIST and related datasets, to characterize the transient and asymptotic behavior of internal representations during training. They identify a non-monotonic dynamical regime, marked by an "inversion" point separating an initial segregation phase from a subsequent partial re-entanglement, and introduce the concept of "stragglers"—outlier examples whose late learning reflects and shapes the generalisation tradeoffs implemented by deep architectures.

Non-Monotonic Dynamics in Class Manifold Geometry

The authors analyze the temporal evolution of geometric quantities—specifically, the class radii R±(t) and the inter-class center distance D(t)—by projecting the activations of the penultimate layer onto the unit sphere. Initial training rapidly decreases intra-class dispersion and increases inter-class separation (segregation). However, as training continues, an inversion occurs: class manifolds expand and move closer, reversing the earlier segregation trend. This inversion is robust to optimisation details (Adam, SGD, batch size, weight decay, momentum) and persists across network initialisations and subsets of the training set.

Figure 1: Non-monotonic segregation and expansion dynamics of class manifolds, with robustness to optimisers, data sub-sampling, and data label structure; randomization of labels abolishes non-monotonicity.

The inversion phase vanishes when class labels are randomized, demonstrating that data structure, rather than mere label assignment, governs the phenomenon.
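The geometric observables above can be computed directly from a layer's activations. A minimal NumPy sketch, assuming binary labels ±1 and taking the class radius as the mean distance of the normalised points from their class centre (the paper's exact definitions may differ; this is an illustrative reconstruction, not the authors' released code):

```python
import numpy as np

def manifold_geometry(activations, labels):
    """Per-class radii R+ and R-, and inter-class centre distance D.

    activations: (n_samples, d) penultimate-layer activations.
    labels:      (n_samples,) binary labels, +1 / -1.

    Each activation is projected onto the unit sphere; a class radius is
    the mean distance of its points from the class centre, and D is the
    distance between the two class centres (illustrative definitions).
    """
    x = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    radii, centres = {}, {}
    for c in (+1, -1):
        pts = x[labels == c]
        centre = pts.mean(axis=0)
        centres[c] = centre
        radii[c] = np.mean(np.linalg.norm(pts - centre, axis=1))
    D = np.linalg.norm(centres[+1] - centres[-1])
    return radii[+1], radii[-1], D
```

Tracking these three numbers over training epochs reproduces the qualitative picture: R±(t) first shrink and D(t) grows (segregation), then the trends reverse at the inversion.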

Stragglers and the Tradeoff of Generalisation

A central contribution is the identification of "stragglers": those training examples that remain misclassified at the inversion epoch t*. The authors show that these examples are maximally stable across initialisations and largely invariant to architectural choices. Crucially, removing the stragglers from the dataset eliminates the inversion regime—the network exhibits monotonic segregation when trained on the pruned set, indicating that stragglers elicit the entanglement-expansion phase.

Figure 2: Pruning the stragglers from the dataset removes the inversion; stragglers are critical for test error and noise robustness.

Importantly, removing stragglers significantly worsens test generalisation, especially under moderate-to-low noise, but can improve robustness in extreme noise conditions. The subset of stragglers thus carries aspects of the data distribution that are needed to encode the invariances supporting generalisation—demonstrating a geometric realization of the bias-variance tradeoff.
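Operationally, stragglers are the training points still misclassified at the inversion epoch t*. A hypothetical sketch of this bookkeeping, assuming you keep a per-epoch record of training-set predictions, and using the peak of D(t) as an illustrative proxy for the inversion (the paper's precise criterion may differ):

```python
import numpy as np

def find_inversion_epoch(D):
    """Inversion epoch t*: taken here as the epoch at which the
    inter-class distance D(t) peaks, before re-entanglement sets in
    (an illustrative criterion, not the paper's exact definition)."""
    return int(np.argmax(D))

def find_stragglers(pred_history, labels, t_star):
    """Indices of training examples still misclassified at epoch t_star.

    pred_history: (n_epochs, n_samples) array of predicted labels.
    labels:       (n_samples,) array of true labels.
    """
    return np.flatnonzero(pred_history[t_star] != labels)
```

With this in hand, the pruning experiment amounts to retraining on the dataset with the returned indices removed and comparing the D(t) trajectory and test error.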

Generality Across Datasets and Architectures

The non-monotonic dynamics and role of stragglers are validated for multiple datasets (KMNIST, Fashion-MNIST, CIFAR-10) and for deeper networks. The fraction of stragglers saturates to a dataset-dependent constant in the large-data regime (e.g., φ∞ ≈ 11% for MNIST, 20% for KMNIST), with only weak dependence on depth, width, or choice of nonlinearity (tanh, ReLU, leaky ReLU, SiLU, linear).

Figure 3: Universality of non-monotonic expansion, straggler fractions, and dataset/architecture dependence across MNIST, KMNIST, Fashion-MNIST, and CIFAR-10.

Stragglers are found to occupy the periphery of class manifolds, often near class boundaries, analogous to support vectors in SVMs, but defined by dynamical learning behavior. These observations are highly reproducible and not attributable to idiosyncratic features of particular architectures or initialisations.
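The peripherality observation can be checked with a simple statistic: compare the average distance from the class centre for stragglers against the bulk. A small illustrative sketch (the function name and the ratio criterion are our assumptions, not quantities defined in the paper):

```python
import numpy as np

def peripherality(activations, labels, straggler_idx):
    """Ratio of mean distance-from-class-centre: stragglers vs bulk.

    Activations are normalised to the unit sphere, as in the paper's
    geometric analysis; a ratio > 1 indicates that stragglers sit, on
    average, farther out on the class manifold than the bulk.
    """
    x = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    dists = np.empty(len(x))
    for c in np.unique(labels):
        m = labels == c
        centre = x[m].mean(axis=0)
        dists[m] = np.linalg.norm(x[m] - centre, axis=1)
    mask = np.zeros(len(x), dtype=bool)
    mask[straggler_idx] = True
    return dists[mask].mean() / dists[~mask].mean()
```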

Implications and Theoretical Relevance

This study provides strong evidence that generalisation in deep networks is determined not only by the usual bias-variance tension but also by the dynamical geometry of class manifolds, as shaped by the interplay between the easy-to-learn bulk and the hard-to-learn outliers (stragglers). The inversion point marks the shift from rapid compression towards linearly separable representations to a phase that encodes the invariances needed for out-of-sample performance. The persistence of the straggler fraction and its weak dependence on hyperparameters suggest its potential utility as an intrinsic complexity marker for dataset-task pairs.

These insights have substantial implications:

  • Data Pruning and Curriculum Design: Since stragglers encode critical distributional structure, aggressive dataset pruning or misinformed curriculum policies may inadvertently eliminate these crucial examples, suppressing generalisation.
  • Understanding Memorization: The straggler phase demarcates the transition from feature learning to memorization.
  • Robustness and OOD Generalisation: Under severe input noise, removing stragglers can actually increase generalisation, opening the door for adaptive pruning strategies under known distribution shifts.
  • Theory: The robustness of the inversion and straggler phenomena motivates analytical study via deep linear network theory or information-theoretic approaches to understand the underlying mechanisms systematically.

Conclusion

The paper rigorously demonstrates that the geometric dynamics of class manifolds in deep learning exhibit a robust, data-structure-governed inversion separating a rapid initial segregation phase from a slower entanglement regime, the latter induced by a relatively small, highly conserved subset of "straggler" examples. These stragglers shape the network's ability to generalise by embedding invariant features, marking a dynamical trade-off that is intrinsic to the dataset and weakly dependent on hyperparameters or architecture. Future research could fruitfully probe the connections to information-theoretic regularisation, curriculum design, and the characteristics of challenging samples in the context of modern architectures and large-scale datasets.

