
Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalisation

Published 9 Mar 2023 in cs.LG (arXiv:2303.05161v2)

Abstract: To achieve near-zero training error in a classification problem, the layers of a feed-forward network have to disentangle the manifolds of data points with different labels, to facilitate the discrimination. However, excessive class separation can lead to overfitting, since good generalisation requires learning invariant features, which involve some level of entanglement. We report on numerical experiments showing how the optimisation dynamics finds representations that balance these opposing tendencies with a non-monotonic trend. After a fast segregation phase, a slower rearrangement (conserved across data sets and architectures) increases the class entanglement. The training error at the inversion is stable under subsampling, and across network initialisations and optimisers, which characterises it as a property solely of the data structure and (very weakly) of the architecture. The inversion is the manifestation of tradeoffs elicited by well-defined and maximally stable elements of the training set, coined "stragglers", particularly influential for generalisation.


Summary

  • The paper demonstrates that class manifolds experience a non-monotonic inversion, transitioning from rapid segregation to partial re-entanglement during training.
  • It identifies 'stragglers': hard-to-learn outlier examples that strongly influence generalisation and reflect the intrinsic complexity of the data distribution.
  • Empirical results show that removing stragglers alters manifold geometry and test performance, emphasizing their role as markers of the bias-variance tradeoff.

Inversion Dynamics of Class Manifolds and Generalisation Tradeoffs in Deep Learning

Introduction

This paper, "Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalisation" (2303.05161), investigates the geometric evolution of class manifolds during supervised deep learning and provides a quantitative framework for understanding how neural networks balance class segregation (for discriminative power) against class entanglement (for invariance and generalisation). The authors conduct extensive numerical experiments, mainly with shallow fully-connected networks on MNIST and related datasets, to characterize the transient and asymptotic behavior of internal representations during training. They identify a non-monotonic dynamical regime, marked by an "inversion" point separating an initial segregation phase from a subsequent partial re-entanglement, and introduce the concept of "stragglers"—outlier examples whose late learning reflects and shapes the generalisation tradeoffs implemented by deep architectures.

Non-Monotonic Dynamics in Class Manifold Geometry

The authors analyze the temporal evolution of geometric quantities—specifically, the class radii R±(t) and the inter-class center distance D(t)—by projecting the activations of the penultimate layer onto the unit sphere. Initial training rapidly decreases intra-class dispersion and increases inter-class separation (segregation). However, as training continues, an inversion occurs: class manifolds expand and move closer, reversing the earlier segregation trend. This inversion is robust to optimisation details (Adam, SGD, batch size, weight decay, momentum) and persists across network initialisations and subsets of the training set.

Figure 1: Non-monotonic segregation and expansion dynamics of class manifolds, with robustness to optimisers, data sub-sampling, and data label structure; randomization of labels abolishes non-monotonicity.

The inversion phase vanishes when class labels are randomized, demonstrating that data structure, rather than mere label assignment, governs the phenomenon.
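The geometric observables above can be computed directly from a layer's activations. A minimal NumPy sketch, assuming binary labels ±1 and taking the class radius as the mean distance of the normalised points from their class centre (the paper's exact definitions may differ; this is an illustrative reconstruction, not the authors' released code):

```python
import numpy as np

def manifold_geometry(activations, labels):
    """Per-class radii R+ and R-, and inter-class centre distance D.

    activations: (n_samples, d) penultimate-layer activations.
    labels:      (n_samples,) binary labels, +1 / -1.

    Each activation is projected onto the unit sphere; a class radius is
    the mean distance of its points from the class centre, and D is the
    distance between the two class centres (illustrative definitions).
    """
    x = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    radii, centres = {}, {}
    for c in (+1, -1):
        pts = x[labels == c]
        centre = pts.mean(axis=0)
        centres[c] = centre
        radii[c] = np.mean(np.linalg.norm(pts - centre, axis=1))
    D = np.linalg.norm(centres[+1] - centres[-1])
    return radii[+1], radii[-1], D
```

Tracking these three numbers over training epochs reproduces the qualitative picture: R±(t) first shrink and D(t) grows (segregation), then the trends reverse at the inversion.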

Stragglers and the Tradeoff of Generalisation

A central contribution is the identification of "stragglers": those training examples that remain misclassified at the inversion epoch t*. The authors show that these examples are maximally stable across initialisations and largely invariant to architectural choices. Crucially, removing the stragglers from the dataset eliminates the inversion regime—the network exhibits monotonic segregation when trained on the pruned set, indicating that stragglers elicit the entanglement-expansion phase.

Figure 2: Pruning the stragglers from the dataset removes the inversion; stragglers are critical for test error and noise robustness.

Importantly, removing stragglers significantly worsens test generalisation, especially under moderate-to-low noise, but can improve robustness in extreme noise conditions. The subset of stragglers thus carries aspects of the data distribution that are needed to encode the invariances supporting generalisation—demonstrating a geometric realization of the bias-variance tradeoff.
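Operationally, stragglers are the training points still misclassified at the inversion epoch t*. A hypothetical sketch of this bookkeeping, assuming you keep a per-epoch record of training-set predictions, and using the peak of D(t) as an illustrative proxy for the inversion (the paper's precise criterion may differ):

```python
import numpy as np

def find_inversion_epoch(D):
    """Inversion epoch t*: taken here as the epoch at which the
    inter-class distance D(t) peaks, before re-entanglement sets in
    (an illustrative criterion, not the paper's exact definition)."""
    return int(np.argmax(D))

def find_stragglers(pred_history, labels, t_star):
    """Indices of training examples still misclassified at epoch t_star.

    pred_history: (n_epochs, n_samples) array of predicted labels.
    labels:       (n_samples,) array of true labels.
    """
    return np.flatnonzero(pred_history[t_star] != labels)
```

With this in hand, the pruning experiment amounts to retraining on the dataset with the returned indices removed and comparing the D(t) trajectory and test error.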

Generality Across Datasets and Architectures

The non-monotonic dynamics and role of stragglers are validated for multiple datasets (KMNIST, Fashion-MNIST, CIFAR-10) and for deeper networks. The fraction of stragglers saturates to a dataset-dependent constant in the large-data regime (e.g., φ∞ ≈ 11% for MNIST, 20% for KMNIST), with only weak dependence on depth, width, or choice of nonlinearity (tanh, ReLU, leaky ReLU, SiLU, linear).

Figure 3: Universality of non-monotonic expansion, straggler fractions, and dataset/architecture dependence across MNIST, KMNIST, Fashion-MNIST, and CIFAR-10.

Stragglers are found to occupy the periphery of class manifolds, often near class boundaries, analogous to support vectors in SVMs, but defined by dynamical learning behavior. These observations are highly reproducible and not attributable to idiosyncratic features of particular architectures or initialisations.
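The peripherality observation can be checked with a simple statistic: compare the average distance from the class centre for stragglers against the bulk. A small illustrative sketch (the function name and the ratio criterion are our assumptions, not quantities defined in the paper):

```python
import numpy as np

def peripherality(activations, labels, straggler_idx):
    """Ratio of mean distance-from-class-centre: stragglers vs bulk.

    Activations are normalised to the unit sphere, as in the paper's
    geometric analysis; a ratio > 1 indicates that stragglers sit, on
    average, farther out on the class manifold than the bulk.
    """
    x = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    dists = np.empty(len(x))
    for c in np.unique(labels):
        m = labels == c
        centre = x[m].mean(axis=0)
        dists[m] = np.linalg.norm(x[m] - centre, axis=1)
    mask = np.zeros(len(x), dtype=bool)
    mask[straggler_idx] = True
    return dists[mask].mean() / dists[~mask].mean()
```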

Implications and Theoretical Relevance

This study provides strong evidence that generalisation in deep networks is determined not only by the usual bias-variance tension but also by the dynamical geometry of class manifolds, as shaped by the interplay between the easy-to-learn bulk and the hard-to-learn outliers (stragglers). The inversion point marks the shift from rapid compression towards linearly separable representations to a phase that encodes the invariances needed for out-of-sample performance. The persistence of the straggler fraction and its weak dependence on hyperparameters suggest its potential utility as an intrinsic complexity marker for dataset-task pairs.

These insights have substantial implications:

  • Data Pruning and Curriculum Design: Since stragglers encode critical distributional structure, aggressive dataset pruning or misinformed curriculum policies may inadvertently eliminate these crucial examples, suppressing generalisation.
  • Understanding Memorization: The straggler phase demarcates the transition from feature learning to memorization.
  • Robustness and OOD Generalisation: Under severe input noise, removing stragglers can actually increase generalisation, opening the door for adaptive pruning strategies under known distribution shifts.
  • Theory: The robustness of the inversion and straggler phenomena motivates analytical study via deep linear network theory or information-theoretic approaches to understand the underlying mechanisms systematically.

Conclusion

The paper rigorously demonstrates that the geometric dynamics of class manifolds in deep learning exhibit a robust, data-structure-governed inversion separating a rapid initial segregation phase from a slower entanglement regime, the latter induced by a relatively small, highly conserved subset of "straggler" examples. These stragglers shape the network's ability to generalise by embedding invariant features, marking a dynamical trade-off that is intrinsic to the dataset and weakly dependent on hyperparameters or architecture. Future research could fruitfully probe the connections to information-theoretic regularisation, curriculum design, and the characteristics of challenging samples in the context of modern architectures and large-scale datasets.

