- The paper introduces a scalable method based on SGLD for estimating the local learning coefficient, λ, in deep neural network models.
- It validates the estimator against low-dimensional models with known λ values, and shows that entropy-SGD finds flatter, more degenerate minima with lower λ estimates than standard SGD.
- The study links model degeneracy with generalization performance, offering actionable insights for model selection and optimization in deep learning.
Essay on "Quantifying Degeneracy in Singular Models via the Learning Coefficient"
The paper "Quantifying degeneracy in singular models via the learning coefficient" by Lau, Murfet, and Wei, extends the theoretical underpinnings of Singular Learning Theory (SLT) to address practical challenges in deep learning, focusing on quantifying the complex degeneracies that arise within models like deep neural networks (DNNs). It provides several robust theoretical and empirical insights into the nature of these degeneracies, primarily through the lens of the learning coefficient, λ, previously introduced in SLT.
Key Contributions and Methodological Advances
The authors begin by acknowledging a pervasive characteristic of most machine learning models, particularly neural networks: their singularity. In such models, the map from parameters to functions is not one-to-one; there are regions of parameter space, of measure zero yet highly influential for learning dynamics, where parameter changes leave the model's output unaltered and the Fisher information matrix degenerates. SLT describes this phenomenon chiefly through two quantities: the learning coefficient (λ) and the singular fluctuation (ν), which capture model complexity and functional diversity respectively.
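For readers less familiar with SLT, the role of λ can be summarized by two standard asymptotic results from Watanabe's theory, stated here in simplified form for the realizable case (notation assumed: n is the sample size, L_n the empirical negative log-likelihood, w_0 a true parameter, m the multiplicity of the RLCT):

```latex
% Free-energy expansion and expected Bayes generalization error,
% with lower-order terms omitted.
\[
F_n = n L_n(w_0) + \lambda \log n - (m - 1)\log\log n + O_p(1),
\qquad
\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\tfrac{1}{n}\right).
\]
```

For regular models λ reduces to d/2 (half the parameter count), recovering the familiar BIC penalty; in singular models λ can be much smaller, which is why it serves as a measure of effective complexity.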
A central object in the paper's theoretical toolkit is the real log canonical threshold (RLCT), a birational invariant from algebraic geometry that coincides with λ and provides a rigorous measure of model complexity. Determining λ and ν analytically has traditionally been possible only for regular models and a handful of simple singular model classes. The authors address this limitation by introducing a computationally feasible method for estimating a localized λ using stochastic gradient Langevin dynamics (SGLD). This approach not only matches theoretical expectations in low-dimensional models with known λ values, but also scales to the high-dimensional, parameter-rich landscapes typical of deep neural networks.
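To make the approach concrete, here is a minimal sketch of localized SGLD sampling in the spirit of the paper; the function name `grad_loss`, the Gaussian localization strength `gamma`, the step size `epsilon`, and the default inverse temperature are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sgld_local_samples(w_star, grad_loss, n, num_steps=5000, burn_in=1000,
                       epsilon=1e-5, gamma=100.0, beta=None, rng=None):
    """Sample near w_star from a localized, tempered posterior with SGLD.

    Assumptions (illustrative sketch):
      - grad_loss(w) returns an estimate of the gradient of the average
        negative log-likelihood L_n at w (e.g. from a minibatch).
      - The target density is proportional to
        exp(-n * beta * L_n(w) - gamma/2 * ||w - w_star||^2),
        so a Gaussian localization term keeps the chain near w_star.
      - beta defaults to the WBIC inverse temperature 1 / log(n).
    """
    rng = np.random.default_rng() if rng is None else rng
    beta = 1.0 / np.log(n) if beta is None else beta
    w = w_star.copy()
    samples = []
    for step in range(num_steps):
        # Gradient of the log target: tempered likelihood term plus localizing prior.
        drift = -n * beta * grad_loss(w) - gamma * (w - w_star)
        # Langevin update: half-step along the drift plus Gaussian noise of variance epsilon.
        w = w + 0.5 * epsilon * drift + np.sqrt(epsilon) * rng.standard_normal(w.shape)
        if step >= burn_in:
            samples.append(w.copy())
    return np.array(samples)
```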
Empirical Evidence
The paper presents empirical evidence using both synthetic data and real-world datasets such as MNIST. The analyses of low-dimensional synthetic models, where the KL divergence surface is available in closed form, show that the estimates of the local learning coefficient respect the known theoretical orderings. These estimates are obtained in a Bayesian framework by using SGLD to sample from localized, tempered posteriors at the inverse temperature β* = 1/log n suggested by the widely applicable Bayesian information criterion (WBIC).
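Under the same assumptions as the sketch above, the WBIC-style identity suggests an estimator of the form λ̂(w*) = n β* (E_β*[L_n(w)] − L_n(w*)) with β* = 1/log n; a hedged sketch building on `sgld_local_samples`:

```python
def estimate_local_lambda(w_star, loss_fn, grad_loss, n, **sgld_kwargs):
    """WBIC-style estimate of the local learning coefficient at w_star.

    Illustrative sketch (not the authors' code):
      lambda_hat = n * beta_star * (E_beta[L_n(w)] - L_n(w_star)),
    with beta_star = 1 / log(n) and the expectation taken over SGLD samples
    drawn from the localized, tempered posterior centered at w_star.
    loss_fn(w) is assumed to return the average negative log-likelihood L_n(w).
    """
    beta_star = 1.0 / np.log(n)
    samples = sgld_local_samples(w_star, grad_loss, n, beta=beta_star, **sgld_kwargs)
    expected_loss = np.mean([loss_fn(w) for w in samples])
    return n * beta_star * (expected_loss - loss_fn(w_star))
```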
Applying these techniques to deep neural networks trained on MNIST, the authors examine how different optimization algorithms (SGD vs. entropy-SGD) lead to solutions with different estimated λ values. Crucially, they demonstrate that entropy-SGD, which is designed to seek out flatter, more degenerate minima, yields solutions with lower λ estimates than standard SGD. This supports the conjecture that degeneracy is linked to the mechanisms behind generalization in neural networks.
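A hypothetical usage of the sketch above illustrates this comparison; `w_sgd`, `w_entropy_sgd`, `loss_fn`, and `grad_loss` are placeholders for trained MNIST parameters and loss wrappers, not code or numbers from the paper.

```python
# Hypothetical comparison of local complexity at two optimizers' solutions.
# n=60_000 is the usual MNIST training-set size, used here purely for illustration.
lambda_sgd = estimate_local_lambda(w_sgd, loss_fn, grad_loss, n=60_000)
lambda_entropy = estimate_local_lambda(w_entropy_sgd, loss_fn, grad_loss, n=60_000)
print(f"lambda_hat(SGD) = {lambda_sgd:.2f}, lambda_hat(entropy-SGD) = {lambda_entropy:.2f}")
# The paper's finding corresponds to lambda_entropy < lambda_sgd.
```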
Theoretical and Practical Implications
The work engages with SLT's results on the asymptotic behavior of generalization error under singularity-induced model degeneracy. It reinforces the importance of the learning coefficient λ in model selection and benchmarking, particularly when the regularity assumptions behind classical criteria such as BIC are violated. The implications of correctly estimating λ extend beyond theoretical interest, pointing to broader impacts on understanding optimization-induced biases and the dynamics of critical points, which the authors posit may be intrinsic to the observed success of deep learning architectures.
Moreover, the paper suggests potential routes for future work, including leveraging λ and ν for detecting phase transitions in learning processes, which may elucidate emergent abilities in neural architectures. The recognition that degeneracy contributes critically to model behavior underscores the direction deep learning research might take, focusing on nuanced model evaluations that account for geometric singularities.
Conclusion
The paper positions the learning coefficient as a pivotal metric for quantifying degeneracy in singular models, a class that includes deep networks and extends well beyond them. By advancing a scalable computational method for approximating λ, it opens avenues for reframing classical statistical perspectives on model complexity through the paradigms of SLT, marking substantial progress in understanding and leveraging model singularities. With its solid theoretical foundations and empirical support, the work represents a methodological step forward in examining machine learning models through the intricate lens of degeneracy and complexity.