Neural Networks Learn Statistics of Increasing Complexity (2402.04362v3)

Published 6 Feb 2024 in cs.LG

Abstract: The distributional simplicity bias (DSB) posits that neural networks learn low-order moments of the data distribution first, before moving on to higher-order correlations. In this work, we present compelling new evidence for the DSB by showing that networks automatically learn to perform well on maximum-entropy distributions whose low-order statistics match those of the training set early in training, then lose this ability later. We also extend the DSB to discrete domains by proving an equivalence between token $n$-gram frequencies and the moments of embedding vectors, and by finding empirical evidence for the bias in LLMs. Finally we use optimal transport methods to surgically edit the low-order statistics of one class to match those of another, and show that early-training networks treat the edited samples as if they were drawn from the target class. Code is available at https://github.com/EleutherAI/features-across-time.

Authors (5)
  1. Nora Belrose
  2. Quintin Pope
  3. Lucia Quirke
  4. Alex Mallen
  5. Xiaoli Fern

Summary

Introduction

Neural networks are known for their exceptional ability to fit highly complex data while still generalizing beyond the training set. This adaptability is especially puzzling given that the same networks can fit even noisy or random labels. One proposed explanation is the distributional simplicity bias (DSB): networks learn simple patterns, i.e. the low-order moments of the data distribution, before they capture more intricate higher-order correlations. This paper extends the concept, exploring how the DSB manifests across data domains and across different phases of training.

Theory and Methods

The researchers use a Taylor series expansion of the expected loss to connect the moments of the data distribution to the loss a network incurs. If the expected loss is well approximated by the first few terms of this series, the network is sensitive to the data's moments only up to that order; a schematic version of the expansion is given below. Two criteria are proposed for testing whether a model relies only on low-order moments: editing one class's low-order statistics to match another's should change the model's predictions accordingly, and tampering with higher-order statistics should not significantly degrade performance.
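Concretely, in our notation (which may differ from the paper's precise statement and convergence conditions), write $g(x) = \mathcal{L}(f(x))$ for the per-sample loss of network $f$ and Taylor-expand around the data mean $\mu$. Taking expectations, the first-order term vanishes and each surviving term pairs a derivative tensor of $g$ with a central moment of the data:

$$\mathbb{E}_{x \sim p}\left[g(x)\right] = g(\mu) + \tfrac{1}{2}\,\mathrm{tr}\!\left(\nabla^2 g(\mu)\,\Sigma\right) + \sum_{k \geq 3} \tfrac{1}{k!}\,\left\langle \nabla^k g(\mu),\, M_k \right\rangle,$$

where $\Sigma$ is the covariance of $p$ and $M_k$ its $k$-th central moment tensor. Truncating after order $k$ yields an approximation that depends on the data distribution only through its first $k$ moments, which is the sense in which an early-training network can be blind to higher-order structure.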

Empirical Findings

The theory is then tested empirically across a range of network architectures and datasets, including image datasets modified so that their statistical structure is dominated by first and second moments. Early in training, network performance tracks modifications to low-order statistics closely; this sensitivity lessens as training progresses and models gradually come to depend on higher-order statistics, indicating that the DSB is a dynamic property of the learning process. Moreover, altering the low-order statistics of images from one class to match those of another (see the sketch below) led early-training networks to classify the edited images into the target class, satisfying the first criterion above.
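Under Gaussian approximations of two classes, the optimal transport map that matches first and second moments is affine and has a closed form. The following is a minimal sketch of that map; the paper's actual editing pipeline lives in the linked repository and may differ, e.g. in how degenerate covariances are regularized:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(mu_src, cov_src, mu_tgt, cov_tgt):
    # Closed-form affine OT map sending N(mu_src, cov_src) to N(mu_tgt, cov_tgt):
    #   T(x) = mu_tgt + A (x - mu_src), with
    #   A = cov_src^{-1/2} (cov_src^{1/2} cov_tgt cov_src^{1/2})^{1/2} cov_src^{-1/2}
    src_half = np.real(sqrtm(cov_src))
    src_half_inv = np.linalg.inv(src_half)
    middle = np.real(sqrtm(src_half @ cov_tgt @ src_half))
    A = src_half_inv @ middle @ src_half_inv
    return lambda x: mu_tgt + (x - mu_src) @ A.T

def edit_low_order_stats(x_src, x_tgt):
    """Return source-class samples whose mean/covariance match the target class.

    x_src, x_tgt: (n, d) arrays of flattened images from the two classes.
    """
    mu_s, cov_s = x_src.mean(axis=0), np.cov(x_src, rowvar=False)
    mu_t, cov_t = x_tgt.mean(axis=0), np.cov(x_tgt, rowvar=False)
    transport = gaussian_ot_map(mu_s, cov_s, mu_t, cov_t)
    return transport(x_src)
```

Applied to flattened source-class images, this produces samples whose mean and covariance match the target class while higher-order structure is largely preserved, which is exactly the kind of intervention the first criterion requires.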

Extension to Discrete Domains

A particularly intriguing aspect of the work is its extension to discrete domains, namely language, via a proven equivalence between token $n$-gram frequencies and the moments of embedding vectors. Examining LLMs, the authors find evidence for the DSB together with a "double descent" pattern: loss on sequences matching low-order $n$-gram statistics first falls, then rises as the models move beyond those statistics, and finally declines again late in training. They attribute the second descent to in-context learning, the phenomenon whereby models exploit the recent context of a sequence to make predictions. A sketch of this kind of checkpoint evaluation follows.
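One way to observe this trajectory is to evaluate a model's loss on sequences sampled from a low-order $n$-gram model across training checkpoints; the Pythia suite publishes per-step revisions on the Hugging Face Hub, which makes this convenient. The sketch below is our illustration, not the paper's exact protocol, and the stand-in `bigram_samples` would in practice be drawn from an empirical bigram model of the training corpus:

```python
import torch
from transformers import AutoModelForCausalLM

MODEL = "EleutherAI/pythia-160m"  # Pythia exposes checkpoints as revisions

@torch.no_grad()
def mean_loss(model, token_ids):
    # HF causal LMs shift labels internally, so passing labels=input_ids
    # yields the standard next-token cross-entropy.
    return model(token_ids, labels=token_ids).loss.item()

# Stand-in batch of token ids; real samples would be drawn from bigram
# counts estimated on the training corpus (e.g. the Pile).
bigram_samples = torch.randint(0, 50254, (8, 128))

for step in [0, 512, 1000, 10_000, 143_000]:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}")
    model.eval()
    print(f"step {step}: bigram loss = {mean_loss(model, bigram_samples):.3f}")
```

Plotting this loss against the training step is what exposes the fall-rise-fall shape described above.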

Conclusion

Altogether, this research provides robust empirical backing for the DSB conjecture while breaking new ground in understanding the statistical learning dynamics of neural networks. It documents a sequential learning of statistical complexity in models and the interplay of that learning with model architecture and training time. These insights could inform methods for deliberately shaping the learning trajectories of artificial intelligence systems toward desired generalization behavior.