How Spurious Features Are Memorized: Precise Analysis for Random and NTK Features
Abstract: Deep learning models are known to overfit and memorize spurious features in the training dataset. While numerous empirical studies have aimed at understanding this phenomenon, a rigorous theoretical framework to quantify it is still missing. In this paper, we consider spurious features that are uncorrelated with the learning task, and we provide a precise characterization of how they are memorized via two separate terms: (i) the stability of the model with respect to individual training samples, and (ii) the feature alignment between the spurious feature and the full sample. While the first term is well established in learning theory and connected to the generalization error by classical work, the second one is, to the best of our knowledge, novel. Our key technical result gives a precise characterization of the feature alignment for the two prototypical settings of random features (RF) and neural tangent kernel (NTK) regression. We prove that the memorization of spurious features weakens as the generalization capability increases and, through the analysis of the feature alignment, we unveil the role of the model and of its activation function. Numerical experiments show the predictive power of our theory on standard datasets (MNIST, CIFAR-10).
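To make the setting concrete, here is a minimal sketch of random features (RF) regression with a spurious feature, uncorrelated with the labels, injected into a single training sample. The synthetic data, the ReLU feature map, and the output-gap probe below are illustrative assumptions, not the paper's exact experimental protocol or memorization measure.

```python
# Minimal sketch (assumptions, not the paper's protocol): RF regression with a
# spurious direction added to one training sample. We probe memorization by the
# gap between the fitted model's output on the corrupted sample with and without
# the spurious component.
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 64, 2048, 200          # input dim, number of random features, samples

# Synthetic data: labels depend only on the first coordinate (the "task").
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.sign(X[:, 0])

# Spurious feature: a random direction with zero task component, hence
# uncorrelated with the labels; it is added to a single training sample.
v = rng.standard_normal(d)
v[0] = 0.0
v /= np.linalg.norm(v)
idx = 0
X_spur = X.copy()
X_spur[idx] += v / np.sqrt(d)

# RF map: phi(x) = relu(W x) with fixed Gaussian weights W.
W = rng.standard_normal((N, d)) / np.sqrt(d)
relu = lambda z: np.maximum(z, 0.0)
Phi = relu(X_spur @ W.T)

# Ridge regression on the random features (small lambda ~ near-interpolation).
lam = 1e-6
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
f = lambda x: relu(x @ W.T) @ theta

# Memorization probe: how much does the trained model's prediction on the
# corrupted sample change when the spurious component is removed?
gap = f(X_spur[idx]) - f(X[idx])
print(f"output gap due to spurious feature: {gap:.4f}")
```

In this sketch, a large gap means the fitted RF model responds to the spurious direction on the corrupted sample rather than ignoring it; the paper's theory accounts for such effects through the stability and feature-alignment terms.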