Simplicity Bias of Transformers to Learn Low Sensitivity Functions (2403.06925v1)
Abstract: Transformers achieve state-of-the-art accuracy and robustness across many tasks, but an understanding of their inductive biases, and of how those biases differ from other neural network architectures, remains elusive. Various neural network architectures, such as fully connected networks, have been found to have a simplicity bias towards learning simple functions of the data; one version of this simplicity bias is a spectral bias towards learning simple functions in Fourier space. In this work, we identify the sensitivity of the model to random changes in the input as a notion of simplicity bias that provides a unified metric for explaining the simplicity and spectral bias of transformers across different data modalities. We show that transformers have lower sensitivity than alternative architectures, such as LSTMs, MLPs, and CNNs, on both vision and language tasks. We also show that the low-sensitivity bias correlates with improved robustness; furthermore, it can be used as an efficient intervention to further improve the robustness of transformers.
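The sensitivity measure described in the abstract can be approximated empirically by perturbing a small random fraction of the input and recording how much the model's output changes. The sketch below is a minimal Monte Carlo estimate of this idea; the function name, the Gaussian-replacement perturbation, and the `flip_prob` parameter are illustrative assumptions, not the paper's exact definition.

```python
import torch

def estimate_sensitivity(model, x, num_samples=16, flip_prob=0.05):
    """Rough Monte Carlo estimate of a model's sensitivity: the average
    change in output when a small random fraction of input coordinates
    is resampled. `model` maps a batch of inputs to logits."""
    model.eval()
    with torch.no_grad():
        base = model(x)                       # reference outputs (e.g. logits)
        total = 0.0
        for _ in range(num_samples):
            # choose a random subset of coordinates to perturb
            mask = (torch.rand_like(x) < flip_prob).float()
            noise = torch.randn_like(x)       # random replacement values
            x_pert = x * (1.0 - mask) + noise * mask
            pert = model(x_pert)
            # average output change for this perturbation draw
            total += (pert - base).norm(dim=-1).mean().item()
    return total / num_samples
```

Under this sketch, a lower returned value indicates a lower-sensitivity (and, per the paper's thesis, simpler and more robust) function of the input.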