Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions (2211.12316v2)
Abstract: Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions, which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite their relatively limited expressiveness.
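For readers unfamiliar with the sensitivity measure the abstract refers to, the sketch below (an illustration, not code from the paper) computes the average sensitivity of a Boolean function: the expected number of input bits whose flip changes the output, averaged over all inputs. A sparse function that depends on only k of the n bits has average sensitivity at most k, while parity over all n bits attains the maximum value n.

```python
from itertools import product

def average_sensitivity(f, n):
    """Average, over all x in {0,1}^n, of the number of coordinates i
    such that flipping x_i changes f(x)."""
    total = 0
    for bits in product([0, 1], repeat=n):
        x = list(bits)
        for i in range(n):
            y = x.copy()
            y[i] ^= 1          # flip the i-th bit
            if f(x) != f(y):
                total += 1
    return total / 2 ** n

n = 6
parity_all = lambda x: sum(x) % 2            # depends on all n bits
sparse_parity = lambda x: (x[0] + x[1]) % 2  # depends on only 2 bits

print(average_sensitivity(parity_all, n))    # 6.0 -> maximal sensitivity
print(average_sensitivity(sparse_parity, n)) # 2.0 -> low sensitivity (sparse)
```

In this sense, "low sensitivity" formalizes the notion of simplicity used throughout the paper: functions whose outputs are stable under single-bit perturbations of the input.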
Authors: Satwik Bhattamishra, Arkil Patel, Varun Kanade, Phil Blunsom