Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions (2211.12316v2)

Published 22 Nov 2022 in cs.LG and cs.CL

Abstract: Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence that suggests differences in the inductive biases of Transformers and recurrent models which may help explain Transformer's effective generalization performance despite relatively limited expressiveness.

Simplicity Bias in Transformers and Their Ability to Learn Sparse Boolean Functions

The paper investigates the inductive biases of Transformers, particularly their propensity toward learning functions of low sensitivity, and contrasts this with recurrent neural networks (RNNs) such as LSTMs. The empirical analysis aims to explain why Transformers perform well in practice despite theoretical limitations on their expressiveness relative to recurrent models.

Key Findings

  1. Simplicity Bias of Random Transformers: The paper shows that randomly initialized Transformer models are more likely to represent low-sensitivity functions compared to LSTMs. This bias is observed regardless of whether weights are initialized uniformly or using other common strategies like Gaussian or Xavier initialization.
  2. Training Dynamics and Sensitivity: During training on Boolean functions, both Transformers and LSTMs tend to initially learn functions of lower sensitivity. However, after achieving near-zero training error, Transformers converge to solutions with significantly lower sensitivity than their recurrent counterparts.
  3. Robustness in Learning Sparse Boolean Functions: The paper finds that Transformers are notably effective at generalizing on sparse Boolean functions, such as sparse parities, even in the presence of noisy labels. In contrast, LSTMs tend to overfit, achieving perfect training accuracy while failing to generalize to held-out test sets for these functions.
  4. Relationship Between Sensitivity and Other Complexity Measures: The paper correlates sensitivity with other complexity measures such as Sum of Products (SOP) size and entropy, concluding that sensitivity aligns well with them and can serve as a tractable proxy for function complexity (a short sketch of how average sensitivity is computed follows this list).
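
To make the sensitivity measure concrete, below is a minimal sketch (not the paper's code) of how average sensitivity can be computed by brute force for small input lengths; the helper names `average_sensitivity` and `sparse_parity` are introduced here for illustration.

```python
import itertools

def average_sensitivity(f, n):
    """Average sensitivity of a Boolean function f: {0,1}^n -> {0,1}.

    For each input x, count the coordinates whose flip changes f(x),
    then average over all 2^n inputs. Exhaustive enumeration, so this
    is only feasible for small n.
    """
    total = 0
    for x in itertools.product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n

def sparse_parity(x):
    # A 3-sparse parity on 10 bits: the output depends on only 3 coordinates.
    return x[0] ^ x[3] ^ x[7]

# Every input has sensitivity exactly 3 (only the 3 relevant bits matter),
# so the average is 3.0, far below the maximum possible value of n = 10.
print(average_sensitivity(sparse_parity, 10))  # 3.0
```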

Implications and Future Directions

The findings about Transformers' bias towards low-sensitivity functions suggest an alignment between their architecture and many practical tasks, where the relevant pattern typically depends on a small set of sparse or low-complexity features rather than on dense interactions across a large feature space.
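
As a concrete illustration of such a sparse target, the sketch below generates the kind of noisy sparse-parity data referred to in the findings above; the input length, sparsity, noise rate, and sampling procedure are illustrative assumptions rather than the paper's exact experimental settings.

```python
import numpy as np

def sparse_parity_dataset(num_samples, n=20, k=3, noise_rate=0.05, seed=0):
    """Sample (x, y) pairs where y is the parity of k fixed coordinates
    of an n-bit input, with a fraction of labels flipped at random.

    The values of n, k, and noise_rate are illustrative choices only.
    """
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(num_samples, n))    # uniform random bit strings
    support = rng.choice(n, size=k, replace=False)   # hidden relevant coordinates
    y = X[:, support].sum(axis=1) % 2                # k-sparse parity labels
    flip = rng.random(num_samples) < noise_rate      # label-noise mask
    y = np.where(flip, 1 - y, y)
    return X, y

X_train, y_train = sparse_parity_dataset(10_000)
```

Because the label depends on only k of the n input positions, such a target has low sensitivity, which is exactly the regime in which the paper reports Transformers generalizing well while LSTMs overfit.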

Future research could investigate the mechanisms that allow Transformers to avoid overfitting despite their large parameter counts. It could also examine how these biases can be leveraged, or mitigated, in settings where high-sensitivity or more complex target functions must be learned.

The paper also motivates hybrid architectures that combine the strengths of Transformers and LSTMs, potentially offering stronger performance across a broader range of tasks. Studying the practical consequences of the identified inductive biases could likewise inform the design of models for specific domains such as natural language processing, where the relevant features are often sparse yet nuanced.

Overall, while the paper substantiates some known properties of Transformers, it opens avenues for integrating these insights into the design of future AI systems that balance complexity with generalization. This could be particularly transformative in fields like NLP, where complexity management is crucial for effective model deployment.

Authors (4)
  1. Satwik Bhattamishra
  2. Arkil Patel
  3. Varun Kanade
  4. Phil Blunsom