Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions (2211.12316v2)
Abstract: Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions, which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite their relatively limited expressiveness.
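For readers unfamiliar with the sensitivity measure the abstract refers to, the sketch below (an illustration, not code from the paper) computes the average sensitivity of a Boolean function: the expected number of input bits whose flip changes the output, averaged over all inputs. A sparse function that depends on only k of the n bits has average sensitivity at most k, while parity over all n bits attains the maximum value n.

```python
from itertools import product

def average_sensitivity(f, n):
    """Average, over all x in {0,1}^n, of the number of coordinates i
    such that flipping x_i changes f(x)."""
    total = 0
    for bits in product([0, 1], repeat=n):
        x = list(bits)
        for i in range(n):
            y = x.copy()
            y[i] ^= 1          # flip the i-th bit
            if f(x) != f(y):
                total += 1
    return total / 2 ** n

n = 6
parity_all = lambda x: sum(x) % 2            # depends on all n bits
sparse_parity = lambda x: (x[0] + x[1]) % 2  # depends on only 2 bits

print(average_sensitivity(parity_all, n))    # 6.0 -> maximal sensitivity
print(average_sensitivity(sparse_parity, n)) # 2.0 -> low sensitivity (sparse)
```

In this sense, "low sensitivity" formalizes the notion of simplicity used throughout the paper: functions whose outputs are stable under single-bit perturbations of the input.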
Authors: Satwik Bhattamishra, Arkil Patel, Varun Kanade, Phil Blunsom