An Analysis of Attention Mechanisms in BERT: Insights and Implications
The paper "What Does BERT Look At? An Analysis of BERT's Attention," authored by Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning, presents a rigorous examination of the attention mechanisms within the BERT model. This work investigates the attention heads in BERT to understand the linguistic features they learn during pre-training.
Primary Contributions
The paper delivers several key insights into the behavior of BERT's attention heads:
- Common Patterns in Attention Heads: The authors observe that BERT's attention heads frequently exhibit common behavior patterns, such as attending to fixed positional offsets (e.g., the previous or next token) or spreading attention broadly over the entire sentence. Heads in the same layer often behave similarly, and a substantial portion of attention is directed at special tokens such as [SEP] and [CLS] and at punctuation.
- Correlation with Linguistic Phenomena: The paper demonstrates that certain attention heads correlate well with specific linguistic features. Heads are identified that consistently focus on direct objects of verbs, determiners of nouns, and coreferent mentions, showing surprisingly high accuracy.
- Attention-Based Probing Classifier: An attention-based probing classifier is proposed, which further demonstrates that BERT's attention heads capture substantial syntactic information. This classifier reaches 77 Unlabeled Attachment Score (UAS) on dependency parsing. (A sketch of how the underlying attention maps can be extracted follows this list.)
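All of the analyses below start from BERT's per-head attention maps. The paper works with the original BERT implementation; purely as an illustration, the following minimal sketch extracts comparable maps with the HuggingFace transformers library (the library choice and example sentence are assumptions of this summary, not part of the paper):

```python
import torch
from transformers import BertTokenizer, BertModel

# Load BERT-base and ask it to return per-head attention weights.
# (Library choice is an assumption of this summary; the paper uses the original BERT code.)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "The quick brown fox jumped over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# [batch, n_heads, seq_len, seq_len]; BERT-base has 12 layers x 12 heads.
attentions = torch.stack(outputs.attentions).squeeze(1)  # [12, 12, seq_len, seq_len]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(attentions.shape, tokens)
```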
Detailed Findings
Surface-Level Patterns
The authors explore surface-level patterns by analyzing BERT's attention heads over 1000 random Wikipedia segments. They find that many heads attend predominantly to delimiter tokens like [SEP] and [CLS], suggesting that these tokens might be used as no-op placeholders when the head's primary function is not applicable. Moreover, they observe heads that broadly attend over the whole sentence, especially in the lower layers of the network, indicating an initial 'bag-of-words' type representation.
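As a toy illustration of these surface-level measurements, the sketch below computes, for each head, the fraction of attention mass landing on [SEP] and the average entropy of its attention distributions. It operates on a single segment, whereas the paper averages such statistics over the 1000 Wikipedia segments; the `attentions` and `tokens` variables are assumed to come from the extraction sketch above.

```python
import torch

# attentions: [n_layers, n_heads, seq_len, seq_len]; tokens: list of word pieces
# (both assumed to come from the extraction sketch above).
sep_positions = [i for i, t in enumerate(tokens) if t == "[SEP]"]

# Fraction of each head's attention mass that lands on [SEP], averaged over source tokens.
attn_to_sep = attentions[:, :, :, sep_positions].sum(dim=-1).mean(dim=-1)  # [12, 12]

# Average entropy of each head's attention distributions: high entropy means broad,
# bag-of-words-like attention; low entropy means sharply focused attention.
entropy = -(attentions * (attentions + 1e-12).log()).sum(dim=-1).mean(dim=-1)  # [12, 12]

for layer in range(attentions.size(0)):
    print(f"layer {layer + 1}: "
          f"mean attn to [SEP] = {attn_to_sep[layer].mean():.2f}, "
          f"mean entropy = {entropy[layer].mean():.2f}")
```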
Specific Linguistic Features
By probing individual attention heads, the paper finds that BERT captures various syntactic relations without explicit supervision. Heads are identified that excel at finding objects of prepositions, determiners of nouns, and other specific syntactic dependencies. For instance, head 8-10 finds the direct objects of verbs with 86.8% accuracy. Coreference similarly shows up in BERT's attention: head 5-4 attends from coreferent mentions toward their antecedents, performing comparably to a rule-based coreference system.
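One way to evaluate a single head as a predictor of a particular dependency relation is to check whether the word it attends to most strongly is the gold syntactic head. The sketch below is a simplified, hypothetical version of that evaluation: it ignores the paper's word-piece aggregation and offset baselines, and the `gold_relations` format is invented for illustration.

```python
import torch

def head_relation_accuracy(attentions, layer, head, gold_relations):
    """Accuracy of a single attention head at predicting one dependency relation.

    attentions:     [n_layers, n_heads, seq_len, seq_len] attention maps
    layer, head:    which head to evaluate (0-indexed here; the paper's "8-10" is 1-indexed)
    gold_relations: list of (dependent_idx, gold_head_idx) token-index pairs for the
                    relation of interest, e.g. direct objects paired with their verbs
                    (a hypothetical input format for illustration)
    """
    att = attentions[layer, head]  # [seq_len, seq_len]
    correct = 0
    for dep_idx, gold_head_idx in gold_relations:
        predicted = att[dep_idx].argmax().item()  # token the dependent attends to most
        correct += int(predicted == gold_head_idx)
    return correct / max(len(gold_relations), 1)

# For example, the paper's head 8-10 corresponds to attentions[7, 9] in 0-indexed terms:
# acc = head_relation_accuracy(attentions, layer=7, head=9, gold_relations=dobj_pairs)
```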
Attention Head Combinations
Recognizing that syntax-related knowledge is distributed across multiple heads, the paper proposes a family of attention-based probing classifiers. The classifier that combines BERT's attention maps with GloVe embeddings achieves 77 UAS, underlining that BERT's attention collectively contains extensive syntactic information. Its performance is consistent with results from the structural probe of Hewitt and Manning (2019), indicating that the syntactic information in BERT's attention maps agrees with what is recoverable from its vector representations.
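To give a sense of how such a probe combines heads, here is a much simplified, attention-only sketch (the class and variable names are invented, and it omits the GloVe component of the paper's best variant): each word scores every other word as its candidate syntactic head through a learned weighting of all 144 attention maps, in both directions.

```python
import torch
import torch.nn as nn

class SimpleAttentionProbe(nn.Module):
    """Simplified attention-only probing classifier (a sketch, not the paper's exact model).

    For each word i, it scores every other word j as i's candidate syntactic head via a
    learned linear combination of all attention maps, using attention in both directions.
    """

    def __init__(self, n_maps: int = 144):  # 12 layers x 12 heads in BERT-base
        super().__init__()
        self.w_dep_to_head = nn.Parameter(torch.zeros(n_maps))
        self.w_head_to_dep = nn.Parameter(torch.zeros(n_maps))

    def forward(self, attn_maps: torch.Tensor) -> torch.Tensor:
        # attn_maps: [n_maps, seq_len, seq_len], e.g. attentions.reshape(144, L, L);
        # row i of each map is token i's attention distribution.
        scores = torch.einsum("k,kij->ij", self.w_dep_to_head, attn_maps)
        scores = scores + torch.einsum("k,kji->ij", self.w_head_to_dep, attn_maps)
        return scores  # scores[i, j]: evidence that token j is token i's syntactic head

# Training sketch: cross-entropy between each row of scores and the gold head index.
# probe = SimpleAttentionProbe()
# loss = nn.functional.cross_entropy(probe(attn_maps), gold_head_indices)
```

The paper's stronger variant additionally incorporates GloVe embeddings into this weighting, which is the configuration reported at 77 UAS.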
Implications and Future Work
The findings suggest that BERT's pre-training endows the model with a considerable grasp of syntax through its attention mechanisms. This is pivotal because it implies that indirect supervision from tasks like language modeling can encode complex linguistic structure within the model. Future work may focus on designing pre-training tasks that harness this implicit syntactic knowledge more effectively, or on pruning redundant attention heads to make models more efficient.
Moreover, extending this analysis to other languages and more complex tasks could unveil further nuances in how attention mechanisms capture linguistic phenomena across different linguistic structures. Integrating these insights into model architecture could lead to more interpretable and efficient NLP models.
Conclusion
This comprehensive analysis shows that BERT's attention heads encapsulate significant linguistic information. By proposing novel methods to probe these attention mechanisms, the paper makes a valuable contribution to model interpretability and opens avenues for refining language models to leverage the syntactic awareness embedded in their attention.