What Does BERT Look At? An Analysis of BERT's Attention (1906.04341v1)

Published 11 Jun 2019 in cs.CL

Abstract: Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.

An Analysis of Attention Mechanisms in BERT: Insights and Implications

The paper "What Does BERT Look At? An Analysis of BERT's Attention," authored by Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning, presents a rigorous examination of the attention mechanisms within the BERT model. This work investigates the attention heads in BERT to understand the linguistic features they learn during pre-training.

Primary Contributions

The paper delivers several key insights into the behavior of BERT's attention heads:

  1. Common Patterns in Attention Heads: The authors observe that BERT's attention heads frequently exhibit common behavior patterns, such as attending to positional offsets or broadly over the entire sentence. Attention heads in the same layer often behave similarly, and a notable portion of the attention focuses on special tokens such as [SEP], [CLS], and punctuation marks.
  2. Correlation with Linguistic Phenomena: The paper demonstrates that certain attention heads correlate well with specific linguistic features. Heads are identified that consistently focus on direct objects of verbs, determiners of nouns, and coreferent mentions, showing surprisingly high accuracy.
  3. Attention-Based Probing Classifier: An attention-based probing classifier is proposed, which further demonstrates that BERT's attention heads capture substantial syntactic information. This classifier achieves an unlabeled attachment score (UAS) of 77 in dependency parsing.

Detailed Findings

Surface-Level Patterns

The authors explore surface-level patterns by analyzing BERT's attention heads over 1000 random Wikipedia segments. They find that many heads attend predominantly to delimiter tokens like [SEP] and [CLS], suggesting that these tokens might be used as no-op placeholders when the head's primary function is not applicable. Moreover, they observe heads that broadly attend over the whole sentence, especially in the lower layers of the network, indicating an initial 'bag-of-words' type representation.
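As an illustration of this kind of surface-level analysis, here is a minimal sketch (not the authors' code) that assumes the Hugging Face transformers library and bert-base-uncased. It extracts the attention maps for one sentence and measures how much attention each layer directs at [CLS] and [SEP]:

```python
# Minimal sketch: fraction of attention mass on [CLS]/[SEP], per layer.
# Assumes the Hugging Face `transformers` and `torch` packages.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors, one per layer,
# each shaped [batch, heads, seq_len, seq_len].
attn = torch.stack(outputs.attentions).squeeze(1)  # [layers, heads, seq, seq]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
special = [i for i, t in enumerate(tokens) if t in ("[CLS]", "[SEP]")]

# Attention mass each head places on [CLS]/[SEP], averaged over query positions.
special_mass = attn[:, :, :, special].sum(-1).mean(-1)  # [layers, heads]
for layer, per_head in enumerate(special_mass, start=1):
    print(f"layer {layer:2d}: mean attention to [CLS]/[SEP] = {per_head.mean().item():.2f}")
```

Averaging such statistics over many segments (the paper uses 1000 random Wikipedia segments) yields the layer-by-layer profiles described above.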

Specific Linguistic Features

By probing individual attention heads, the paper finds that BERT captures various syntactic relations without explicit supervision. Heads are identified that excel at finding objects of prepositions, determiners of nouns, and other specific syntactic dependencies. For instance, head 8-10 finds the direct objects of verbs with 86.8% accuracy. Coreference resolution similarly benefits from BERT's attention, with head 5-4 identifying coreferent mentions with notable accuracy.
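A minimal sketch of this per-head evaluation follows, assuming gold dependency pairs are available from a parsed corpus; the function name and data format are hypothetical rather than taken from the paper's released code:

```python
# Hypothetical helper: score one attention head as a predictor of one
# dependency relation. `attention` is a [seq, seq] map for a single head on a
# single sentence; `gold_heads` maps a dependent's token index to the index of
# its gold syntactic head for the relation of interest (e.g. direct objects).
from typing import Dict
import numpy as np

def head_prediction_accuracy(attention: np.ndarray, gold_heads: Dict[int, int]) -> float:
    """Fraction of dependents whose most-attended-to word is the gold head."""
    if not gold_heads:
        return 0.0
    correct = 0
    for dependent, gold_head in gold_heads.items():
        predicted = int(np.argmax(attention[dependent]))  # word this token attends to most
        correct += int(predicted == gold_head)
    return correct / len(gold_heads)
```

Averaged over a parsed corpus and computed separately per relation type, this kind of score underlies figures such as the 86.8% quoted above.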

Attention Head Combinations

Recognizing that syntax-related knowledge is distributed across multiple heads, the paper proposes a family of attention-based probing classifiers. The classifier leveraging BERT's attention maps along with GloVe embeddings achieves a 77 UAS, underlining that BERT's attention contains extensive syntactic information. The performance of this classifier aligns with the structural probe, signifying consistency in the syntactic information across BERT's representations and its attention mechanisms.
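The following is a simplified sketch, not the paper's exact architecture, of an attention-only probing classifier: the only trained parameters are one weight per attention map and direction, and each candidate syntactic head for a word is scored as a weighted combination of BERT's frozen attention weights.

```python
# Simplified sketch of an attention-based probing classifier. Only the per-map
# weights are trained; BERT's attention maps themselves stay frozen.
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    def __init__(self, num_maps: int = 144):  # BERT-base: 12 layers * 12 heads
        super().__init__()
        self.w_dep_to_head = nn.Parameter(torch.zeros(num_maps))
        self.w_head_to_dep = nn.Parameter(torch.zeros(num_maps))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        """attn: [num_maps, seq, seq] attention maps for one sentence, where
        attn[m, i, j] is the attention from token i to token j in map m.
        Returns [seq, seq] scores, where scores[i, j] rates token j as the
        syntactic head of token i."""
        dep_to_head = torch.einsum("m,mij->ij", self.w_dep_to_head, attn)
        head_to_dep = torch.einsum("m,mij->ji", self.w_head_to_dep, attn)
        return dep_to_head + head_to_dep

# Hypothetical training loop: softmax the scores over candidate heads for each
# word and minimize cross-entropy against gold heads from a dependency
# treebank. The paper's stronger variant also makes use of GloVe embeddings of
# the word pair, and it is that configuration which reaches 77 UAS.
```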

Implications and Future Work

The findings suggest that BERT's pre-training imparts a considerable understanding of syntax through its attention mechanisms. This is significant because it implies that indirect supervision from tasks such as language modeling can encode complex linguistic structure within the model. Future work may focus on designing pre-training tasks that harness this implicit syntactic knowledge more effectively, or on pruning redundant attention heads to make models more efficient.

Moreover, extending this analysis to other languages and more complex tasks could unveil further nuances in how attention mechanisms capture linguistic phenomena across different linguistic structures. Integrating these insights into model architecture could lead to more interpretable and efficient NLP models.

Conclusion

This comprehensive analysis shows that BERT's attention heads encapsulate significant linguistic information. By proposing novel methods for probing attention mechanisms, the paper contributes valuable insights to model interpretability and opens avenues for refining language models to better leverage the syntactic awareness embedded in their attention.

Authors (4)
  1. Kevin Clark (16 papers)
  2. Urvashi Khandelwal (12 papers)
  3. Omer Levy (70 papers)
  4. Christopher D. Manning (169 papers)
Citations (1,470)