A Primer in BERTology: What We Know About How BERT Works
The paper "A Primer in BERTology: What We Know About How BERT Works" by Anna Rogers, Olga Kovaleva, and Anna Rumshisky provides an extensive survey of the current understanding of the BERT model. The survey critically examines over 150 studies, elucidating various facets of BERT, including its architecture, knowledge representations, modifications, and training mechanisms.
Understanding BERT Architecture
BERT (Bidirectional Encoder Representations from Transformers) is fundamentally a stack of Transformer encoder layers. Each layer contains multiple self-attention heads; every head computes its own query, key, and value vectors for the input tokens, and the heads' outputs are combined to produce a contextual representation of each token. Training proceeds in two stages: pre-training with the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives, followed by fine-tuning for downstream applications.
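To make the self-attention mechanism concrete, the sketch below implements a single attention head in plain PyTorch. It is illustrative only: the projection matrices w_q, w_k, and w_v stand in for the learned parameters of a real implementation such as the one in Hugging Face Transformers.

```python
# Minimal sketch of one self-attention head (illustrative, not BERT's actual code).
import torch
import torch.nn.functional as F

def self_attention_head(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # per-token query/key/value vectors
    scores = q @ k.T / (k.shape[-1] ** 0.5)    # scaled dot-product attention
    weights = F.softmax(scores, dim=-1)        # attention distribution for each token
    return weights @ v                         # contextualized token representations

# In multi-head attention, several such heads run in parallel and their outputs
# are concatenated and linearly projected back to the model dimension.
```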
Linguistic and World Knowledge
The paper highlights several findings related to the linguistic knowledge encoded within BERT's weights:
- Syntactic Knowledge: Studies indicate BERT captures substantial syntactic information, including parts of speech, syntactic roles, and even hierarchical syntactic structure, although this knowledge is not comprehensive. BERT embeddings can recover syntactic dependencies, but extracting full parse trees from self-attention weights alone remains limited.
- Semantic Knowledge: BERT has demonstrated some competency in understanding semantic roles and entity types. Nevertheless, it struggles with certain semantic phenomena such as negation and numerical representations and exhibits brittleness in named entity recognition.
- World Knowledge: BERT exhibits a certain degree of commonsense and world knowledge, competitive with some traditional knowledge bases. However, it struggles with reasoning tasks that require drawing inferences from known facts and associations. Such knowledge is commonly probed with cloze-style queries, as sketched below.
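A minimal sketch of such a cloze-style probe using the Hugging Face Transformers fill-mask pipeline follows; the prompt is an illustrative example, not one taken from the survey.

```python
# Cloze-style probe of BERT's factual/world knowledge (illustrative prompt).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```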
BERT Components and Functionality
Embeddings and Self-Attention Heads
The paper maps out the characteristics of BERT embeddings, noting their contextual nature and ability to represent polysemy through distinct contextual clusters. Self-attention heads in BERT display various identifiable patterns, some of which correlate with syntactic and semantic functions.
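These attention patterns can be inspected directly. The sketch below pulls per-layer, per-head attention maps out of a pre-trained BERT via Hugging Face Transformers; the example sentence is arbitrary.

```python
# Extracting per-layer, per-head self-attention maps from BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
print(len(outputs.attentions), outputs.attentions[0].shape)
```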
Redundancy and Overparameterization
Experiments have revealed high levels of redundancy in BERT's architecture, with many self-attention heads and even entire layers contributing little to downstream performance. Pruning studies demonstrate that BERT can maintain efficacy even when a significant portion of its parameters is removed, underscoring its overparameterization.
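Head pruning of this kind can be tried directly in Hugging Face Transformers. Which heads are actually redundant must be determined empirically (e.g., via importance scores), so the layer and head indices below are purely hypothetical.

```python
# Pruning attention heads from a pre-trained BERT (hypothetical head choices).
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# {layer index: [head indices to remove]} -- illustrative, not an empirical selection.
model.prune_heads({0: [2, 5], 3: [1]})
print(model.config.pruned_heads)
```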
Training and Modifications
Model Architecture and Training
The paper notes that architectural choices, such as the number of layers and attention heads, affect BERT's performance, with added depth generally yielding more benefit than additional attention heads. Improvements to the training regime have also helped, including larger batch sizes and warm-starting strategies that re-use already-trained parameters.
Pre-Training Objectives
The original pre-training tasks (MLM and NSP) have inspired numerous alternatives aimed at improving BERT's overall performance. Alternative masking strategies, task- and domain-specific objectives, and the incorporation of additional data have led to improved efficiency and accuracy on several downstream tasks.
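For reference, the original MLM objective selects roughly 15% of input tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged. The sketch below implements that masking scheme by hand; in practice, library utilities such as Transformers' data collators handle this.

```python
# BERT-style MLM masking: 15% of tokens selected; of those,
# 80% -> [MASK], 10% -> random token, 10% left unchanged.
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    labels = [-100] * len(token_ids)   # -100 = position ignored by the MLM loss
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:
            labels[i] = tok                        # model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = mask_id                # replace with [MASK]
            elif r < 0.9:
                masked[i] = random.randrange(vocab_size)  # replace with a random token
            # else: keep the original token unchanged
    return masked, labels
```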
Fine-Tuning Strategies
Fine-tuning techniques explored include intermediate supervised training stages, adversarial training, and using representations from all of BERT's layers rather than only the final one. These strategies help adapt the pre-trained model more robustly to specific tasks.
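As one example of the last idea, a task classifier can learn a weighted mixture of all hidden layers instead of reading only the final one. The sketch below is illustrative: the class name and the scalar-mix weighting are assumptions, not a specific method from the survey.

```python
# Sketch: task classifier over a learned mixture of all BERT layers.
import torch
import torch.nn as nn
from transformers import BertModel

class AllLayerClassifier(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True)
        num_layers = self.bert.config.num_hidden_layers + 1         # + embedding layer
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned mixing weights
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids, attention_mask=attention_mask).hidden_states
        stacked = torch.stack(hidden)                     # (layers, batch, seq, hidden)
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = (weights[:, None, None, None] * stacked).sum(dim=0)
        return self.classifier(mixed[:, 0])               # classify from the [CLS] position
```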
Compression and Efficiency
Given the computational expense of running large models like BERT, several compression strategies have been explored:
- Distillation: Training smaller student models that mimic the behavior of larger teacher models.
- Quantization: Lowering the precision of model weights to reduce memory footprint.
- Pruning: Removing redundant components (heads, layers) within the architecture.
These methods have shown that significant reductions in model size and computational requirements do not necessarily lead to substantial losses in performance, prompting the exploration of more efficient model configurations.
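Distillation, in particular, typically combines the ordinary task loss with a term that pushes the student toward the teacher's softened output distribution. Below is a minimal sketch of such a loss; the temperature T and mixing weight alpha are illustrative hyperparameters.

```python
# Knowledge-distillation loss: match the teacher's softened predictions
# plus the standard cross-entropy on the gold labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```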
Future Directions
The paper outlines several promising areas for further research:
- Linguistic Competence Benchmarks: Developing comprehensive benchmarks to test and stress different aspects of linguistic knowledge.
- Understanding Inference: Uncovering what knowledge BERT uses during inference and focusing on teaching reasoning capabilities.
- Model Interpretation: Employing advanced probing techniques to better understand BERT's learning and decision-making processes.
Conclusion
This survey consolidates a wealth of research on BERT, providing a clearer understanding of its capabilities and limitations. It serves as a crucial resource for researchers aiming to improve BERT's architecture, training processes, and application efficiency, guiding future advancements in natural language processing models.