- Through ablation studies, the paper shows that attention and feed-forward (FFN) layers are the components most critical to high translation performance.
- The paper shows that large transformer models can retain up to 72% of BLEU scores with minimal trainable parameters, indicating significant redundancy.
- The paper finds that embedding layers are indispensable for language modeling, markedly affecting perplexity results.
Not All Parameters Are Born Equal: Attention Is Mostly What You Need
Introduction
The paper explores the significance of different components in transformer models, with a particular focus on machine translation. Transformers achieve state-of-the-art results across a wide range of applications, yet the factors underlying their success remain only partly understood. The paper categorizes transformer parameters into three groups: embeddings, attention mechanisms, and feed-forward network (FFN) layers. Through a series of ablation experiments, it evaluates the importance of each group by initializing its parameters with random values and keeping them frozen during training. The research finds that attention and FFN layers matter most for translation quality, while the embedding layer proves indispensable for language modeling.
Methodology
This research conducts ablation studies across several transformer configurations (transformer-big, transformer-base, and transformer-tiny) for both neural machine translation and language modeling. The methodology freezes the weights of specific components to measure their impact on model performance. By comparing BLEU scores from experiments in which one or more components are frozen, the paper argues that redundancy within the model compensates for the untrained components.
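As a rough illustration of this setup, the following PyTorch sketch randomly re-initializes one component group and freezes it before training. The component grouping, the parameter-name patterns, and the helper `freeze_component` are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

def freeze_component(model: nn.Module, component: str) -> None:
    """Randomly re-initialize and freeze every parameter belonging to one of
    the three groups studied in the paper: embeddings, attention, or FFN.

    The name patterns below assume a typical PyTorch transformer; adjust
    them to match the parameter names of the model you actually use.
    """
    patterns = {
        "embed": ("embed",),                      # token / positional embeddings
        "attn": ("self_attn", "multihead_attn"),  # attention projections
        "ffn": ("linear1", "linear2"),            # position-wise feed-forward layers
    }[component]

    for name, param in model.named_parameters():
        if any(p in name for p in patterns):
            nn.init.normal_(param, mean=0.0, std=0.02)  # random values
            param.requires_grad = False                 # excluded from training

# Example: an encoder-decoder transformer trained with frozen, random FFNs.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
freeze_component(model, "ffn")
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=5e-4)
```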
Key aspects of the methodology include:
- Comparison of frozen versus trainable components and their sizes (see the parameter-count sketch after this list).
- Analysis across different dataset sizes and transformer presets.
- Assessment of language modeling performance via perplexity.
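Continuing the sketch above, a hypothetical helper such as the one below can tally frozen versus trainable parameter counts per component group, which is the comparison the first bullet refers to; the name-matching rules are again assumptions about a typical implementation.

```python
from collections import Counter

import torch.nn as nn

def parameter_budget(model: nn.Module) -> Counter:
    """Count parameters per (component group, trainable/frozen) bucket,
    using the paper's three-way split plus an 'other' bucket for layer
    norms and anything else outside the three groups."""
    counts = Counter()
    for name, param in model.named_parameters():
        if "embed" in name:
            group = "embeddings"
        elif "attn" in name:
            group = "attention"
        elif "linear1" in name or "linear2" in name:
            group = "ffn"
        else:
            group = "other"
        state = "trainable" if param.requires_grad else "frozen"
        counts[(group, state)] += param.numel()
    return counts

# Usage with the `model` from the previous sketch:
budget = parameter_budget(model)
total = sum(budget.values())
trainable = sum(n for (_, state), n in budget.items() if state == "trainable")
print(f"trainable share: {100 * trainable / total:.1f}% of {total:,} parameters")
```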
Experiments and Results
The paper details a series of experiments on the transformer-big architecture using Turkish-English data. The findings reveal that the embedding layer plays a limited role in the machine translation task, while the attention and FFN components are more crucial. In some configurations, models retained 72% of the baseline BLEU score with only 6% of parameters trainable, suggesting considerable redundancy within the architecture.
Further experiments showed that shrinking the embedding or FFN dimensions, and varying which components were frozen, led to distinct performance trade-offs. Even with significant parameter reduction, certain configurations still achieved competitive BLEU scores, highlighting the modular adaptability of transformers.
Small and Distilled Model Experiments
The paper extends these observations to smaller models, including transformer-base and knowledge-distilled student models. The smaller and distilled models were more sensitive to parameter freezing, showing larger performance drops than their larger counterparts. These results underscore the importance of the attention and FFN components in small architectures, while also acknowledging the benefits of hidden redundancy in larger configurations.
Language Modeling Experiments
With a transformer architecture adapted for language modeling, the research found that the relative importance of components differs from the translation setting. Unlike in translation, embeddings were identified as the most critical component for language modeling performance, substantially affecting final perplexity.
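For reference, perplexity here is the exponential of the average token-level cross-entropy. A minimal sketch, independent of the paper's setup:

```python
import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean cross-entropy over tokens).

    logits:  (num_tokens, vocab_size) unnormalized model scores
    targets: (num_tokens,) gold next-token ids
    """
    cross_entropy = F.cross_entropy(logits, targets, reduction="mean")
    return math.exp(cross_entropy.item())

# Toy usage: random scores over a 100-token vocabulary.
logits = torch.randn(32, 100)
targets = torch.randint(0, 100, (32,))
print(f"perplexity: {perplexity(logits, targets):.2f}")
```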
Discussion
The paper concludes that while attention and FFN layers contribute most to transformer performance in machine translation, the embedding layer's role is more subtle: it provides a complementary signal rather than an essential capability. For language modeling, by contrast, embeddings are imperative for achieving low perplexity. The research also highlights the interchangeability and redundancy among attention and FFN layers, suggesting that the expressiveness of one can compensate for the other when it is restricted.
Moreover, the paper shows how architectural adaptations, such as reducing the dimensions of certain layers, can shift the relative importance of components and the performance obtained when parameters are frozen.
Conclusion
The insights offered by the paper can guide future transformer designs by emphasizing which components matter most for a given task. In practice, designers can exploit redundant architecture components under resource constraints, for example on embedded devices with limited memory. The greater adaptability of larger models permits an efficient distribution of learning responsibility among components, leading to robust performance even under such constraints.