Multi-head or Single-head? An Empirical Comparison for Transformer Training (2106.09650v1)

Published 17 Jun 2021 in cs.CL and cs.LG

Abstract: Multi-head attention plays a crucial role in the recent success of Transformer models, which leads to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from the ability of jointly attending multiple positions. In this paper, we first demonstrate that jointly attending multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends multiple positions and is more effective. Then, we suggest the main advantage of the multi-head attention is the training stability, since it has fewer layers than the single-head attention, when attending the same number of positions. For example, 24-layer 16-head Transformer (BERT-large) and 384-layer single-head Transformer have the same total attention head number and roughly the same model size, while the multi-head one is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer. As the training difficulty is no longer a bottleneck, substantially deeper single-head Transformer achieves consistent performance improvements without tuning hyper-parameters.

Analysis of Multi-Head and Single-Head Transformer Architectures

Introduction

The paper "Multi-head or Single-head? An Empirical Comparison for Transformer Training" compares the efficacy of multi-head and single-head attention mechanisms in Transformer models. Transformers have significantly influenced a wide range of applications owing to their non-recurrent, parallelizable computation, which relies on attention to capture dependencies among input tokens. The predominant belief has been that multi-head attention is instrumental to this success because of its purported ability to attend to multiple positions concurrently. This paper challenges that presumption by empirically evaluating both multi-head and single-head configurations, with a focus on training stability and task performance.

Key Insights and Contributions

  1. Multi-Head versus Single-Head Capability: The paper dismantles the apparent advantage of multi-head attention with respect to attending multiple positions. The authors argue that multi-layer single-head configurations can also attend to multiple positions by stacking conventional attention layers (illustrated in the first sketch after this list), challenging the supposed uniqueness of multi-head attention and suggesting that the superiority traditionally attributed to multi-head setups may be overestimated.
  2. Training Stability: A significant observation is that the primary benefit of multi-head attention is enhanced training stability rather than a fundamental improvement in model capacity. A direct comparison between a shallow multi-head model (BERT-large: 24 layers with 16 heads each) and a deeply stacked single-head model (384 layers with one head each) shows that both contain 24 × 16 = 384 attention heads in total and have roughly the same model size, so the single-head variant can match attention capacity provided its training can be stabilized.
  3. Recent Advances Aid Training: Modern techniques such as Adaptive Model Initialization (Admin) have been demonstrated to stabilize the training of ultra-deep single-head models (a paraphrased sketch of the idea follows this list). This stability makes the depth of single-head Transformers exploitable, which was previously infeasible due to training difficulties, and yields consistent performance improvements without extensive hyper-parameter tuning.
  4. Empirical Validation: Extensive experiments on machine translation (WMT’14 EN-DE) and BERT-style language-model pre-training corroborate these insights. Deep single-head Transformers consistently outperform their shallow multi-head counterparts across tasks, reinforcing that depth, rather than width in terms of attention heads, is the decisive factor for model capacity.
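
To make point 1 concrete, the following sketch contrasts a single 16-head attention layer with a stack of sixteen single-head attention layers: both configurations contain the same total number of attention heads, and the stacked layers still attend to multiple positions because each layer reads representations already mixed by the layers below it. This is a minimal PyTorch sketch with assumed toy dimensions (d_model = 256), not the authors' implementation; matching overall parameter count, as the paper does for BERT-large versus its 384-layer single-head counterpart, would additionally require narrowing the per-layer widths.

```python
# Minimal sketch (not the authors' code): one multi-head attention layer
# versus a stack of single-head attention layers with the same head count.
import torch
import torch.nn as nn


def multi_head_block(d_model: int, n_heads: int) -> nn.Module:
    """One layer whose attention jointly uses n_heads heads."""
    return nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)


def single_head_stack(d_model: int, n_layers: int) -> nn.ModuleList:
    """n_layers layers, each with a single attention head.

    Stacked single-head layers still attend to multiple positions: layer i
    operates on representations already mixed by layers 1..i-1.
    """
    return nn.ModuleList(
        [nn.MultiheadAttention(embed_dim=d_model, num_heads=1, batch_first=True)
         for _ in range(n_layers)]
    )


if __name__ == "__main__":
    d_model, seq_len = 256, 8
    x = torch.randn(2, seq_len, d_model)

    # 1 layer x 16 heads vs. 16 layers x 1 head: same total head count,
    # mirroring the paper's 24 x 16 = 384 heads = 384 single-head layers.
    wide = multi_head_block(d_model, n_heads=16)
    deep = single_head_stack(d_model, n_layers=16)

    y_wide, _ = wide(x, x, x)

    y_deep = x
    for layer in deep:
        out, _ = layer(y_deep, y_deep, y_deep)
        y_deep = y_deep + out  # residual connection, as in standard Transformers

    print(y_wide.shape, y_deep.shape)  # both: torch.Size([2, 8, 256])
```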
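
Point 3 credits Admin with making such deep stacks trainable. The sketch below paraphrases the core idea, namely rescaling each residual skip connection with a factor initialized from a variance-profiling forward pass so that early updates stay small; the class and helper names (AdminResidual, initialize_omegas) and the simplified profiling rule are illustrative assumptions, not the authors' released implementation.

```python
# Rough sketch of the *idea* behind Admin-style residual rescaling.
# The profiling rule below is a simplified stand-in, not the reference code.
import torch
import torch.nn as nn


class AdminResidual(nn.Module):
    """Post-LN residual block with a rescaled skip connection.

    output = LayerNorm(omega * x + sublayer(x)), where omega is initialized
    to roughly match the variance accumulated by earlier sublayers.
    """

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.omega = nn.Parameter(torch.ones(d_model))  # rescaling vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.omega * x + self.sublayer(x))


def initialize_omegas(blocks: list, x: torch.Tensor) -> None:
    """Profiling pass (assumed, simplified): run the untrained stack once,
    track each branch's output variance, and set each omega to the square
    root of the variance accumulated before that block."""
    accumulated = 1.0  # assumed unit variance for the embedding path
    with torch.no_grad():
        for block in blocks:
            block.omega.fill_(accumulated ** 0.5)
            accumulated += block.sublayer(x).var().item()
            x = block(x)


if __name__ == "__main__":
    d_model = 64
    blocks = [AdminResidual(d_model, nn.Linear(d_model, d_model)) for _ in range(6)]
    x = torch.randn(2, 8, d_model)
    initialize_omegas(blocks, x)
    print([round(float(b.omega[0]), 3) for b in blocks])  # omegas grow with depth
```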

Results and Implications

The paper's experimental results are noteworthy: training deeply stacked single-head Transformers yields superior performance, validated on standard benchmarks such as GLUE and SQuAD 2.0. The findings challenge the conventional Transformer design, which relies predominantly on multi-head attention, and suggest that a shift towards deeper, single-head structures could yield further improvements. These results point to a potential redirection in future Transformer architecture design, emphasizing depth over the number of attention heads.

The training efficiency analysis shows that, although deeper models inherently take longer to train, they deliver better convergence behavior and final performance. The exploration of different model initializations further indicates that targeted design changes can alleviate the training instabilities associated with depth.

Conclusion and Future Directions

The insights from this research prompt a re-evaluation of established Transformer architectures, paving the way for future studies to leverage the depth of single-head attention more effectively. Potential avenues include neural architecture search to balance multi-head and single-head configurations and refined initialization strategies for deep networks. Such efforts could produce more efficient, scalable, and capable models, propelling advances in natural language processing and beyond. Overall, this paper deepens the understanding of how architectural choices influence Transformer training and performance, and it poses promising questions for further exploration.

Authors (3)
  1. Liyuan Liu (49 papers)
  2. Jialu Liu (21 papers)
  3. Jiawei Han (263 papers)
Citations (29)