- The paper introduces a Simplified Attention Sub-block (SAS) that halves the attention sub-block's parameter count and improves throughput by roughly 15%.
- It demonstrates a parallel sub-block combination (SAS-P) that removes the remaining skip connections and sequential sub-block dependencies while scaling effectively to deep transformers.
- Empirical results show up to 16% faster training throughput with competitive performance on CodeParrot autoregressive modeling and on BERT pretraining with downstream GLUE evaluation.
Simplifying Transformer Blocks in Neural Networks
The paper "Simplifying Transformer Blocks" by Bobby He and Thomas Hofmann expounds on methods to streamline the transformer architecture without compromising performance and training efficiency. This paper critically evaluates the necessity of various components within the standard transformer block, aiming to reduce the complexity of these architectures, which can lead to more efficient training and inference pipelines.
Core Contributions
Transformers, since their introduction by Vaswani et al. (2017), have become foundational across state-of-the-art neural network applications. However, the standard transformer block, which interleaves attention and MLP sub-blocks with skip connections and normalization layers, is intricate: minor modifications to this arrangement can significantly degrade performance or even render models untrainable. The authors investigate whether the standard block can be simplified without losing training efficiency.
Key contributions include:
- Simplified Attention Mechanism:
- The authors introduce the Simplified Attention Sub-block (SAS). By removing the attention sub-block's skip connection and fixing the value and projection parameters to the identity, the SAS maintains performance while halving the parameter count of the attention sub-block. Notably, this simplification yields a 13% reduction in the overall model parameter count and a 15% increase in throughput.
- Parallel Sub-block Combination:
- They further refine the transformer block by adopting the parallel sub-block arrangement used in models such as PaLM and ViT-22B. Combining the SAS with a parallel arrangement of the attention and MLP sub-blocks (SAS-P) removes the remaining skip connections and sequential dependencies while maintaining robust training speeds; a sketch of such a block appears after this list.
- Removing Normalization Layers:
- Finally, the authors give theoretical arguments that the benefits normalization layers provide for signal propagation and training dynamics can be replicated implicitly by the other design choices, although their empirical results suggest that retaining normalization still yields better training stability.
Experimental Results
The experimental evaluation covers autoregressive GPT-style models trained on the CodeParrot dataset and a BERT-style encoder trained on the Pile, with downstream evaluation on the GLUE benchmark.
- CodeParrot Dataset:
- On CodeParrot, the SAS and SAS-P blocks match or slightly outperform the Pre-LN baseline. When depth is increased to 72 layers, the simplified transformers continue to scale effectively and maintain their training-speed advantage, unlike earlier simplification approaches that falter at greater depths.
- BERT and GLUE Benchmark:
- The SAS and SAS-P blocks perform competitively against the Crammed BERT baseline. With roughly 16% fewer parameters, these models match downstream GLUE performance while training up to 16% faster.
- Efficiency Metrics:
- Across the experiments, the simplified models consistently deliver gains in throughput and parameter count, suggesting meaningful cost savings when training and deploying large transformer models.
Implications and Future Directions
The implications of this research span both theoretical and practical domains. By simplifying transformer components, the paper aids in bridging the gap between deep learning theory and practice. The reduction in parameter count and improvements in training throughput directly translate to decreased computational costs and faster deployment cycles.
Moving forward, this simplification paradigm could inspire further research into even more efficient transformer models, particularly at larger model scales. The exploration into hyperparameter tuning and optimization techniques tailored to these simplified architectures could yield additional performance gains. Moreover, understanding the underlying benefits of normalization layers within this context could offer deeper insights into transformer training dynamics.
In summary, "Simplifying Transformer Blocks" presents a compelling methodology to streamline transformer architectures, offering avenues for more efficient and scalable neural networks. This work substantiates meaningful reductions in model complexity, laying the groundwork for future advancements in transformer model optimization.