Improving Transformers with Dynamically Composable Multi-Head Attention (2405.08553v2)

Published 14 May 2024 in cs.LG and cs.CL

Abstract: Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads work independently, causing problems such as low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter and computation efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\it{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement of MHA in any transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms open source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.

Enhancing Transformers with Dynamically Composable Multi-Head Attention

The research paper "Improving Transformers with Dynamically Composable Multi-Head Attention" presents an architectural modification to the standard Transformer: Dynamically Composable Multi-Head Attention (DCMHA). The core innovation lies in increasing the expressive power of Multi-Head Attention (MHA), a fundamental component of the Transformer, by dynamically composing attention heads in a parameter- and computation-efficient manner. The authors propose DCMHA as a drop-in replacement for conventional MHA, aiming to mitigate known limitations such as the low-rank bottleneck of attention score matrices and head redundancy.
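
To make the composition idea concrete, below is a minimal PyTorch-style sketch (an illustrative simplification, not the authors' actual implementation): at each query position, the vector of per-head attention scores is mixed by an input-dependent H x H matrix, so heads can share and recombine what they attend to. With identity mixing matrices the operation reduces to ordinary MHA.

```python
import torch

def compose_heads(scores: torch.Tensor, mixing: torch.Tensor) -> torch.Tensor:
    """Mix attention scores across the head dimension.

    scores: [B, H, Tq, Tk] raw attention scores (QK^T / sqrt(d_head)).
    mixing: [B, Tq, H, H] input-dependent head-mixing matrices.
    Returns composed scores with the same shape as `scores`.
    """
    # out[b, i, q, k] = sum_h mixing[b, q, i, h] * scores[b, h, q, k]
    return torch.einsum('bqih,bhqk->biqk', mixing, scores)

# Tiny sanity check: identity mixing recovers plain multi-head attention scores.
B, H, Tq, Tk = 2, 8, 16, 16
scores = torch.randn(B, H, Tq, Tk)
identity_mixing = torch.eye(H).expand(B, Tq, H, H)
assert torch.allclose(compose_heads(scores, identity_mixing), scores, atol=1e-6)
```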

Key Contributions and Methodology:

  1. Dynamic Composition Framework: The paper introduces a framework for dynamically combining attention heads, using query- and key-dependent transformations of the attention score and weight matrices. This dynamic approach is designed to increase model expressiveness beyond what static composition methods offer.
  2. Efficient Attention Matrix Composition: Instead of expanding the dimensions of the QK and OV projections for each head, DCMHA composes the attention matrices themselves. The composition uses a low-rank plus diagonal decomposition for parameter efficiency and involves both pre-composition (on attention scores, before softmax) and post-composition (on attention weights, after softmax); a simplified sketch follows this list.
  3. Implementation and Integration: DCMHA can be integrated as a drop-in replacement for MHA in existing Transformer architectures, yielding the corresponding DCFormer. DCFormer achieves notable improvements across different model scales and architectures, including the LLaMA architecture.
  4. Empirical Results and Scalability: Experimental results indicate that DCFormer significantly outperforms baseline Transformer models on language modeling tasks, matching the performance of models that require roughly 1.7x-2.0x more compute. The evaluation covers model sizes from 405M to 6.9B parameters, demonstrating the favorable scaling properties of DCMHA.
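
As a rough illustration of the low-rank plus diagonal decomposition mentioned in item 2, the sketch below builds the head-mixing map from a static base matrix, a query-dependent low-rank term, and a query-dependent diagonal gate. Module and tensor names here are hypothetical, and the paper's actual Compose function also includes key-dependent terms; this is only a sketch under those simplifications. The same transform can be applied before softmax (pre-composition, on scores) and after softmax (post-composition, on weights).

```python
import torch
import torch.nn as nn

class LowRankDiagCompose(nn.Module):
    """Simplified head-composition map: static base + dynamic low-rank + dynamic diagonal."""

    def __init__(self, n_heads: int, d_model: int, rank: int = 2):
        super().__init__()
        self.base = nn.Parameter(torch.eye(n_heads))                # static H x H mixing
        self.to_low_rank = nn.Linear(d_model, 2 * n_heads * rank)   # dynamic low-rank factors
        self.to_gate = nn.Linear(d_model, n_heads)                  # dynamic per-head gate
        self.n_heads, self.rank = n_heads, rank

    def forward(self, a: torch.Tensor, x_q: torch.Tensor) -> torch.Tensor:
        # a:   [B, H, Tq, Tk] attention scores (pre-composition) or weights (post-composition)
        # x_q: [B, Tq, d_model] query-side hidden states that drive the dynamic terms
        B, _, Tq, _ = a.shape
        uv = self.to_low_rank(x_q).view(B, Tq, 2, self.n_heads, self.rank)
        u, v = uv[:, :, 0], uv[:, :, 1]                             # each [B, Tq, H, rank]
        gate = torch.tanh(self.to_gate(x_q))                        # [B, Tq, H]
        # Static base mixing across heads.
        out = torch.einsum('ih,bhqk->biqk', self.base, a)
        # Dynamic low-rank mixing: project the H heads down to `rank` and back up.
        out = out + torch.einsum('bqir,bqhr,bhqk->biqk', u, v, a)
        # Dynamic diagonal gating of each head's own scores.
        out = out + gate.permute(0, 2, 1).unsqueeze(-1) * a
        return out

# Usage sketch: compose raw scores, then apply softmax as usual.
compose = LowRankDiagCompose(n_heads=8, d_model=64)
scores = torch.randn(2, 8, 16, 16)
hidden = torch.randn(2, 16, 64)
weights = torch.softmax(compose(scores, hidden), dim=-1)
```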

Implications and Future Directions:

The enhancement of Transformers with DCMHA holds substantial implications for artificial intelligence, particularly language modeling. By addressing the inefficiencies of MHA through dynamic head composition, DCFormer can potentially reduce computation requirements while improving performance, a crucial advance given the growing size and complexity of LLMs.

The successful implementation of DCMHA in both natural language and vision transformers suggests that this compositional approach may be broadly applicable across modalities and architectures. However, the added composition step introduces some computational overhead, so future work could further optimize the trade-off between expressive power and computational cost. It would also be valuable to study the interpretability of dynamically composed attention mechanisms to better understand their behavior and improve model transparency.

In conclusion, the Dynamically Composable Multi-Head Attention mechanism represents a significant step in improving the adaptability and efficiency of Transformer models, with promising applications across various AI tasks and substantial potential for further refinement and innovation.

Authors (4)
  1. Da Xiao (5 papers)
  2. Qingye Meng (7 papers)
  3. Shengping Li (2 papers)
  4. Xingyuan Yuan (6 papers)
Citations (1)