Fast Transformer Decoding: One Write-Head is All You Need (1911.02150v1)

Published 6 Nov 2019 in cs.NE, cs.CL, and cs.LG

Abstract: Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.

Fast Transformer Decoding: One Write-Head is All You Need

The paper introduces a variant of the Transformer's attention mechanism, dubbed "multi-query attention," that addresses the memory-bandwidth bottleneck of incremental inference in Transformer models, in particular the inefficiency of repeatedly loading the per-head key and value tensors required by the standard multi-head attention architecture.

Background and Motivation

Transformers have become a cornerstone of sequence modeling, outperforming RNNs largely because they parallelize across the sequence during training. Despite this training-time advantage, incremental inference is bottlenecked by the multi-head attention layer's need to reload the large "keys" and "values" tensors at every decoding step. Multi-query attention mitigates this by shrinking these tensors, which lowers the memory-bandwidth requirement and speeds up inference with minimal loss of model quality.
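
As a rough illustration of why this matters (not taken from the paper; the dimensions below, including d_model = 1024, h = 8 heads, d_k = d_v = 128, a context of n = 2048 tokens, and fp32 storage, are illustrative assumptions), the following sketch compares the key/value cache that must be re-read at every incremental decoding step under multi-head versus multi-query attention:

```python
# Illustrative comparison of per-step key/value cache traffic for one layer.
# All dimensions are assumptions for the sake of the example, not the paper's settings.
d_model = 1024       # model width
h = 8                # attention heads
d_k = d_v = 128      # per-head key/value width (d_model / h)
n = 2048             # tokens already decoded (context length)
bytes_per_elem = 4   # fp32

# Multi-head: every head keeps its own keys and values -> h * n * (d_k + d_v) elements.
multi_head_kv_bytes = h * n * (d_k + d_v) * bytes_per_elem

# Multi-query: one shared key/value head -> n * (d_k + d_v) elements.
multi_query_kv_bytes = n * (d_k + d_v) * bytes_per_elem

print(f"multi-head KV cache per layer:  {multi_head_kv_bytes / 2**20:.1f} MiB")
print(f"multi-query KV cache per layer: {multi_query_kv_bytes / 2**20:.1f} MiB")
print(f"reduction factor: {multi_head_kv_bytes / multi_query_kv_bytes:.0f}x")
```

Because this cache is re-read for every generated token, the h-fold reduction in its size translates roughly into an h-fold reduction in attention-layer memory traffic during incremental decoding.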

Multi-Query Attention Mechanism

The paper replaces the multi-head attention layers with multi-query attention, in which keys and values are shared across all heads rather than projected separately for each head as in the standard formulation. The queries keep their full per-head dimensionality and expressiveness, while the key and value tensors shrink by a factor equal to the number of heads. This change improves memory-access efficiency without requiring drastic modifications to the rest of the architecture.
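
A minimal sketch of one incremental multi-query attention step, written in plain NumPy under assumed shapes (the function name, tensor layout, and lack of batching are simplifications for illustration, not the paper's actual TensorFlow code):

```python
import numpy as np

def multi_query_attention_step(x, W_q, W_k, W_v, W_o, k_cache, v_cache):
    """One incremental decoding step with multi-query attention (illustrative sketch).

    x:        (d_model,)         current token's hidden state
    W_q:      (h, d_model, d_k)  per-head query projections
    W_k, W_v: (d_model, d_k)     shared key/value projections (a single head)
    W_o:      (h, d_k, d_model)  per-head output projections
    k_cache:  (t, d_k)           keys of previously decoded tokens
    v_cache:  (t, d_k)           values of previously decoded tokens
    """
    d_k = W_k.shape[1]

    # Queries stay per-head; keys and values are computed once and shared by all heads.
    q = np.einsum("d,hdk->hk", x, W_q)          # (h, d_k)
    k_new = x @ W_k                             # (d_k,)
    v_new = x @ W_v                             # (d_k,)

    # Append to the (much smaller) shared cache.
    K = np.vstack([k_cache, k_new])             # (t+1, d_k)
    V = np.vstack([v_cache, v_new])             # (t+1, d_k)

    # Every head attends over the same K and V.
    logits = q @ K.T / np.sqrt(d_k)             # (h, t+1)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    context = weights @ V                       # (h, d_k)

    # Per-head output projections are summed back into the model dimension.
    y = np.einsum("hk,hkd->d", context, W_o)    # (d_model,)
    return y, K, V
```

The only difference from standard multi-head attention is that W_k, W_v, and hence the cache carry no head dimension; the queries and the output projection keep theirs, which is why quality degrades only slightly.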

Methodology and Experiments

The efficacy of the proposed multi-query attention is evaluated on the WMT 2014 English-German translation task and the Billion-Word Language Modeling Benchmark. The experiments compare the quality and speed of standard multi-head attention against multi-query attention while holding the total parameter count constant, with the parameters saved in the attention layers reinvested by enlarging the feed-forward hidden layers.
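
One way to picture this compensation (the dimensions and the accounting below are illustrative assumptions, not the paper's exact configuration): the parameters removed from the key/value projections are added back by widening the feed-forward hidden layer.

```python
# Rough per-layer parameter accounting (illustrative dimensions, not the paper's exact config).
d_model, h, d_k, d_v, d_ff = 1024, 8, 128, 128, 4096

# Attention projection parameters: Q, K, V projections plus the output projection.
multi_head_attn  = h * d_model * (d_k + d_k + d_v) + h * d_v * d_model          # per-head K and V
multi_query_attn = h * d_model * d_k + d_model * (d_k + d_v) + h * d_v * d_model  # shared K and V

saved = multi_head_attn - multi_query_attn

# The feed-forward block has two d_model x d_ff matrices, so widening d_ff by
# saved / (2 * d_model) keeps the per-layer parameter count roughly constant.
d_ff_new = d_ff + saved // (2 * d_model)
print(f"attention params saved per layer: {saved:,}")
print(f"compensated feed-forward width:   {d_ff_new}")
```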

Experimental Results

  • Quality Metrics: On the translation task, the multi-query model showed a slight drop in BLEU score and a slight increase in perplexity relative to the multi-head baseline. It nevertheless remained clearly better than alternative approaches that simply use fewer heads or reduce the key and value dimensionality.
  • Speed Metrics: Inference speed improved substantially. Multi-query attention made incremental, token-by-token decoding in the Transformer much faster, cutting the computation time of each incremental step by more than a factor of 10 compared with the baseline model.

Implications and Future Directions

The paper posits that multi-query attention could broaden the adoption of Transformer models in settings where inference speed is paramount. Reducing the ratio of memory access to computation also aligns better with modern hardware such as GPUs and TPUs, whose arithmetic throughput far exceeds their memory bandwidth.

Future research could explore integrating multi-query attention into a wider range of tasks beyond language modeling and translation, assessing the trade-offs in different linguistic and computational settings. Combining multi-query attention with other efficiency-oriented techniques, such as sparse or localized attention patterns, might yield further gains in both speed and robustness.

By providing a less resource-intensive alternative to traditional multi-head attention, the paper contributes to optimizing large-scale sequence model deployment, particularly in real-time and resource-constrained environments.

Authors (1)
  1. Noam Shazeer (37 papers)
Citations (342)