Transformer Quality in Linear Time (2202.10447v2)

Published 21 Feb 2022 in cs.LG, cs.AI, cs.CL, and cs.NE

Abstract: We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.

Transformer Quality in Linear Time

The paper "Transformer Quality in Linear Time" revisits the design of Transformer models to address their weaknesses in handling long sequences. Standard Transformers scale quadratically with context length, which limits how effectively they can process the long contexts many applications depend on. The paper tackles this by combining a new layer design with a complementary linear approximation strategy, aiming to increase processing speed while maintaining, and in some settings improving, model quality.

The paper first introduces the Gated Attention Unit (GAU), a layer designed to lessen the computational burden of self-attention. Unlike the multi-head self-attention (MHSA) used in standard Transformers, GAU relies on a single, weaker attention head, reducing complexity and compute. The key is a multiplicative gating mechanism that compensates for the weaker attention, allowing simpler attention formulations without significant quality loss. Experiments with GAUs show efficiency gains while matching or exceeding standard Transformers at short context lengths.
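As a rough illustration of this structure, the PyTorch sketch below implements a GAU-style layer: one expanded branch acts as values, another as a multiplicative gate, and a shared low-dimensional representation is turned into queries and keys by cheap per-dimension scales and offsets, with a single attention head scoring via squared ReLU. All names and dimensions are illustrative, and details such as the relative position bias are omitted; this is a sketch of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAUSketch(nn.Module):
    """Simplified GAU-style layer (illustrative names and sizes, not the authors' code)."""

    def __init__(self, d=512, e=1024, s=128):
        super().__init__()
        self.u_proj = nn.Linear(d, e)   # gate branch
        self.v_proj = nn.Linear(d, e)   # value branch
        self.z_proj = nn.Linear(d, s)   # shared low-dim basis for queries and keys
        self.o_proj = nn.Linear(e, d)   # output projection
        # cheap per-dimension scales/offsets specialize Z into queries and keys
        self.gamma = nn.Parameter(torch.ones(2, s))
        self.beta = nn.Parameter(torch.zeros(2, s))

    def forward(self, x):               # x: (batch, length, d)
        n = x.shape[1]
        u = F.silu(self.u_proj(x))      # (b, n, e) gate
        v = F.silu(self.v_proj(x))      # (b, n, e) values
        z = F.silu(self.z_proj(x))      # (b, n, s)
        q = z * self.gamma[0] + self.beta[0]
        k = z * self.gamma[1] + self.beta[1]
        # single-head attention; squared ReLU replaces softmax (position bias omitted)
        scores = torch.einsum('bns,bms->bnm', q, k) / n
        weights = torch.relu(scores) ** 2
        attended = torch.einsum('bnm,bme->bne', weights, v)
        return self.o_proj(u * attended)   # elementwise gating, then project back to d
```

Because a GAU plays the role of both the attention and the gated MLP, roughly two such layers take the place of each standard Transformer block (MHSA plus feed-forward), keeping model size comparable.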

The paper then replaces quadratic attention with a linear-time approximation. The resulting model, FLASH (Fast Linear Attention with a Single Head), relies on chunking: the sequence is split into non-overlapping chunks, exact quadratic attention is computed within each chunk, and a linear attention mechanism carries information across chunks. This reduces the overall complexity from quadratic to linear in the context length, so FLASH remains effective for both short and long contexts and delivers substantial training speedups without compromising quality. For instance, FLASH achieves up to a 12.1 times speedup over strong Transformer baselines on auto-regressive language modeling of PG-19 while delivering competitive perplexity.
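A minimal sketch of this mixed-chunk idea is shown below (hypothetical function name and shapes, PyTorch for illustration): exact quadratic attention runs inside each fixed-size chunk, while a linear term built from per-chunk key-value summaries carries information across chunks. Scaling factors are simplified and FLASH's gating and position biases are left out; the sketch is only meant to make the linear-complexity argument concrete.

```python
import torch

def mixed_chunk_attention(q, k, v, chunk=256, causal=True):
    """Sketch of quadratic-within-chunk plus linear-across-chunk attention.

    q, k: (batch, length, s) low-dimensional queries/keys; v: (batch, length, e) values.
    Assumes length is a multiple of `chunk`; scaling factors are illustrative.
    """
    b, n, s = q.shape
    e = v.shape[-1]
    g = n // chunk
    q, k = q.view(b, g, chunk, s), k.view(b, g, chunk, s)
    v = v.view(b, g, chunk, e)

    # 1) Local part: exact quadratic attention inside each chunk, cost g * chunk^2.
    scores = torch.einsum('bgns,bgms->bgnm', q, k) / chunk
    if causal:
        mask = torch.tril(torch.ones(chunk, chunk, device=q.device)).bool()
        scores = scores.masked_fill(~mask, 0.0)
    local = torch.einsum('bgnm,bgme->bgne', torch.relu(scores) ** 2, v)

    # 2) Global part: linear attention across chunks via per-chunk summaries k^T v.
    summaries = torch.einsum('bgms,bgme->bgse', k, v)            # (b, g, s, e)
    if causal:
        # each chunk sees only strictly earlier chunks: exclusive cumulative sum
        state = torch.cumsum(summaries, dim=1) - summaries
    else:
        state = summaries.sum(dim=1, keepdim=True).expand_as(summaries)
    global_part = torch.einsum('bgns,bgse->bgne', q, state) / n

    return (local + global_part).reshape(b, n, e)
```

For a fixed chunk size c, the local part costs O(n·c) and the cross-chunk part operates on n/c small summaries, so total cost grows linearly in the sequence length n rather than quadratically.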

A further advantage of FLASH is efficient auto-regressive training. Conventional linear attention models are inefficient to train because their causal formulation requires a long chain of sequential state updates that underutilizes modern accelerators. FLASH's chunking strategy mitigates this: only a small number of cross-chunk summaries must be accumulated sequentially, so most of the computation parallelizes well while output quality is maintained on long sequences.
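To make this point concrete, the snippet below (illustrative sizes and names) contrasts the token-by-token state recurrence that slows training of vanilla causal linear attention with the per-chunk accumulation used by the chunked formulation: the sequential dependency shrinks from n token steps to n/c chunk summaries, which are combined with a single cumulative sum that accelerators handle efficiently.

```python
import torch

b, n, s, e, c = 2, 2048, 128, 1024, 256   # illustrative sizes
k, v = torch.randn(b, n, s), torch.randn(b, n, e)

# Vanilla causal linear attention: the running state sum_t k_t^T v_t must be
# updated token by token, i.e. n = 2048 sequential steps during training.
state = torch.zeros(b, s, e)
for t in range(n):
    state = state + torch.einsum('bs,be->bse', k[:, t], v[:, t])

# Chunked formulation: only n / c = 8 chunk summaries need to be accumulated,
# and that accumulation is a single cumulative sum over the chunk axis.
kc, vc = k.view(b, n // c, c, s), v.view(b, n // c, c, e)
summaries = torch.einsum('bgcs,bgce->bgse', kc, vc)        # (b, 8, s, e)
carry = torch.cumsum(summaries, dim=1) - summaries         # state entering each chunk

# Both routes accumulate the same total state, up to floating-point error.
print((state - summaries.sum(dim=1)).abs().max())
```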

The paper backs these claims with comprehensive experiments across tasks and datasets, including Wiki-40B and PG-19 for auto-regressive language modeling and C4 for masked language modeling. The results indicate that, even as model size scales, FLASH consistently matches or outperforms augmented Transformers such as Transformer++ in both training speed and quality, making it a strong choice for demanding sequence lengths.

In conclusion, the methods proposed in this work point toward Transformer architectures that efficiently accommodate growing sequence-length demands. These innovations could serve as building blocks for AI models that must process long sequences efficiently, opening new possibilities for scalable applications. Future research might refine FLASH further, explore applications across diverse domains, and extend the approach to other architectures, offering insight into how to achieve linear complexity without sacrificing performance.

Authors (4)
  1. Weizhe Hua (11 papers)
  2. Zihang Dai (27 papers)
  3. Hanxiao Liu (35 papers)
  4. Quoc V. Le (128 papers)
Citations (181)