Memory-Efficient Transformers via Top-k Attention
In natural language processing, the Transformer architecture has become a cornerstone for applications ranging from machine translation to question answering. A key component of this architecture is dot-product attention, which is computationally intensive: its memory cost grows quadratically with the input sequence length. As addressing this quadratic cost became imperative, several approximation techniques were proposed. However, many of them cannot be applied directly to existing models pre-trained with vanilla attention and instead require additional corrective pre-training.
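To make the quadratic term concrete, the sketch below shows standard scaled dot-product attention; the full score matrix it materializes is what drives the memory cost. The function name and single-head, unbatched shapes are illustrative, not taken from the paper's code.

```python
import torch

def vanilla_attention(q, k, v):
    """Standard scaled dot-product attention.

    q, k, v: tensors of shape (L, d). The intermediate `scores` tensor
    has shape (L, L), which is what makes the memory cost quadratic in
    the sequence length L.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (L, L) score matrix
    weights = torch.softmax(scores, dim=-1)       # row-wise softmax
    return weights @ v                            # (L, d)
```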
The paper "Memory-efficient Transformers via Top- Attention" introduces a novel, efficient, and highly accurate approximation named top- attention, aiming to reduce the memory footprint and computation cost of vanilla attention without necessitating corrective pre-training. This method processes queries in chunks, calculating the top- scores concerning the keys, thus maintaining a linear memory usage relative to the input size, comparable to linear attention variants like Performer and RFA.
Key Contributions and Numerical Results
- Linear Memory Consumption: Top-k attention uses memory linear in the sequence length, in contrast to the quadratic requirement of vanilla attention. The same idea also applies to feed-forward layers, which can be cast in the query-key-value framework with ReLU in place of the row-wise softmax, yielding substantial memory savings there as well (see the sketch after this list).
- Benchmarking Performance: When used in the multi-head attention and feed-forward layers of models such as T5 and UnifiedQA, top-k attention achieved accuracy nearly identical to vanilla attention across training from scratch, fine-tuning, and zero-shot inference. These evaluations included benchmarks designed to test long-sequence processing, such as the Long Range Arena benchmark.
- Efficiency: The method sustains competitive runtime while providing the memory savings that are crucial for processing very long inputs. Notably, vanilla attention ran out of memory beyond a certain sequence length, whereas top-k attention handled such inputs comfortably.
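The feed-forward reinterpretation mentioned above can be sketched as follows. Viewing ReLU(x W1) W2 as attention with ReLU in place of the row-wise softmax follows the paper's description; the function name and shape conventions here are illustrative assumptions.

```python
import torch

def feedforward_as_attention(x, w1, w2):
    """A feed-forward layer ReLU(x @ w1) @ w2 viewed through the
    query-key-value lens: rows of x are queries, columns of w1 act as
    keys, rows of w2 act as values, and ReLU replaces the row-wise
    softmax. The (L, d_ff) activation matrix plays the role of the
    attention-score matrix, so the same query-chunked top-k trick can
    be applied to it. Shapes: x (L, d_model), w1 (d_model, d_ff),
    w2 (d_ff, d_model).
    """
    scores = x @ w1                  # (L, d_ff) "scores" over d_ff key/value pairs
    weights = torch.relu(scores)     # ReLU instead of row-wise softmax
    return weights @ w2              # (L, d_model)
```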
Contrasting Claims
The paper emphasizes the plug-and-play nature of top-k attention, differentiating it from variants that require a significant adaptation phase such as corrective pre-training. Unlike prior linear-attention offerings, top-k attention can be dropped directly into existing pre-trained models, as demonstrated with UnifiedQA, without compromising performance even when attending to fewer than 1% of the original number of keys.
Implications and Future Directions
Practically, top-k attention lets researchers with limited computational resources experiment with and deploy large pre-trained models on datasets with long contexts or large feed-forward dimensions. Theoretically, it points toward further sparse approximations that cut computational cost while remaining compatible with pre-existing architectures.
Future work could exploit the sparsity inherent in top-k attention with specialized sparse-computation techniques, improving runtime alongside the memory savings. Extending the approach to broader tasks and other model architectures also remains a compelling direction for efficient neural computation.
In summary, the paper provides a thorough evaluation of top-k attention, highlighting its potential for memory-efficient use of Transformer models and paving the way for further work in computationally constrained environments.