Cottention: Linear Transformers With Cosine Attention

Published 27 Sep 2024 in cs.LG | (2409.18747v1)

Abstract: Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.

Abstract PDF HTML Upgrade to Chat

Authors (3)

Summary

The paper introduces Cottention, which reformulates the attention mechanism using cosine similarity to achieve linear memory and time complexity.
It implements a custom CUDA kernel and an RNN-inspired design to ensure efficient computation and constant memory usage during inference.
Performance evaluations demonstrate that cosine attention preserves comparable accuracy to softmax in models like BERT and GPT-J while processing longer sequences efficiently.

Cottention: Linear Transformers With Cosine Attention

Introduction

The paper "Cottention: Linear Transformers With Cosine Attention" presents a novel approach to attention mechanisms, addressing the limitations of softmax attention in terms of memory complexity. Softmax attention's quadratic memory and time complexity restrict the efficient processing of long sequences. This research introduces "Cottention," which replaces the softmax operation with cosine similarity, achieving linear complexity relative to sequence length. The reformulation allows Cottention to be expressed as a recurrent neural network (RNN), enabling constant memory usage during inference for both bidirectional and causal tasks. This paper explores the implementation and evaluates its performance in comparison to traditional attention mechanisms.

Cosine Attention Mechanism

Cosine attention leverages the properties of cosine similarity to sidestep the quadratic complexity typically associated with softmax attention. By normalizing the query and key vectors and utilizing an optimized linear algebraic formulation, Cottention achieves a linear memory footprint. The core of cosine attention is the computation of a similarity matrix through L2 normalization of the queries and keys, conducting the following operation:

$\text{CosAttention}(Q, K, V) = \mathcal{N}(Q) \cdot \mathcal{N}(K)^T \cdot V$

where $\mathcal{N}(X) = \frac{X}{\lVert X \rVert_2}$ denotes L2 normalization. The use of a custom CUDA kernel ensures the efficient computation of this operation.

Figure 1: Recurrent neural network representation of cosine attention where the queries, keys, and values are of shape $(N, H, (d_{H\_key/H\_key/H\_value}$ )

Cosine attention also addresses potential instability during training by introducing a learnable scalar parameter for normalization. This parameter is initialized and then optimized during training, allowing the model to adapt its attention scaling appropriately.

Performance Evaluation

Cosine attention demonstrates compelling performance metrics when evaluated against traditional softmax mechanisms. Testing on both bidirectional (BERT) and causal (GPT-J) models reveals that Cottention achieves comparable accuracy while maintaining a significantly reduced memory footprint. This efficiency is particularly valuable when processing sequences longer than what is feasible with standard softmax attention.

Figure 2: Perplexity comparison for models with 300M (left) and 1.2B (right) parameters.

In terms of computational resource usage, cosine attention exhibits linear complexity in memory as a function of sequence length. Figure 3 highlights these attributes, contrasting the exponential scaling of softmax attention and the linear growth observed with cosine similarity-based models.

Figure 3: Time and memory usage comparison between softmax and cosine attention models. Softmax models exhibit quadratic complexity, while cosine models demonstrate linear complexity with respect to sequence length.

Algorithmic Insights

The architectural underpinnings of cosine attention allow it to be framed in terms of an RNN. This conceptual shift facilitates constant memory usage during inference—an attractive feature in scenarios where computational resources are constrained. The hidden state design mimics RNN dynamics and proves beneficial in modeling long-range dependencies without substantial resource allocation typically required for softmax-based attention.

Conclusion and Future Directions

The introduction of cosine attention provides a notable step forward in reducing the computational burden associated with traditional attention mechanisms. By maintaining comparable performance levels and facilitating longer sequence processing, Cottention positions itself as a powerful alternative for large-scale language processing tasks. Further work will involve optimizing custom CUDA implementations, exploring enhanced normalization strategies, and scaling this approach to larger model architectures. The potential of matrix factorization techniques could further unleash new efficiencies, leveraging the distinct properties of cosine-based formulations. Overall, this work lays promising groundwork for the future of efficient and scalable attention mechanisms in AI.

Markdown Report Issue