cosFormer: Rethinking Softmax in Attention (2202.08791v1)

Published 17 Feb 2022 in cs.CL

Abstract: Transformer has shown great successes in natural language processing, computer vision, and audio processing. As one of its core components, the softmax attention helps to capture long-range dependencies yet prohibits its scale-up due to the quadratic space and time complexity to the sequence length. Kernel methods are often adopted to reduce the complexity by approximating the softmax operator. Nevertheless, due to the approximation errors, their performances vary in different tasks/corpus and suffer crucial performance drops when compared with the vanilla softmax attention. In this paper, we propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer in both causal and cross attentions. cosFormer is based on two key properties of softmax attention: i). non-negativeness of the attention matrix; ii). a non-linear re-weighting scheme that can concentrate the distribution of the attention matrix. As its linear substitute, cosFormer fulfills these properties with a linear operator and a cosine-based distance re-weighting mechanism. Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark. The source code is available at https://github.com/OpenNLPLab/cosFormer.

An Overview of cosFormer: Rethinking Softmax in Linear Transformers

The paper presents a novel approach to improving the efficiency of Transformers, a class of models that have achieved significant success across diverse AI fields such as NLP, computer vision, and audio processing. At the core of the Transformer's attention mechanism is the softmax operation, which is highly effective at capturing long-range dependencies. However, the quadratic time and space complexity of softmax attention in the sequence length becomes a bottleneck as inputs grow longer. Prior efforts have relied on kernel approximations of the softmax operator to achieve linear complexity, yet such methods suffer from approximation errors and lack robustness across tasks. The paper introduces cosFormer, a linear transformer variant that not only reduces complexity but aims to match or exceed the performance of the vanilla Transformer.

The design of cosFormer rests on two fundamental properties of softmax attention: the non-negativity of the attention matrix, and the non-linear re-weighting that concentrates the attention distribution. cosFormer preserves these characteristics using a ReLU activation and a cosine-based re-weighting mechanism. The authors report that cosFormer outperforms other linear attention models and, in some cases, the vanilla Transformer. Extensive benchmarks spanning language modeling, text understanding, and long-sequence processing validate the effectiveness of cosFormer.
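As a brief sketch of the mechanism (reconstructed from the paper's description, so the exact normalization details may differ from the official implementation): let Q'_i = ReLU(Q_i) and K'_j = ReLU(K_j) be the transformed query and key at positions i and j, and let M be a length constant no smaller than the sequence length. The cosine-re-weighted similarity then decomposes via the angle-difference identity:

$$
s(Q'_i, K'_j)
= Q'_i {K'_j}^{\top} \cos\!\left(\frac{\pi}{2}\cdot\frac{i-j}{M}\right)
= \left(Q'_i \cos\frac{\pi i}{2M}\right)\left(K'_j \cos\frac{\pi j}{2M}\right)^{\top}
+ \left(Q'_i \sin\frac{\pi i}{2M}\right)\left(K'_j \sin\frac{\pi j}{2M}\right)^{\top}
$$

Because each term is separable in i and j, the key-value statistics can be accumulated once over j and reused for every query, which is what makes the attention linear in the sequence length.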

Key Contributions and Results

  1. Non-Negative Property Enforcement:
    • cosFormer applies a ReLU activation to the queries and keys so that every entry of the attention matrix is non-negative. This excludes negatively correlated contextual information, which could otherwise degrade model performance.
  2. Cosine-Based Re-weighting Mechanism:
    • The cosine-based re-weighting introduces a non-linearity that reproduces the advantageous traits of softmax, such as concentrating the attention distribution and stabilizing training. It also inherently encodes a locality bias that benefits many NLP tasks.
  3. Linear Time and Space Complexity:
    • By rearranging the order of the attention computation (multiplying keys and values first), cosFormer reduces both the time and space complexity of attention to linear in the input length; a minimal sketch of this rearrangement appears after this list. This efficiency gain makes it practical to handle much longer sequences, a critical requirement in contemporary NLP applications.
  4. Empirical Validation:
    • cosFormer demonstrates superior performance on the Long-Range Arena benchmark, securing top positions across several tasks. It also shows competitive results on autoregressive and bidirectional language modeling and on downstream text classification tasks.
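
To make the complexity argument concrete, below is a minimal NumPy sketch of non-causal linear attention with a ReLU feature map and the cosine re-weighting decomposition described above. This is an illustrative reconstruction, not the reference implementation: the function name cosformer_attention and the default choice of the length constant M are placeholders, and the official repository linked in the abstract should be treated as authoritative.

```python
import numpy as np


def cosformer_attention(Q, K, V, M=None):
    """Bidirectional (non-causal) cosFormer-style attention sketch.

    Q, K: (n, d) queries/keys; V: (n, d_v) values.
    M: re-weighting length constant, assumed >= n (defaults to n).
    """
    n = Q.shape[0]
    M = n if M is None else M

    # (1) Non-negativity: ReLU feature map replaces the softmax exponentiation.
    Qp, Kp = np.maximum(Q, 0.0), np.maximum(K, 0.0)

    # (2) Cosine re-weighting cos(pi * (i - j) / (2M)), decomposed with
    #     cos(a - b) = cos(a)cos(b) + sin(a)sin(b) so it stays separable in i and j.
    pos = np.arange(n)[:, None]
    cos_w = np.cos(np.pi * pos / (2 * M))
    sin_w = np.sin(np.pi * pos / (2 * M))
    Qc, Qs = Qp * cos_w, Qp * sin_w
    Kc, Ks = Kp * cos_w, Kp * sin_w

    # (3) Rearranged computation: Q (K^T V) instead of (Q K^T) V,
    #     i.e. O(n * d * d_v) time and O(d * d_v) extra memory, not O(n^2).
    num = Qc @ (Kc.T @ V) + Qs @ (Ks.T @ V)
    den = Qc @ Kc.sum(axis=0)[:, None] + Qs @ Ks.sum(axis=0)[:, None]
    return num / np.maximum(den, 1e-6)


# Usage sketch: 512 tokens, 64-dimensional heads, computed in linear time/space.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(512, 64)) for _ in range(3))
out = cosformer_attention(Q, K, V)  # shape (512, 64)
```

The key design point is the reordering in step (3): because the ReLU feature map and the decomposed cosine weights keep the similarity separable, the key-value product can be aggregated once and shared by all queries. A causal variant would instead maintain running prefix sums of those key-value products so that each position only attends to earlier ones.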

The source code for cosFormer is made publicly accessible, encouraging further exploration and validation across varied datasets and application domains.

Implications and Future Directions

The development of cosFormer presents significant implications for both theoretical understanding and practical deployment of Transformers in large-scale applications. The incorporation of a cosine-based re-weighting mechanism is particularly noteworthy, merging simplicity with the capacity to emulate the refined properties of softmax attention. This contribution prompts a broader consideration of how other non-linearities may be innovatively exploited within attention mechanisms to improve efficiency without compromising accuracy.

Future work may explore extensions of cosFormer within other domains such as time-series forecasting or expand upon relative positional encodings to amplify its current capabilities. There are also open opportunities to analytically formalize the nuanced impacts of the cosine-based re-weighting on Transformer interpretability, potentially shedding light on novel multi-head dynamics.

In summary, cosFormer challenges existing paradigms by elegantly disentangling the complexities associated with softmax in Transformers and setting a foundation upon which more scalable and versatile attention mechanisms can be constructed.

Authors (9)
  1. Zhen Qin (105 papers)
  2. Weixuan Sun (31 papers)
  3. Hui Deng (133 papers)
  4. Dongxu Li (40 papers)
  5. Yunshen Wei (2 papers)
  6. Baohong Lv (2 papers)
  7. Junjie Yan (109 papers)
  8. Lingpeng Kong (134 papers)
  9. Yiran Zhong (75 papers)
Citations (178)