
Softmax-free Linear Transformers (2207.03341v3)

Published 5 Jul 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the development of methods that approximate self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximations, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. Preserving the softmax operation challenges any subsequent linearization efforts. Guided by this insight, a family of Softmax-Free Transformers (SOFT) is proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling a full self-attention matrix to be approximated under low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, much longer token sequences are permitted by SOFT, resulting in a superior trade-off between accuracy and complexity. Code and models are available at https://github.com/fudan-zvg/SOFT.

Softmax-Free Linear Transformers

The paper "Softmax-free Linear Transformers," authored by Jiachen Lu et al., addresses a notable challenge in the current trajectory of Vision Transformers (ViTs), specifically the quadratic complexity inherent in computation and memory usage due to the self-attention mechanism's reliance on softmax. The authors propose a novel approach that circumvents this complexity by removing the softmax normalization, proposing a family of Softmax-Free Transformers (SOFT).

Key Insights and Contributions

The authors identify significant theoretical and empirical limitations in existing methods that approximate self-attention at linear complexity, and pinpoint the dependency on softmax for normalizing the scaled dot-product between tokens as the core issue. In response, SOFT replaces the conventional dot-product similarity with a Gaussian kernel function, which allows the full self-attention matrix to be approximated through a low-rank matrix decomposition.
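A minimal sketch of this idea follows, assuming PyTorch, a Gaussian kernel exp(-||x_i - x_j||^2 / 2) shared between queries and keys, and landmark tokens obtained by average pooling the sequence. The landmark-sampling scheme and the use of torch.linalg.pinv are simplifications of the paper's procedure, which estimates the pseudoinverse iteratively (see the next sketch).

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(x, y):
    # exp(-||x_i - y_j||^2 / 2) for every token pair; x: (b, n, d), y: (b, m, d)
    return torch.exp(-0.5 * torch.cdist(x, y) ** 2)

def soft_attention(x, v, num_landmarks=49):
    """Softmax-free attention sketch with a Nystrom-style low-rank approximation.

    x: (b, n, d) token features used as both queries and keys,
    v: (b, n, d) values. The full n x n kernel matrix is never formed;
    it is reconstructed from an n x m block and an m x m block.
    """
    # m landmark tokens via average pooling over the token sequence
    # (one plausible choice; the paper's sampling may differ)
    landmarks = F.adaptive_avg_pool1d(x.transpose(1, 2), num_landmarks).transpose(1, 2)

    A_nm = gaussian_kernel(x, landmarks)          # (b, n, m)
    A_mm = gaussian_kernel(landmarks, landmarks)  # (b, m, m)

    # S ~= A_nm @ pinv(A_mm) @ A_nm^T; multiplying right-to-left keeps
    # every intermediate tensor linear in the sequence length n.
    out = A_nm.transpose(1, 2) @ v                # (b, m, d)
    out = torch.linalg.pinv(A_mm) @ out           # (b, m, d)
    return A_nm @ out                             # (b, n, d)
```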

Notably, the computational robustness of the proposed method rests on estimating the Moore-Penrose inverse with an iterative Newton-Raphson procedure in the forward pass, while its theoretical gradients are computed only once in the backward pass. This design significantly improves the computational efficiency of ViT variants, permitting much longer token sequences at linear complexity and yielding a better trade-off between accuracy and complexity.
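A compact version of such an iteration is sketched below: the Newton-Schulz (Ben-Israel and Cohen) recurrence X_{k+1} = X_k (2I - A X_k), initialized with X_0 = A^T / (||A||_1 ||A||_inf), converges to the Moore-Penrose inverse while avoiding an explicit, potentially unstable exact inversion. The iteration count, and the fact that this sketch simply differentiates through the loop rather than using the paper's once-only theoretical gradient in the backward pass, are assumptions.

```python
import torch

def newton_pinv(A, iters=6):
    """Iterative Moore-Penrose pseudoinverse via the Newton-Schulz recurrence.

    A: (..., m, m) batch of matrices. With X_0 = A^T / (||A||_1 * ||A||_inf),
    the iteration X_{k+1} = X_k (2I - A X_k) converges to pinv(A); a handful
    of iterations usually suffices in practice.
    """
    At = A.transpose(-2, -1)
    norm_1 = A.abs().sum(dim=-2).max(dim=-1).values    # max column sum, ||A||_1
    norm_inf = A.abs().sum(dim=-1).max(dim=-1).values  # max row sum, ||A||_inf
    X = At / (norm_1 * norm_inf)[..., None, None]
    I2 = 2.0 * torch.eye(A.shape[-1], dtype=A.dtype, device=A.device)
    for _ in range(iters):
        X = X @ (I2 - A @ X)
    return X
```

In the low-rank attention sketch above, this routine would stand in for torch.linalg.pinv when inverting the m x m landmark kernel matrix.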

Numerical Results and Empirical Validation

The extensive experimental evaluation of SOFT models on benchmarks such as ImageNet, COCO, and ADE20K demonstrates substantial improvements in computational efficiency and model accuracy compared to existing ViT variants. Numerical results show that SOFT models not only accommodate longer image token sequences but also outperform both state-of-the-art CNNs and ViTs across several visual recognition tasks. Figures in the paper, such as comparisons of top-1 classification accuracy and memory usage, substantiate these claims.

Implications and Future Directions

This research offers both practical and theoretical advancements in the domain of ViTs. The softmax-free self-attention design, built on a Gaussian kernel, removes the obstacle that softmax normalization poses to linearizing attention complexity. This has significant implications for the scalability of ViT architectures, particularly when processing high-resolution visual inputs.

Theoretically, the paper paves the way for new explorations of kernel-based attention mechanisms within transformers, presenting an opportunity to further refine and optimize the computational paradigms underpinning large-scale vision models. The iterative estimation technique, although effective, invites exploration of newer, more computationally efficient algorithms for matrix inversion. Moreover, the asymmetric trade-off between training speed and inference gains observed across SOFT variants suggests fruitful investigations into parallel and distributed computing approaches tailored to these models.

The future of AI, particularly in the domain of image-based recognition, holds promise as the hardware and algorithmic components continue to evolve. Integrating SOFT mechanisms potentially broadens the accessibility of high-throughput, accuracy-intensive models to platforms otherwise constrained by hardware limitations, representing a substantial leap toward real-time, resource-efficient visual recognition systems.

In conclusion, the contribution of "Softmax-free Linear Transformers" is both timely and invaluable, offering a distinct pathway for overcoming inherent transformer limitations through innovative softmax-free attention mechanisms and establishing a robust foundation for future developments in the field.

Authors (6)
  1. Jiachen Lu (16 papers)
  2. Junge Zhang (47 papers)
  3. Xiatian Zhu (139 papers)
  4. Jianfeng Feng (57 papers)
  5. Tao Xiang (324 papers)
  6. Li Zhang (690 papers)
Citations (7)