Fastformer: Additive Attention Can Be All You Need
The paper "Fastformer: Additive Attention Can Be All You Need" presents an advanced approach to address inefficiencies in the Transformer model, particularly its quadratic complexity with respect to input sequence length. This paper introduces Fastformer, a streamlined variant of the Transformer model leveraging additive attention to enhance efficiency in long sequence processing. The Fastformer architecture effectively reduces computational complexity to linearity while maintaining, if not improving, the performance of context modeling in large-scale language tasks.
Introduction and Motivation
Transformers have profoundly influenced NLP and other domains such as computer vision. Despite their success, standard Transformer models scale poorly because self-attention has quadratic complexity in sequence length, a limitation that becomes critical for long inputs. Many strategies for improving Transformer efficiency exist, including sparse attention mechanisms and low-rank approximations, but they often either sacrifice comprehensive global context modeling or remain inefficient on very long sequences.
The Fastformer Architecture
At the core of Fastformer is additive attention, which replaces the pairwise token interactions of standard self-attention. Rather than computing a full attention matrix between every pair of tokens, additive attention summarizes the sequence into global context vectors, which brings the cost down to linear in sequence length.
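As a point of reference, additive attention scores each token with a learnable vector and pools the sequence into a single global vector; computing the scores and the weighted sum costs O(N·d) for N tokens of dimension d. The notation below is an illustrative paraphrase of the paper's general formulation, not a verbatim reproduction:

```latex
% Additive attention pooling of vectors x_1, ..., x_N into a global vector g,
% using a learnable scoring vector w (d = hidden dimension):
\alpha_i = \frac{\exp\!\left(\mathbf{w}^{\top}\mathbf{x}_i / \sqrt{d}\right)}
                {\sum_{j=1}^{N} \exp\!\left(\mathbf{w}^{\top}\mathbf{x}_j / \sqrt{d}\right)},
\qquad
\mathbf{g} = \sum_{i=1}^{N} \alpha_i\, \mathbf{x}_i
```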
- Global Context Modeling: Fastformer first summarizes the query matrix into a global query vector using additive attention. It then models token interactions by taking the element-wise product of this global query vector with each token's key vector, and summarizes the results, again with additive attention, into a context-aware global key vector.
- Efficient Interaction Mechanism: The global key vector is in turn combined with each value vector through another element-wise product, followed by a linear transformation and a residual connection with the query to produce the output. Because every step works with vector summaries and element-wise products rather than pairwise attention scores, the computational cost stays linear in sequence length, which makes long-sequence processing practical (a minimal sketch of the full computation follows this list).
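The following minimal NumPy sketch puts these steps together for a single attention head. It is an illustration of the mechanism as described above, not the authors' implementation; the parameter vectors `w_q`, `w_k` and the output matrix `W_out` are stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fastformer_additive_attention(Q, K, V, w_q, w_k, W_out):
    """Single-head Fastformer-style additive attention (illustrative sketch).

    Q, K, V: (seq_len, d) query/key/value matrices for one head.
    w_q, w_k: (d,) additive-attention parameter vectors.
    W_out: (d, d) output transformation.
    No (seq_len x seq_len) matrix is ever formed, so cost is linear in seq_len.
    """
    d = Q.shape[-1]

    # 1) Summarize the query matrix into one global query vector.
    alpha = softmax(Q @ w_q / np.sqrt(d))      # (seq_len,)
    global_q = alpha @ Q                       # (d,)

    # 2) Element-wise product with each key, then summarize the results
    #    into a context-aware global key vector.
    P = global_q * K                           # (seq_len, d)
    beta = softmax(P @ w_k / np.sqrt(d))       # (seq_len,)
    global_k = beta @ P                        # (d,)

    # 3) Element-wise product with each value, linear transform, and a
    #    residual connection back to the queries gives the output.
    U = global_k * V                           # (seq_len, d)
    return U @ W_out + Q                       # (seq_len, d)

# Toy usage with random inputs and parameters.
rng = np.random.default_rng(0)
N, d = 8, 16
out = fastformer_additive_attention(
    rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, d)),
    rng.normal(size=(d,)), rng.normal(size=(d,)), rng.normal(size=(d, d)),
)
print(out.shape)  # (8, 16)
```

Every operation above touches each token once, which is what keeps the overall cost at O(N·d) per head instead of the O(N²·d) of standard self-attention.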
Experimental Results and Analysis
The paper details comprehensive experiments across various benchmark datasets, encompassing tasks such as sentiment classification, topic prediction, news recommendation, and text summarization. Key findings from these experiments include:
- Efficiency: Fastformer trains and runs inference markedly faster than the other efficient Transformer variants evaluated in the paper, while maintaining competitive accuracy.
- Effectiveness: Despite its reduced complexity, Fastformer achieves results comparable to or better than existing models on long-text modeling, showing that efficient global context modeling need not compromise quality.
Implications and Future Directions
Fastformer has direct implications for the efficiency of large language models, particularly for tasks that require processing long sequences, such as document-level NLP applications or the analysis of extensive user behavior data. Its linear complexity improves the scalability of Transformer models in real-world applications that demand fast processing.
Future work may integrate Fastformer into pre-trained language models to better support long-text natural language tasks. There is also potential to adapt Fastformer to domains beyond NLP, such as e-commerce and ad prediction, where systems must model long sequences of behavioral data.
Overall, Fastformer represents a promising step in the evolution of the Transformer architecture, balancing modeling performance against computational cost. The paper lays groundwork for further research on efficient sequence modeling and for applying attention-based models to much longer inputs.