
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding (2106.12566v2)

Published 23 Jun 2021 in cs.LG, cs.CL, and stat.ML

Abstract: The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since in many state-of-the-art models, relative positional encoding is used as default, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.

Kernelized Attention with Relative Positional Encoding

The paper "Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding" addresses a significant issue in the efficiency of Transformer models, particularly when processing long sequences. Traditional Transformers experience quadratic complexity due to the attention module, which necessitates pairwise correlation computations between sequence positions. The research introduces a novel approach combining kernelized attention with relative positional encoding (RPE) for efficient attention calculation, significantly reducing complexity to O(nlogn)\mathcal{O}(n\log n).

Technical Contributions

1. Kernelized Attention with RPE

The work extends kernelized attention, originally devised to approximate the dot-then-exponentiate softmax function of standard attention, so that it can incorporate relative positional encoding, a feature used by many state-of-the-art models to capture relative positions in sequences. The authors show mathematically that the RPE term forms a Toeplitz matrix, so its product with a vector can be computed with the Fast Fourier Transform (FFT) in $\mathcal{O}(n\log n)$ time, as sketched below.
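
The key structural observation is that an RPE coefficient depends only on the offset $i-j$, so the $n \times n$ bias matrix is Toeplitz and can be applied to a vector via circulant embedding and the FFT. The following is a minimal sketch of that Toeplitz matrix-vector product, not the paper's full attention kernel:

```python
import numpy as np

def toeplitz_matvec(b, x):
    """Compute T @ x in O(n log n), where T[i, j] = b[(i - j) + (n - 1)].

    b has length 2n - 1 and stores the relative-position coefficients for
    offsets -(n - 1), ..., 0, ..., n - 1; x has length n.
    """
    n = x.shape[0]
    # First column of the (2n - 1) x (2n - 1) circulant matrix embedding T:
    # [b_0, b_1, ..., b_{n-1}, b_{-(n-1)}, ..., b_{-1}] in offset notation.
    first_col = np.concatenate([b[n - 1:], b[:n - 1]])
    # Circulant matvec = inverse FFT of (FFT of first column * FFT of padded x).
    y = np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x, 2 * n - 1))
    return y[:n].real

# Sanity check against the dense Toeplitz product.
n = 8
rng = np.random.default_rng(0)
b = rng.normal(size=2 * n - 1)
x = rng.normal(size=n)
T = np.array([[b[(i - j) + (n - 1)] for j in range(n)] for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec(b, x))
```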

2. Addressing Training Instability

The researchers identify training instability in vanilla kernelized attention and propose normalizing the queries and keys to reduce the variance of the approximation error. Unlike relying on norm constraints alone, combining this normalization with a learned RPE term preserves model expressiveness: the model can still produce the sharp attention distributions that are important for performance without depending on large query and key norms. A sketch of the normalization step follows below.
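
As an illustration of the normalization idea, this sketch L2-normalizes queries and keys before applying a Performer-style positive random feature map. The exact feature map, scaling constants, and the point at which the RPE term enters are assumptions of this sketch, not the paper's precise formulation.

```python
import numpy as np

def positive_random_features(X, W):
    """Performer-style positive random features approximating exp(q k^T / sqrt(d))."""
    Xs = X / X.shape[-1] ** 0.25                      # fold in the 1/sqrt(d) scaling
    proj = Xs @ W.T                                   # n x m
    return np.exp(proj - 0.5 * np.sum(Xs**2, axis=-1, keepdims=True)) / np.sqrt(W.shape[0])

def l2_normalize(X, eps=1e-6):
    """Normalize each query/key vector, keeping feature-map variance bounded."""
    return X / (np.linalg.norm(X, axis=-1, keepdims=True) + eps)

n, d, m = 256, 64, 128
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
W = rng.normal(size=(m, d))                           # random projections shared by Q and K

Qf = positive_random_features(l2_normalize(Q), W)     # n x m
Kf = positive_random_features(l2_normalize(K), W)     # n x m
out = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0))[:, None]   # linear-time attention
```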

Numerical Results and Model Performance

Empirically, the proposed NPRF-Transformer with RPE improves over existing methods in both training efficiency and model performance. On language pre-training tasks, it outperforms Linformer and Nyströmformer, achieving higher average GLUE scores. Similarly, on language modeling benchmarks, NPRF-Transformers reach lower perplexity than previously established kernelized and linear-attention models.

In machine translation and image classification tasks, comparable improvements are observed: the proposed model matches or exceeds the BLEU and accuracy scores of standard Transformer architectures and other efficiency-driven alternatives. These results demonstrate stable training, speed-ups in the long-sequence regime, and consistent performance across diverse domains.

Implications and Future Work

This research suggests a promising direction for Transformer architectures: combining kernelized attention with relative positional encoding improves computational efficiency and, at the same time, helps stabilize training. Future work may explore optimizations for sequence generation tasks during inference and broader uses of FFT techniques in attention layers, further extending such models to domains that require processing very long sequences efficiently.

Overall, the paper contributes substantially to the area of efficient machine learning models, making it critical reading for researchers exploring Transformer efficiency improvements and kernel methods in deep learning.

Authors (9)
  1. Shengjie Luo (20 papers)
  2. Shanda Li (15 papers)
  3. Tianle Cai (34 papers)
  4. Di He (108 papers)
  5. Dinglan Peng (2 papers)
  6. Shuxin Zheng (32 papers)
  7. Guolin Ke (43 papers)
  8. Liwei Wang (239 papers)
  9. Tie-Yan Liu (242 papers)
Citations (47)