Long-Short Transformer: Efficient Transformers for Language and Vision (2107.02192v3)

Published 5 Jul 2021 in cs.CV, cs.CL, cs.LG, and cs.MM

Abstract: Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic time and memory complexities with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters than previous method, while being faster and is able to handle 3x as long sequences compared to its full-attention version on the same hardware. On ImageNet, it can obtain the state-of-the-art results (e.g., a moderate size of 55.8M model solely trained on 224x224 ImageNet-1K can obtain Top-1 accuracy 84.1%), while being more scalable on high-resolution images. The source code and models are released at https://github.com/NVIDIA/transformer-ls .

A Comprehensive Overview of Long-Short Transformer: Efficient Transformers for Language and Vision

Introduction

The paper presents the Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism designed to overcome the quadratic time and memory cost that standard self-attention incurs on long sequences in both language and vision tasks. Transformer-based models, while highly successful in natural language processing and computer vision, become prohibitively expensive as input length grows. Transformer-LS lowers this barrier by combining a long-range attention mechanism with a short-term attention mechanism, both with linear complexity in sequence length, and delivers strong performance across a range of tasks.

Methodology

The authors propose the Transformer-LS, featuring a sophisticated integration of two types of attention processes:

  1. Long-Range Attention via Dynamic Projection: Transformer-LS models distant correlations through a low-rank projection of the keys and values. The crucial innovation is that, unlike the fixed low-rank projections used in methods such as Linformer, the projection is computed dynamically from the content of the input sequence. This makes the representation robust to semantic-preserving positional variations and better suited to inputs whose lengths vary widely (a sketch of the idea follows this list).
  2. Short-Term Attention via Sliding Window: A segment-wise sliding-window attention captures fine-grained local correlations among nearby tokens while keeping computation and memory linear in sequence length.
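To make the long-range branch concrete, below is a minimal PyTorch sketch of attention with a dynamic low-rank projection, written from the description above rather than from the released NVIDIA/transformer-ls code. The class name DynamicProjectionAttention, the proj_len argument, and the exact placement of the softmax over positions are illustrative assumptions; the short-term branch (sliding-window attention over nearby keys) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicProjectionAttention(nn.Module):
    """Minimal sketch of long-range attention with a dynamic low-rank projection.
    Names and shapes are illustrative assumptions, not the reference implementation."""

    def __init__(self, dim, num_heads=8, proj_len=32):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        self.dh = dim // num_heads
        self.r = proj_len                         # projection length r << sequence length
        self.qkv = nn.Linear(dim, 3 * dim)
        self.to_p = nn.Linear(self.dh, proj_len)  # content-dependent projection scores
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, N, head_dim)
        q, k, v = (t.view(B, N, self.h, self.dh).transpose(1, 2) for t in (q, k, v))

        # dynamic projection: the mixing weights depend on the keys themselves
        p = F.softmax(self.to_p(k), dim=-2)              # (B, h, N, r), normalized over positions
        k_bar = torch.einsum('bhnr,bhnd->bhrd', p, k)    # (B, h, r, head_dim)
        v_bar = torch.einsum('bhnr,bhnd->bhrd', p, v)    # (B, h, r, head_dim)

        # attention against only r projected key/value pairs: O(N * r) instead of O(N^2)
        attn = F.softmax(q @ k_bar.transpose(-2, -1) / self.dh ** 0.5, dim=-1)
        out = attn @ v_bar                               # (B, h, N, head_dim)
        return self.out(out.transpose(1, 2).reshape(B, N, D))
```

Because each query attends to only r projected key/value pairs, cost grows linearly with sequence length; in the full model these projected pairs are combined with each query's sliding-window keys and values.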

Moreover, a dual layer-normalization strategy (DualLN) is introduced to correct the scale mismatch between the embeddings produced by the long-range and short-term attention branches, improving how the two complementary mechanisms are aggregated (sketched below).
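The aggregation step can be pictured with the following minimal sketch: the local (sliding-window) branch and the global (projected) branch each produce their own key/value tensors, which are rescaled by separate LayerNorms before being concatenated along the sequence axis. The module name, the shapes, and the use of a single LayerNorm per branch for both keys and values are simplifying assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class DualNormKV(nn.Module):
    """Sketch of dual normalization (DualLN): separate LayerNorms rescale the
    local-window and projected global key/value embeddings so that neither
    branch dominates the shared softmax. Illustrative only."""

    def __init__(self, head_dim):
        super().__init__()
        self.ln_local = nn.LayerNorm(head_dim)   # for sliding-window K/V
        self.ln_global = nn.LayerNorm(head_dim)  # for dynamically projected K/V

    def forward(self, k_local, v_local, k_global, v_global):
        # k_local, v_local: (..., W, head_dim)  keys/values inside a query's window
        # k_global, v_global: (..., r, head_dim) low-rank projected keys/values
        k = torch.cat([self.ln_local(k_local), self.ln_global(k_global)], dim=-2)
        v = torch.cat([self.ln_local(v_local), self.ln_global(v_global)], dim=-2)
        return k, v  # queries attend over the combined (W + r) key/value pairs
```

Each query then computes a single softmax over the concatenated W + r positions, so local and long-range evidence compete on a common scale.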

Results

  • Numerical Evaluation: Transformer-LS outperforms state-of-the-art models across multiple benchmarks. In autoregressive language modeling on enwik8, it achieves a test BPC of 0.97 with half the parameters of previous methods, while running faster and handling sequences three times longer on equivalent hardware.
  • Bidirectional and Autoregressive Models: The same mechanism applies to both bidirectional and autoregressive settings without additional complexity, with strong results on the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
  • Robustness and Scalability: The transformer displays impressive robustness against perturbations and positional changes. Additionally, it shows scalability to high-resolution images, achieving a top-1 accuracy of 84.1% on ImageNet-1K.

Implications

This work has notable implications for the design of transformer architectures. The dynamic projection not only yields computational savings but also improves the robustness of the learned representations, which matters for tasks with highly variable input lengths. In practical terms, Transformer-LS handles complex, long-sequence tasks with reduced resource demands, an efficiency that is crucial for real-world applications where hardware limitations persist.

Future Developments

Looking forward, the Transformer-LS model promises to impact further research in transformers by paving the way for the development of more adaptable and efficient architectures. Potential future directions include exploring its application in other domains requiring robust long-sequence modeling, such as video processing and long-form dialogue systems. Additionally, the integration of this model into more advanced, domain-specific architectures could push performance further, particularly in areas like semantic segmentation and high-resolution object detection.

In conclusion, Transformer-LS marks a significant step forward in transformer efficiency and adaptability, offering a blueprint for future architectural innovations in both NLP and computer vision domains.

Authors (7)
  1. Chen Zhu (103 papers)
  2. Wei Ping (51 papers)
  3. Chaowei Xiao (110 papers)
  4. Mohammad Shoeybi (60 papers)
  5. Tom Goldstein (226 papers)
  6. Anima Anandkumar (236 papers)
  7. Bryan Catanzaro (123 papers)
Citations (122)