Lite Transformer with Long-Short Range Attention (2004.11886v1)

Published 24 Apr 2020 in cs.CL

Abstract: Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires enormous amount of computations to achieve high performance, which makes it not suitable for mobile applications that are tightly constrained by the hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of transformer base model by 2.5x with 0.3 BLEU score degradation. Combining with pruning and quantization, we further compressed the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years. Code has been made available at https://github.com/mit-han-lab/lite-transformer.

Overview of Lite Transformer with Long-Short Range Attention

The paper presents the Lite Transformer, an efficient architecture optimized for mobile NLP applications constrained by computational resources. It introduces a novel mechanism, Long-Short Range Attention (LSRA), which splits the model into two specialized groups: one handles local context with convolution, while the other captures long-distance relationships with attention. This targeted specialization consistently improves performance over the vanilla Transformer across several NLP tasks, including machine translation, abstractive summarization, and language modeling.
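
To make the idea concrete, here is a minimal PyTorch sketch of an LSRA-style block. It is an illustrative simplification rather than the authors' implementation: the local branch uses a plain depthwise 1-D convolution (the paper uses specialized lightweight convolutions), the global branch uses standard multi-head self-attention, and all module and parameter names are invented for this example.

```python
import torch
import torch.nn as nn

class LongShortRangeAttention(nn.Module):
    """Illustrative LSRA-style block: the input is split along the channel
    dimension; one half models local context with a depthwise convolution,
    the other half models long-range dependencies with self-attention."""

    def __init__(self, d_model: int, n_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        assert d_model % 2 == 0
        self.d_half = d_model // 2
        # Local branch: depthwise 1-D convolution over the sequence dimension.
        self.local_conv = nn.Conv1d(
            self.d_half, self.d_half, kernel_size,
            padding=kernel_size // 2, groups=self.d_half,
        )
        # Global branch: standard multi-head self-attention on the other half.
        self.attn = nn.MultiheadAttention(self.d_half, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        local_in, global_in = x.split(self.d_half, dim=-1)

        # Conv1d expects (batch, channels, seq_len).
        local_out = self.local_conv(local_in.transpose(1, 2)).transpose(1, 2)

        global_out, _ = self.attn(global_in, global_in, global_in)

        # Concatenate the two specialized branches and mix them.
        return self.out_proj(torch.cat([local_out, global_out], dim=-1))


if __name__ == "__main__":
    x = torch.randn(2, 10, 128)        # (batch, seq_len, d_model)
    block = LongShortRangeAttention(128)
    print(block(x).shape)              # torch.Size([2, 10, 128])
```

The channel split keeps the total width unchanged, so each branch works on a narrower representation and can specialize without adding computation.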

Key Contributions

  1. Long-Short Range Attention (LSRA): The paper proposes LSRA to address inefficiencies in the traditional Transformer architecture, particularly in mobile settings. By segregating the tasks of local and global context modeling, LSRA enhances the Transformer’s capacity to handle different types of dependencies with reduced computational effort.
  2. Efficiency and Performance Gains: On machine translation, the Lite Transformer achieves notable BLEU improvements with significantly fewer Mult-Adds. Specifically, it outperforms the baseline Transformer on WMT'14 English-French by 1.2 BLEU under a 500M-MAC budget and by 1.7 BLEU under a tighter 100M-MAC budget.
  3. Comparison with AutoML Approaches: The manually designed Lite Transformer achieves a significant efficiency advantage over AutoML-derived models such as the Evolved Transformer. It attains a 0.5 higher BLEU score in the mobile setting without the costly architecture search, which requires more than 250 GPU years.
  4. Compression Techniques: The paper also combines LSRA with model compression techniques, namely pruning and quantization, achieving an 18.2× reduction in model size with negligible degradation in BLEU score; a generic sketch of this kind of post-training compression follows the list.
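
The summary does not detail the authors' exact compression pipeline, so the following is only a generic sketch of post-training magnitude pruning plus dynamic INT8 quantization using standard PyTorch utilities, applied to a toy stand-in model rather than the actual Lite Transformer; the pruning ratio and module choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a transformer sub-module; the real target would be Lite Transformer.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# 1) Magnitude pruning: zero out the smallest 50% of weights in each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make the pruning permanent

# 2) Dynamic INT8 quantization: store linear-layer weights in 8 bits.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])
```

Pruning removes redundant weights and quantization shrinks the storage of the remaining ones, which is how the two techniques compound into the large overall size reduction reported in the paper.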

Experimental Validation

The authors provide comprehensive experimental evidence across multiple NLP benchmarks. The Lite Transformer not only outperforms comparably sized models in machine translation but also achieves competitive results in abstractive summarization and language modeling while adhering to strict computational constraints, highlighting its versatility across tasks and input lengths.

Theoretical and Practical Implications

The introduction of LSRA signifies a shift towards more specialized attention mechanisms, which could pave the way for further research into efficient and targeted architecture designs. It highlights the potential benefits of incorporating domain-specific insights into model design rather than relying on broad exploratory searches.

Practically, the Lite Transformer offers a viable solution for deploying NLP models on edge devices without compromising performance. This can facilitate real-time applications in environments where computational resources and power are limited, like mobile phones and IoT devices.

Future Directions

Future research may focus on further refining LSRA for varied NLP tasks and extending its principles to other domains where context modeling is crucial. Additionally, integrating LSRA with other efficiency-boosting techniques could enhance its deployment on even more constrained hardware platforms.

In conclusion, the Lite Transformer with Long-Short Range Attention introduces a promising direction for efficient NLP model design, demonstrating the effectiveness of merging domain-specific architectural insights with conventional model structures to optimize performance under tight resource constraints.

Authors (5)
  1. Zhanghao Wu (7 papers)
  2. Zhijian Liu (41 papers)
  3. Ji Lin (47 papers)
  4. Yujun Lin (23 papers)
  5. Song Han (155 papers)
Citations (281)