Luna: Linear Unified Nested Attention (2106.01540v2)

Published 3 Jun 2021 in cs.LG and cs.CL

Abstract: The quadratic computational and memory complexities of the Transformer's attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. As compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform attention operation linearly, while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods including the full-rank attention and other efficient sparse and dense attention methods.

Luna: Linear Unified Nested Attention

The paper proposes Luna, a novel attention mechanism for Transformers that addresses the quadratic computational and memory cost of standard softmax attention. By approximating softmax attention with two nested linear attention functions, Luna reduces both time and space complexity from quadratic to linear in sequence length, improving scalability for tasks involving long sequences.
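
To make the pack-and-unpack idea concrete, below is a minimal single-head PyTorch sketch of the mechanism described above. The class name LunaAttentionSketch, the proj_len parameter, and the omission of query/key/value projections, multiple heads, and the causal variant are simplifications for illustration, not the authors' FairSeq implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Standard scaled dot-product attention (single head, batch-first)."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

class LunaAttentionSketch(torch.nn.Module):
    """Illustrative pack-and-unpack attention, linear in the sequence length n.

    proj_len is the fixed length l of the extra sequence; single-head and
    projection-free by design, purely for exposition.
    """
    def __init__(self, d_model: int, proj_len: int = 16):
        super().__init__()
        # Learnable fixed-length "extra" sequence P (l x d), shared across the batch.
        self.p = torch.nn.Parameter(torch.randn(proj_len, d_model) / d_model ** 0.5)

    def forward(self, x: torch.Tensor):
        # x: (batch, n, d)
        p = self.p.unsqueeze(0).expand(x.size(0), -1, -1)  # (batch, l, d)
        # Pack: P attends over X, compressing n tokens into l context vectors. Cost O(l * n).
        y_p = attention(p, x, x)                            # (batch, l, d)
        # Unpack: X attends over the packed context to produce a length-n output. Cost O(n * l).
        y_x = attention(x, y_p, y_p)                        # (batch, n, d)
        # y_p is also returned so a subsequent layer can use it as its extra input.
        return y_x, y_p
```

As a quick shape check, for an input of shape (2, 4096, 64) with proj_len=16, the largest attention matrix is 4096 x 16 rather than 4096 x 4096, which is where the linear scaling comes from.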

Key Contributions

  1. Linearized Attention Mechanism: Luna introduces a linear unified nested attention mechanism built from two linear attention functions: a pack attention that compresses the input to a fixed length and an unpack attention that restores the original length. Together they reduce the cost of attention without discarding contextual information (a brief complexity sketch follows this list).
  2. Novel Input Representation: Luna incorporates an additional fixed-length sequence as a part of the attention mechanism. This sequence helps in maintaining adequate contextual information while allowing the attention operation to be linear.
  3. Competitive Results across Benchmarks: Experiments on three key benchmarks (long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining) demonstrate Luna's effectiveness. The results show that it performs competitively with, or even better than, strong baseline models, including those using full-rank attention.
  4. Release of Implementation: The paper contributes to the open research community by providing the implementation of Luna, available for use with the FairSeq library.
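
The linear-complexity claim in item 1 can be written out briefly. The notation below (X for the length-n input, P for the learnable length-l extra sequence, d for the model dimension) is this summary's shorthand for the paper's pack/unpack description:

```latex
% Pack: the fixed-length sequence P attends over the input X.
% Unpack: X attends over the packed summary Y_P.
\begin{aligned}
Y_P &= \mathrm{Attn}(P, X) \in \mathbb{R}^{l \times d}, & \text{cost} &= O(l\,n\,d) \\
Y_X &= \mathrm{Attn}(X, Y_P) \in \mathbb{R}^{n \times d}, & \text{cost} &= O(n\,l\,d)
\end{aligned}
```

With l fixed, both stages are linear in n, compared with the O(n^2 d) cost of full softmax attention.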

Experimental Analysis and Implications

The experimental results highlight several points. First, Luna achieves substantial gains in computational efficiency while maintaining, and in some cases improving, task performance. On the Long Range Arena (LRA) benchmark, Luna outperforms efficient-attention approaches such as Linformer, Performer, and Reformer, particularly on long sequences. The learnable extra sequence also allows Luna to handle variable-length inputs and autoregressive tasks, which are important for many real-world applications.

In neural machine translation, Luna achieves competitive BLEU scores, performing close to conventional Transformer models despite its leaner attention. This suggests Luna can serve as a practical alternative when computational resources are limited but strong performance is still required.

In large-scale masked language modeling, Luna achieves performance comparable to models such as RoBERTa when trained on similar pretraining data. Its memory and speed efficiency further support its suitability for pretraining that must scale to very large corpora.

Future Directions

Luna's design suggests several avenues for future research. One is combining Luna's nested linear attention with recurrent methods so that the packed sequence acts as a global memory shared across sequence segments. Another is applying Luna to domains that require processing lengthy documents, such as document summarization.

In summary, Luna’s approach to linearizing attention mechanisms presents a viable path forward in enhancing the scalability of Transformers. Its ability to handle long sequences efficiently without a significant trade-off in performance marks an important step in the quest for more resource-efficient neural network architectures.

Authors (7)
  1. Xuezhe Ma (50 papers)
  2. Xiang Kong (31 papers)
  3. Sinong Wang (45 papers)
  4. Chunting Zhou (36 papers)
  5. Jonathan May (76 papers)
  6. Hao Ma (116 papers)
  7. Luke Zettlemoyer (225 papers)
Citations (111)