
Parallelizing Linear Transformers with the Delta Rule over Sequence Length (2406.06484v5)

Published 10 Jun 2024 in cs.LG and cs.CL

Abstract: Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks. We also experiment with two hybrid models which combine DeltaNet layers with (1) sliding-window attention layers every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

The paper "Parallelizing Linear Transformers with the Delta Rule over Sequence Length" introduces an algorithm for training linear transformers equipped with the delta rule, designed to be efficient on modern hardware. The underlying goal is to address a key limitation of linear transformers: their underperformance relative to conventional transformer architectures on recall-intensive tasks, which are crucial for practical applications such as retrieval-augmented generation.

Introduction and Challenges

Traditional transformer models rely on softmax attention, whose cost grows quadratically with sequence length. Linear transformers mitigate this by replacing the exponential (softmax) kernel with a simple dot product, which allows the model to be viewed as a linear RNN with a matrix-valued hidden state. Despite their reduced computational complexity, linear transformers frequently fall short on tasks requiring strong associative recall, where more expressive variants such as DeltaNet have shown promise.
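Viewed this way, the entire attention computation reduces to a simple recurrence. The following is a minimal sketch in standard query/key/value notation (a simplification, not a verbatim transcription of the paper's equations):

```latex
% Linear attention as an RNN with a matrix-valued hidden state S_t:
% the state accumulates outer products of values and keys, and the
% output is a read-out of the state with the current query.
S_t = S_{t-1} + v_t k_t^{\top}, \qquad o_t = S_t q_t
```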

Previous implementations of DeltaNet processed the sequence step by step, which is hardware-inefficient and impedes scaling to larger datasets and models. The proposed solution reparameterizes DeltaNet as a matrix-valued RNN whose transition is a generalized Householder transformation, enabling parallelization over the sequence length and reducing memory consumption by using the WY representation to handle products of Householder matrices.
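Concretely, the delta rule replaces the purely additive update above with one that first erases the value currently associated with k_t before writing the new one. The sketch below, in the same simplified notation with β_t a learned writing strength, shows the update and the rearrangement that exposes the generalized Householder structure:

```latex
% DeltaNet's delta-rule update and its generalized-Householder form:
S_t = S_{t-1} + \beta_t \left( v_t - S_{t-1} k_t \right) k_t^{\top}
    = S_{t-1} \left( I - \beta_t k_t k_t^{\top} \right) + \beta_t v_t k_t^{\top}
```

Because each factor I − β_t k_t k_tᵀ is a generalized Householder matrix, products of these factors admit the compact WY representation that the algorithm relies on.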

Methodology

The authors design a hardware-efficient approach that parallelizes the forward and backward passes over the sequence length, covering a general class of models whose recurrence combines an associative matrix-valued update with outer-product-based additions. This is achieved by reparameterizing DeltaNet through the WY representation, which avoids materializing the large matrix-valued hidden state at every time step and thus provides an efficient mechanism for the recall-oriented delta-rule update. The paper provides detailed derivations demonstrating how this parallelization can be achieved without compromising the model's theoretical or computational integrity.
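To make the mechanism concrete, the sketch below contrasts a naive sequential implementation of the delta-rule recurrence with a mathematically equivalent form that obtains the per-step "pseudo-values" through a single triangular solve, so a whole block of positions can be processed with dense matrix multiplications. It is an illustrative simplification (single head, zero initial state, no chunking), not the paper's optimized kernel:

```python
import torch

def deltanet_sequential(q, k, v, beta):
    """Reference recurrence. q, k, v: (T, d); beta: (T,)."""
    T, d = q.shape
    S = torch.zeros(d, d)                      # matrix-valued hidden state
    outs = []
    for t in range(T):
        # Delta rule: nudge the memory's prediction for k_t toward v_t.
        S = S + beta[t] * torch.outer(v[t] - S @ k[t], k[t])
        outs.append(S @ q[t])
    return torch.stack(outs)

def deltanet_wy_style(q, k, v, beta):
    """Same outputs via WY-style pseudo-values (assumes zero initial state)."""
    T, d = q.shape
    # u_t = beta_t * v_t - beta_t * sum_{i<t} <k_t, k_i> u_i,
    # i.e. (I + A) U = diag(beta) V with A strictly lower triangular.
    A = torch.tril(beta[:, None] * (k @ k.T), diagonal=-1)
    U = torch.linalg.solve_triangular(
        torch.eye(T) + A, beta[:, None] * v, upper=False)
    # o_t = S_t q_t = sum_{i<=t} <q_t, k_i> u_i  (causal mask incl. diagonal)
    return torch.tril(q @ k.T) @ U
```

Both functions produce the same outputs up to numerical precision; the second replaces the length-T loop with dense matrix operations, which is the kind of structure the paper's chunkwise algorithm exploits at scale.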

Experimental Results

The empirical evaluation of DeltaNet, scaled to a language modeling setting with 1.3B parameters trained on 100B tokens, demonstrates superior performance compared to recent linear-time models such as Mamba and GLA. The authors observe lower perplexity and improved zero-shot performance on downstream tasks, particularly on recall-intensive benchmarks such as MQAR, RegBench, and MAD. The paper also investigates hybrid architectures that combine DeltaNet layers with sliding-window or global attention layers, finding that these configurations surpass strong transformer baselines on several empirical measures.

Implications and Future Directions

Parallelizing the DeltaNet recurrence has significant implications for the practical scalability of linear transformers, particularly for extending their use to large-scale language modeling tasks that demand associative recall. The approach points toward more efficient autoregressive sequence modeling over long sequences and offers a template that may carry over to related state-space modeling strategies.

Future research could expand the parameterization of the recall mechanism beyond the existing DeltaNet framework. Exploring alternative structured matrices and associative operations within this newly established framework could improve both the capacity and the efficiency of linear transformer models.

In conclusion, this paper bridges the computational constraints of linear transformers with the demanding requirements of contemporary language modeling, laying a foundation for further refinements and innovations in transformer architectures with linear attention mechanisms.

Authors (5)
  1. Songlin Yang (42 papers)
  2. Bailin Wang (34 papers)
  3. Yu Zhang (1399 papers)
  4. Yikang Shen (62 papers)
  5. Yoon Kim (92 papers)
Citations (14)