Linearizing Large Language Models (2405.06640v1)

Published 10 May 2024 in cs.CL

Abstract: Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training LLMs requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models. Our code and models can be found at https://github.com/TRI-ML/linear_open_lm.

Understanding SUPRA: A New Approach to Linearize Pre-trained Transformers into RNNs

Overview of the Proposed Method

Scalable UPtraining for Recurrent Attention (SUPRA) is a cost-effective way to convert pre-trained transformers into Recurrent Neural Networks (RNNs). The approach aims to combine the strengths of both architectures: the strong pre-training and benchmark performance of transformers, and the cheap, fixed-memory inference of RNNs.

The Challenge with Linear Transformers

Conventional transformers train efficiently because attention parallelizes across the sequence, but their inference cost grows with context length: the key-value cache expands with every generated token. RNNs, by contrast, maintain a fixed-size hidden state and are therefore generally more memory-efficient at inference.

Linear transformers were introduced to keep the parallel training of standard transformers while gaining the memory efficiency of a recurrent state, but they typically fall short of compute-matched conventional transformers on demanding natural language processing benchmarks.
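
To make the trade-off concrete, here is the standard kernelized formulation of linear attention, sketched in the general form used across this family of models rather than SUPRA's exact parameterization: the softmax is replaced by a feature map φ, which lets attention be carried as a fixed-size running state.

```latex
% Causal softmax attention for token t attends over the entire prefix:
%   y_t = \sum_{i \le t} \mathrm{softmax}\!\left(q_t^{\top} k_i\right) v_i
% Linear attention replaces \exp(q^{\top} k) with \phi(q)^{\top} \phi(k),
% which factors the sum into a recurrent state:
y_t = \frac{\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i)\, v_i^{\top}}
           {\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i)}
    = \frac{\phi(q_t)^{\top} S_t}{\phi(q_t)^{\top} z_t},
\qquad
S_t = S_{t-1} + \phi(k_t)\, v_t^{\top}, \quad
z_t = z_{t-1} + \phi(k_t).
```

Because S_t and z_t have a fixed size, each decoding step costs constant memory rather than growing with the context, which is the inference advantage described above. As the Process section below notes, SUPRA departs from this classic formulation by normalizing the output with GroupNorm instead of the running denominator z_t.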

Enter SUPRA: A Hybrid Training Approach

SUPRA takes a middle path through uptraining: continuing to train an existing model after modifying its architecture. The method starts from a strong pre-trained transformer and adjusts it so that, at inference time, it can be run as an RNN.

The Process

  1. Linearization Technique: Replace the softmax normalization in attention with GroupNorm, which normalizes the attention output directly and allows attention to be computed in a linear, recurrent form (a minimal code sketch follows this list).
  2. Positional Encoding Adjustment: Use rotary positional embeddings, which carry over cleanly to the recurrent formulation and avoid the problems that absolute positional encodings tend to cause in RNNs.
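
The sketch below shows what one recurrent decode step might look like after these two changes: rotary embeddings applied to the query and key, a fixed-size state updated with the current key and value, and GroupNorm in place of the softmax denominator. All names here (recurrent_attention_step, the elu-based feature map phi, the toy dimensions) are hypothetical illustrations, not the authors' implementation, which is available in the linked repository; in SUPRA the projections that produce q, k, and v are inherited from the pre-trained transformer and refined during uptraining.

```python
import torch
import torch.nn.functional as F

def rotary(x, position, base=10000.0):
    """Apply rotary position embeddings (RoPE) to the last dimension of x."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = position * freqs                                     # (half,)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def phi(x):
    """Feature map standing in for softmax; elu+1 is one common, simple choice."""
    return F.elu(x) + 1.0

@torch.no_grad()
def recurrent_attention_step(q, k, v, state, group_norm, position):
    """One decode step of linear attention in recurrent form.

    q, k, v: (num_heads, head_dim) projections for the current token.
    state:   (num_heads, head_dim, head_dim) running sum of phi(k) v^T.
    """
    q = rotary(q, position)
    k = rotary(k, position)
    state = state + torch.einsum('hd,he->hde', phi(k), v)   # fixed-size recurrent state
    y = torch.einsum('hd,hde->he', phi(q), state)            # read the state out with the query
    # GroupNorm (one group per head) replaces the softmax denominator.
    y = group_norm(y.reshape(1, -1)).reshape(y.shape)
    return y, state

# Toy usage: 4 heads of dimension 16, random projections standing in for real ones.
num_heads, head_dim = 4, 16
gn = torch.nn.GroupNorm(num_groups=num_heads, num_channels=num_heads * head_dim)
state = torch.zeros(num_heads, head_dim, head_dim)
for t in range(8):
    q, k, v = (torch.randn(num_heads, head_dim) for _ in range(3))
    y, state = recurrent_attention_step(q, k, v, state, gn, position=torch.tensor(float(t)))
print(y.shape)  # torch.Size([4, 16])
```

Note that the state tensor keeps the same shape no matter how many tokens have been processed, which is what keeps per-token inference cost flat; the GroupNorm takes over the role the softmax denominator plays in standard attention.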

Because it starts from an already-trained transformer, SUPRA sidesteps the main cost facing other linear approaches: it uses only a small fraction of the original training tokens for uptraining (the paper puts the cost at about 5% of pre-training) while remaining competitive.

Testing the Performance

The uptrained models were rigorously evaluated across several benchmarks:

  • Standard Language Benchmarks: SUPRA models were competitive with leading pre-trained recurrent models while using notably less data and compute.
  • In-Context and Long-Context Tasks: The linearized models showed persistent shortfalls on in-context learning and long-context tasks, underscoring a gap that remains relative to conventional transformers.

The Implications and Future Prospects

SUPRA changes how large pre-trained models can be converted into more efficient forms without an enormous compute overhead. Practically, this could make recurrent inference viable again for applications where inference cost and resource efficiency are critical.

On the Theoretical Side

SUPRA suggests there is a rich vein to explore in hybrid approaches that reuse transformer pre-training inside recurrent architectures, setting the stage for future research on optimizing such models.

Looking Forward

While SUPRA demonstrates a promising approach, the models' struggles with long-context tasks point to a need for further refinement. Innovations such as richer gating mechanisms and alternative normalization techniques could help close the observed performance gap.

Conclusion

SUPRA presents an intriguing prospect in the quest for efficient AI modeling, offering a new toolkit for those looking to harness the strengths of transformers and RNNs alike. With continued development, SUPRA or its derivatives might soon become a staple in reducing computational costs while sustaining high performance across a range of AI tasks.

Authors (7)
  1. Jean Mercat (15 papers)
  2. Igor Vasiljevic (20 papers)
  3. Sedrick Keh (8 papers)
  4. Kushal Arora (13 papers)
  5. Achal Dave (31 papers)
  6. Adrien Gaidon (84 papers)
  7. Thomas Kollar (27 papers)
Citations (9)

HackerNews

  1. Linearizing Large Language Models (2 points, 0 comments)