RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale (2505.03005v2)

Published 5 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper

Summary

An Analysis of "RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale"

The paper "RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale" explores the optimization and conversion of softmax attention transformers into more computationally efficient linear attention models. The paper introduces the RADLADS protocol and demonstrates its efficacy in transforming models such as the Qwen2.5 through a meticulous process involving three main steps: attention weight transfer, attention hidden state alignment, and knowledge distillation. Additionally, the research presents new architectures, RAD-RWKV6 and RAD-RWKV7, which facilitate the conversion and enhance the inference speed without compromising performance.

Methodology and Results

The RADLADS protocol is designed to minimize the computational burden of converting large-scale LLMs rather than retraining them from scratch. The method requires only 350-700 million tokens, a tiny fraction (less than 0.005%) of the tokens used to train the original teacher models. This efficiency translates into a substantial cost reduction, under $2,000 to convert a 72B-parameter model, while inference quality remains close to that of the source transformer. The resulting models demonstrate state-of-the-art performance across standard benchmarks for linear attention models of their size, as shown in the paper's comparative analyses.
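
As a back-of-envelope check on the token-budget claim, the following assumes roughly 18 trillion pretraining tokens for the Qwen2.5 teachers, a figure from Qwen's public reporting rather than from this paper:

```python
# Back-of-envelope check of the token-budget claim. The 18T teacher figure is an
# assumption based on publicly reported Qwen2.5 pretraining data, not this paper.
conversion_tokens = 700e6    # upper end of the 350-700M token range
teacher_tokens = 18e12       # approximate Qwen2.5 pretraining token count (assumed)
print(f"{conversion_tokens / teacher_tokens:.6%}")  # ~0.003889%, under the 0.005% bound
```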

Architectural Innovations and Implications

The RAD-RWKV6 ("RADFinch") and RAD-RWKV7 ("RADGoose") architectures are evolutionary refinements of existing RWKV designs. By integrating elements such as a Gated Linear Attention kernel and streamlining features such as token shift, they simplify the distillation path from softmax to linear attention. These changes also let converted models run inference faster than they would with the original RWKV designs, a meaningful step toward making linear attention practical for long-sequence modeling.
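
The practical payoff of the linear-attention form shows up in the decode step: the growing KV cache of softmax attention is replaced by a fixed-size recurrent state. The sketch below illustrates a generic gated linear attention recurrence, not the exact RAD-RWKV6/RAD-RWKV7 kernel.

```python
import torch

def gla_decode_step(state, q, k, v, gate):
    # One decoding step of a generic gated linear attention recurrence
    # (illustrative; not the exact RAD-RWKV kernel).
    #   state: (d_k, d_v) recurrent state that replaces the softmax KV cache
    #   q, k : (d_k,) query and key; v: (d_v,) value
    #   gate : (d_k,) per-channel decay in (0, 1)
    state = gate.unsqueeze(-1) * state + torch.outer(k, v)
    return q @ state, state  # output: (d_v,)

# Memory and per-token compute stay constant with sequence length at inference.
d_k, d_v = 64, 64
state = torch.zeros(d_k, d_v)
for _ in range(8):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    gate = torch.sigmoid(torch.randn(d_k))
    out, state = gla_decode_step(state, q, k, v, gate)
```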

Performance Evaluation

The research provides comprehensive benchmarking data comparing RADLADS-converted models against other contemporary conversion approaches. In terms of accuracy ratio across benchmarks such as LAMBADA and MMLU, RADLADS consistently outperforms its counterparts. In particular, the QRWKV6 and QRWKV7 variants achieve strong accuracy on knowledge and reasoning benchmarks, demonstrating the effectiveness of the distillation and conversion process.
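
The accuracy ratio referenced above can be read as the converted model's benchmark score divided by the teacher's, averaged over benchmarks; the minimal illustration below uses hypothetical numbers, not results from the paper.

```python
def accuracy_ratio(converted_scores, teacher_scores):
    # Average of per-benchmark (converted / teacher) score ratios (illustrative aggregation).
    return sum(c / t for c, t in zip(converted_scores, teacher_scores)) / len(teacher_scores)

# Hypothetical scores for illustration only; see the paper's tables for actual results.
print(f"{accuracy_ratio([0.71, 0.63, 0.80], [0.74, 0.66, 0.82]):.3f}")
```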

Speculations and Future Directions

The results suggest promising avenues for the future application of linear attention architectures, potentially broadening access to high-performance LLMs by reducing operational costs and compute requirements. The approach also opens possibilities for developing novel attention mechanisms that align more closely with human-like reasoning, given the efficiency and scalability of linear models. Continued work on dataset optimization and further refinement of architectural details could yield additional performance gains, helping bridge the gap between resource-heavy research environments and practical, real-world AI applications.

Conclusion

The paper “RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale” outlines a robust framework for efficiently converting softmax attention models to linear attention. With its combined architectural advances and streamlined distillation protocol, RADLADS sets a precedent for future research in model efficiency and scalability, suggesting that the future of AI model deployment may well depend on innovations in attention mechanisms and their efficient implementation.
