An Analysis of the ARWKV Model: RNN-Attention LLM Integration
The paper "ARWKV: Pretrain is not what we need, an RNN-Attention-Based LLM Born from Transformer" presents a significant exploration into the synergy between Recurrent Neural Networks (RNNs) and attention mechanisms traditionally associated with Transformer architectures. It explores the development and implications of the ARWKV model, which combines aspects of both RNNs and Transformers. The authors propose that the expressive power of attention mechanisms can be successfully integrated into RNN-based architectures, offering an alternative to traditional Transformer models, especially for contexts requiring efficient state-tracking performance.
Architectural Insights
The paper begins by surveying recent advances in linear RNNs (LRNNs) and their potential as competitive alternatives to Transformers. In particular, the RWKV-7 architecture has demonstrated strong state-tracking capabilities, surpassing conventional Transformer models on this axis. ARWKV builds on this foundation by transplanting RWKV-7's attention-style mechanism into a pretrained Transformer's architecture.
At the core of the approach is the time-mixing module derived from RWKV-7, which replaces the self-attention sublayer in each Transformer block while the rest of the block is retained. This transition reflects a deliberate attempt to combine the high-dimensional expressiveness of Transformers with the sequential coherence and efficiency of RNNs. Because the time-mixing module processes tokens recurrently, it avoids the quadratic complexity of softmax attention, promising more scalable inference.
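To make the architectural swap concrete, the sketch below shows a recurrent, linear-time "time mixing" stand-in replacing the self-attention sublayer of a pre-norm Transformer block. This is a minimal illustration of the general shape of the idea, not the exact RWKV-7 formulation (which uses a more sophisticated generalized delta-rule update); the module and class names are hypothetical.

```python
import torch
import torch.nn as nn

class SimplifiedTimeMixing(nn.Module):
    """Illustrative linear-time recurrence standing in for the RWKV-7 time-mixing
    module. NOT the exact RWKV-7 update; it only shows the general shape: a
    matrix-valued state updated once per token, so cost grows as O(T) in sequence
    length instead of the O(T^2) of softmax attention."""

    def __init__(self, d_model: int):
        super().__init__()
        self.receptance = nn.Linear(d_model, d_model, bias=False)  # plays the role of a query
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.decay = nn.Parameter(torch.zeros(d_model))            # per-channel forgetting
        self.output = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        r, k, v = self.receptance(x), self.key(x), self.value(x)
        w = torch.sigmoid(self.decay)            # decay factors in (0, 1)
        state = x.new_zeros(B, D, D)             # matrix-valued recurrent state
        outs = []
        for t in range(T):                       # recurrent form: constant memory per token
            # decay the old state, then add the outer product v_t k_t^T
            state = state * w + v[:, t].unsqueeze(-1) * k[:, t].unsqueeze(1)
            # read out with the receptance vector (query-like)
            outs.append(torch.einsum("bvk,bk->bv", state, r[:, t]))
        return self.output(torch.stack(outs, dim=1))

class BlockWithTimeMixing(nn.Module):
    """A pre-norm Transformer block whose self-attention has been replaced by the
    recurrent time-mixing sketch above; the MLP (channel mixing) is left untouched,
    mirroring the kind of swap the paper describes."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.time_mixing = SimplifiedTimeMixing(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.time_mixing(self.norm1(x))
        return x + self.mlp(self.norm2(x))
```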
Methodological Approach
The authors detail a multi-stage distillation process that transfers knowledge from a Transformer-based teacher to the ARWKV student. Stage 1 aligns the hidden-state outputs of the student (the RNN-based ARWKV) with those of the teacher (the Transformer). Notably, the new state-based attention module does not need to be initialized from the teacher's attention weights, which streamlines the distillation process.
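A minimal sketch of such an alignment step is given below. It assumes Hugging Face-style causal LMs that expose per-layer hidden states via output_hidden_states=True; the MSE objective and the function names are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def stage1_alignment_step(student, teacher, batch, optimizer):
    """One hypothetical Stage-1 step: align the student's per-layer hidden states
    with those of the frozen teacher on the same batch. The MSE loss and the
    layer-averaging scheme are assumptions made for illustration."""
    teacher.eval()
    with torch.no_grad():
        t_out = teacher(**batch, output_hidden_states=True)
    s_out = student(**batch, output_hidden_states=True)

    # Average the alignment loss over layers (index 0 is the embedding output).
    loss = sum(
        F.mse_loss(s_h, t_h)
        for s_h, t_h in zip(s_out.hidden_states[1:], t_out.hidden_states[1:])
    ) / (len(s_out.hidden_states) - 1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```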
Stage 2 applies knowledge distillation proper, using a divergence-based objective to bring the student's output distribution toward the teacher's. The authors emphasize word-level KL divergence over sequence-level variants for its effectiveness in model distillation. The procedure scales to multi-billion-parameter models, including distilling a 32B-parameter teacher into a smaller 7B student.
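A word-level (per-token) KL objective of this kind can be written as follows. The temperature scaling and the masking convention are illustrative choices; the paper specifies only that the divergence is computed at the word level rather than over whole sequences.

```python
import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits, teacher_logits, attention_mask, temperature=1.0):
    """Per-token KL(teacher || student) over the vocabulary, averaged across all
    non-padded positions. Logits have shape (batch, seq_len, vocab_size)."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence per token, shape (batch, seq_len); clamp avoids log(0).
    kl = (t_prob * (t_prob.clamp_min(1e-9).log() - s_logp)).sum(dim=-1)
    mask = attention_mask.float()
    return (kl * mask).sum() / mask.sum()
```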
Evaluation and Results
The evaluation shows that learning from both Transformer and RNN components gives the model flexibility in context-length handling and efficiency while largely preserving benchmark performance. The authors compare several configurations, varying which layers are frozen and whether gating is used in the architecture. Interestingly, the paper reports an unexpected improvement in inference performance when switching from BF16 to FP16 precision, a characteristic specific to the RWKV implementation.
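One such ablation, training only the newly inserted modules while keeping the rest of the network frozen, could be configured as in the sketch below. The "time_mixing" and "gate" name filters are assumptions about how the modules are registered in the model; they would need to match the actual parameter names of a real checkpoint.

```python
import torch.nn as nn

def freeze_all_but_time_mixing(model: nn.Module) -> None:
    """Sketch of one ablation configuration: freeze every parameter except those
    belonging to the swapped-in time-mixing modules and optional gates."""
    for name, param in model.named_parameters():
        param.requires_grad = any(tag in name for tag in ("time_mixing", "gate"))
```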
Table 1 of the paper reports benchmark results after Stage 2 of training, with notable scores on tasks such as MMLU, SQuAD, and WinoGrande. The authors also observe a performance gap when distilling from the larger 32B teacher while retaining the smaller model's MLP capacity, pointing to an architectural mismatch that merits further exploration.
Implications and Future Directions
The development of ARWKV underscores how RNN-based models can benefit from Transformer-derived attention mechanisms, broadening the space of viable architectures. Combining the two approaches opens promising pathways for improving model efficiency, particularly in applications constrained by token budgets and compute resources.
Future work, as envisaged by the authors, involves refining post-training strategies to approach the reasoning capabilities demonstrated by models such as DeepSeek-R1. They also anticipate generalizing the methodology to other architectures, including multi-modal and Mixture-of-Experts (MoE) systems, to test the robustness and extensibility of the approach.
In conclusion, the paper offers meaningful insight into hybrid model architectures, proposing a compelling alternative to purely Transformer- or RNN-based models by exploiting the distinct strengths of each. This work has the potential to influence future architectural strategies in the ongoing development of efficient and expressive LLMs.