
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence (2404.05892v4)

Published 8 Apr 2024 in cs.CL and cs.AI

Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality. We trained four Eagle models, ranging from 0.46 to 7.5 billion parameters, and two Finch models with 1.6 and 3.1 billion parameters and find that they achieve competitive performance across a wide variety of benchmarks. We release all our models on HuggingFace under the Apache 2.0 license. Models at: https://huggingface.co/RWKV Training code at: https://github.com/RWKV/RWKV-LM Inference code at: https://github.com/RWKV/ChatRWKV Time-parallel training code at: https://github.com/RWKV/RWKV-infctx-trainer

Eagle and Finch: Enhancements in Model Architecture for Efficient Sequence Processing

Introduction to RWKV Architecture

Recent developments in LLMs have focused on the computational and memory inefficiencies of Transformer models, notably the quadratic growth of compute and memory with sequence length. The RWKV (Receptance Weighted Key Value) architecture, introduced with RWKV-4, aims to merge the strengths of RNNs and Transformers: it scales linearly in memory and computational cost with sequence length while remaining parallelizable during training, much like a Transformer. This paper introduces two successors to RWKV-4, Eagle (RWKV-5) and Finch (RWKV-6), which refine and extend the original architecture toward more efficient training and stronger performance.

Eagle/Finch Architectural Innovations

The introduction of Eagle and Finch brings forward several architectural innovations that contribute to their enhanced performance and efficiency:

  • Matrix-valued States with Linear Attention: Both Eagle and Finch adopt matrix-valued states in contrast to the vector-valued states of traditional RNNs. This change, coupled with a linear attention mechanism, allows for more expressive and dynamic modeling of sequences.
  • Dynamic Recurrence Mechanism: Finch extends this expressivity by making the time-mixing and token-shift modules data-dependent, so the decay applied to the recurrent state and the blending of adjacent token representations vary with the input rather than being fixed. This lets the model decide, token by token, how much past information to retain (a simplified sketch combining both mechanisms follows this list).
  • Scalability and Training Efficiency: Finch and Eagle can be effectively scaled up to billions of parameters due to their linear memory and computation scaling properties. This allows for the training of large-scale models without the prohibitive resource requirements of traditional Transformer models.
  • Versatility across Tasks: Through extensive benchmarking, Eagle and Finch have demonstrated competitive or superior performance across a wide range of sequence modeling tasks and domains, from text to music modeling. This versatility underscores the potential of RWKV-based architectures in diverse applications within NLP and beyond.
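
To make these mechanisms concrete, the following is a minimal, single-head sketch in the spirit of the Eagle/Finch recurrence, not the official implementation (see https://github.com/RWKV/RWKV-LM for that). The function name recurrent_head, the tensor shapes, and the way the decay w and bonus u are wired are illustrative assumptions; the real models add token shift, gating, and a chunked, time-parallel formulation for training.

```python
# Minimal illustrative sketch of a matrix-valued recurrent state with
# per-step decay, in the spirit of Eagle (RWKV-5) / Finch (RWKV-6).
# Not the official implementation; shapes and wiring are simplified.
import numpy as np

def recurrent_head(r, k, v, w, u):
    """
    r, k, v : (T, d) receptance, key, and value sequences for one head
    w       : (T, d) per-step decay in (0, 1); data-dependent per token in
              Finch, a fixed learned vector (broadcast over T) in Eagle
    u       : (d,)   "bonus" weighting applied only to the current token
    Returns : (T, d) head output
    """
    T, d = r.shape
    S = np.zeros((d, d))                      # matrix-valued state (a vector in RWKV-4)
    out = np.empty((T, d))
    for t in range(T):
        kv = np.outer(k[t], v[t])             # rank-1 update from the current token
        out[t] = r[t] @ (S + np.diag(u) @ kv) # read past state plus boosted current token
        S = np.diag(w[t]) @ S + kv            # decay the old state, then accumulate
    return out

# Tiny usage example with random inputs
T, d = 8, 16
rng = np.random.default_rng(0)
r, k, v = (rng.standard_normal((T, d)) for _ in range(3))
w = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, d))))  # sigmoid keeps decay in (0, 1)
u = rng.standard_normal(d)
print(recurrent_head(r, k, v, w, u).shape)               # (8, 16)
```

Because each step touches only a fixed-size state, per-token inference cost and memory do not grow with the sequence length, which is the property the bullets above refer to.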

Theoretical and Practical Implications

The theoretical underpinnings of RWKV-5 and RWKV-6 provide insight into the efficient handling of long sequences. By enabling linear-time processing with dynamic data-driven adjustments, they offer a scalable alternative to Transformers without sacrificing performance. Practically, the release of Eagle and Finch models under an open license facilitates broader access to state-of-the-art tools for sequence modeling, promoting innovation and application in various fields.
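
As a rough, back-of-the-envelope comparison (the exact constants and per-head state sizes are simplifying assumptions, with model width d and h heads), generating a sequence of length T costs approximately

$$
\underbrace{\sum_{t=1}^{T} O(t\,d) \;=\; O(T^{2} d)}_{\text{softmax attention with a growing KV cache}}
\qquad\text{vs.}\qquad
\underbrace{\sum_{t=1}^{T} O\!\left(\tfrac{d^{2}}{h}\right) \;=\; O\!\left(T\,\tfrac{d^{2}}{h}\right)}_{\text{fixed-size matrix-valued recurrent state}}
$$

so the per-token cost and state size of Eagle and Finch stay constant as the sequence grows, whereas attention's per-token cost and cache grow with the prefix length.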

Future Developments

While Eagle and Finch represent significant advancements, there is an ongoing need for research into further optimization and application of RWKV architectures. Future work may explore the integration of these models into larger, more complex systems, the refinement of their architecture for improved efficiency, and their adaptation for more specialized tasks.

Conclusion

In summary, Eagle (RWKV-5) and Finch (RWKV-6) extend the RWKV architecture to offer efficient and scalable solutions for sequence modeling. They blend the strengths of RNNs and Transformers, achieving high performance with lower resource requirements. This work not only advances the state of language modeling but also opens avenues for future research and applications leveraging these architectures.

Authors (30)
  1. Bo Peng (304 papers)
  2. Daniel Goldstein (15 papers)
  3. Quentin Anthony (25 papers)
  4. Alon Albalak (26 papers)
  5. Eric Alcaide (8 papers)
  6. Stella Biderman (55 papers)
  7. Eugene Cheah (5 papers)
  8. Teddy Ferdinan (5 papers)
  9. Haowen Hou (15 papers)
  10. Przemysław Kazienko (53 papers)
  11. Kranthi Kiran GV (3 papers)
  12. Jan Kocoń (14 papers)
  13. Satyapriya Krishna (27 papers)
  14. Ronald McClelland Jr. (1 paper)
  15. Niklas Muennighoff (56 papers)
  16. Fares Obeid (4 papers)
  17. Atsushi Saito (4 papers)
  18. Guangyu Song (6 papers)
  19. Haoqin Tu (25 papers)
  20. Stanisław Woźniak (16 papers)
Citations (41)