Overview of the RWKV-7 "Goose" Architecture
RWKV-7 "Goose" introduces a refined sequence modeling architecture that achieves competitive multilingual and English language performance at the 3B parameter scale. Crucially, it addresses the quadratic time and memory complexities inherent in Transformer-based models by maintaining constant memory usage per token and linear computational complexity with respect to sequence length. The model's architecture is underpinned by a series of innovations in state tracking and dynamic state evolution that leverage a generalized delta rule and vector-valued gating mechanisms.
Dynamic State Evolution and Its Components
RWKV-7 advances the prior RWKV framework by generalizing the delta rule to incorporate vector-valued gating and in-context learning rates. These modifications are critical for improving the model’s efficiency and state tracking capabilities:
- Generalized Delta Rule: The model replaces the scalar update of traditional delta-rule recurrences with a vector-valued formulation. This allows channel-wise adjustments to the hidden state, enabling differential treatment of individual state components. The approach permits selective updates and replacements, which are essential for handling long sequences and maintaining effective context propagation (a minimal sketch combining the three mechanisms follows this list).
- Vector-Valued Gating Mechanism: Unlike scalar gates that apply a uniform operation across all state dimensions, RWKV-7 employs a vector-valued gate. This mechanism acts on individual channels of the hidden state, ensuring that only the most relevant components are modified at each time step. Channel-wise selective updating significantly enhances the model's capacity for fine-grained state management.
- In-Context Learning Rates: Vector-valued in-context learning rates let the architecture adjust update magnitudes on a per-channel basis. This is critical for maintaining numerical stability, especially when modeling long-term dependencies, and it lets the network adaptively control how much existing state information is retained versus overwritten, which is fundamental for recognizing regular languages and maintaining state fidelity.
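To make the interplay of these three components concrete, the sketch below implements a generalized delta-rule step in NumPy with a vector-valued decay gate `w` and a vector-valued in-context learning rate `a`. It is a minimal sketch of the general shape described above, not RWKV-7's exact parameterization; the function and variable names are illustrative choices rather than identifiers from the released code.

```python
import numpy as np

def delta_rule_step(S, k, v, w, a):
    """One token of a generalized delta-rule update (illustrative sketch).

    S : (d_v, d_k) recurrent state matrix carried across tokens
    k : (d_k,)     key for the current token (assumed roughly unit-norm)
    v : (d_v,)     value for the current token
    w : (d_k,)     vector-valued decay gate, entries in (0, 1]
    a : (d_k,)     vector-valued in-context learning rate, entries in [0, 1]
    """
    S = S * w[None, :]                  # channel-wise decay of the state (vector gating)
    pred = S @ k                        # what the state currently stores for this key
    # Delta-rule correction: move the stored association toward the new value,
    # with the step size controlled per key channel by `a`.
    return S + np.outer(v - pred, a * k)

# Toy usage: write an association, then half-overwrite it.
d_k = d_v = 4
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])
S = delta_rule_step(S, k, v=np.ones(d_v), w=np.ones(d_k), a=np.ones(d_k))
S = delta_rule_step(S, k, v=np.zeros(d_v), w=np.ones(d_k), a=np.full(d_k, 0.5))
print(S @ k)                            # halfway between the old and the new value
```

Setting some channels of `a` to zero leaves the corresponding parts of the state untouched, which is the selective "copy versus update" behavior described above.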
Theoretical and Empirical Advancements
Theoretical Advances
RWKV-7 claims to surpass the theoretical limitations of Transformer models, which under standard complexity-theoretic conjectures are confined to the class TC⁰. By incorporating dynamic state evolution through its generalized delta rule, RWKV-7 demonstrates the ability to solve problems beyond TC⁰. Specifically, the architecture is provably capable of recognizing all regular languages with a constant number of layers, a notable theoretical advance over standard self-attention mechanisms.
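To illustrate why per-token, input-dependent state transitions relate to regular-language recognition, the toy below encodes a two-state DFA (parity of 1s) as one transition matrix per input symbol applied to a one-hot state vector. This is the standard linear-algebra view of a DFA and a conceptual aid only; it is not RWKV-7 code.

```python
import numpy as np

# Linear-algebra view of a DFA: one transition matrix per input symbol,
# applied to a one-hot state vector. The automaton below tracks whether
# a bit string contains an odd number of 1s.
T = {
    0: np.eye(2),                       # reading '0' leaves the parity state unchanged
    1: np.array([[0.0, 1.0],
                 [1.0, 0.0]]),          # reading '1' swaps the even/odd states
}

def parity(bits):
    s = np.array([1.0, 0.0])            # start in the "even" state (one-hot)
    for b in bits:
        s = T[b] @ s                    # per-token, input-dependent transition
    return "odd" if s[1] == 1.0 else "even"

print(parity([1, 0, 1, 1]))             # -> "odd" (three 1s seen)
```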
Empirical Performance
- Multilingual Benchmarks:
RWKV-7 achieves state-of-the-art performance on multilingual tasks at the 3B parameter scale while being trained on significantly fewer tokens than other models in the same regime. This Pareto improvement (comparable or better quality for fewer training FLOPs) highlights the effectiveness of the dynamic state evolution mechanism in handling diverse linguistic data.
- English Language Tasks:
Despite the reduced training token volume, the model matches contemporary state-of-the-art English language performance. This balance between data efficiency and performance is achieved by leveraging the enhanced state tracking capabilities provided by the vector-valued mechanisms.
- Model Scalability and Efficiency:
The architecture's ability to perform both state tracking and dynamic updates with constant per-token memory usage and linear time complexity is particularly beneficial when scaling to longer sequence lengths, making RWKV-7 a viable candidate for applications where both efficiency and performance are critical (a back-of-the-envelope cost comparison follows this list).
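As a rough illustration of the last bullet, the snippet below compares the dominant per-layer cost terms commonly quoted for self-attention (roughly T²·d) and for a fixed-size-state recurrence (roughly T·d²); the context length and width are hypothetical round numbers, not measured RWKV-7 figures.

```python
# Rough per-layer cost comparison (illustrative, not measured):
# self-attention grows about T^2 * d with sequence length T and width d,
# while a fixed-size-state recurrence grows about T * d^2.
T, d = 32_768, 2_560                    # hypothetical context length and model width
attention_cost = T * T * d
recurrent_cost = T * d * d
print(f"attention  ~ {attention_cost:.3e} ops")
print(f"recurrence ~ {recurrent_cost:.3e} ops")
print(f"ratio      ~ {attention_cost / recurrent_cost:.1f}x")   # = T / d = 12.8x here
```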
Implementation and Practical Considerations
Codebase and Reproducibility
The RWKV-7 models, along with the training and inference code, are publicly available. This open-source release fosters reproducibility and further research. The implementation is aligned with standard deep learning frameworks, making integration with existing pipelines straightforward. The code release under the Apache 2.0 License ensures that the models can be adapted for both research and industrial applications.
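For orientation, a minimal loading sketch is shown below, assuming a checkpoint published on the Hugging Face Hub; the repository id used here is a placeholder rather than a confirmed model name, and `trust_remote_code=True` is only required if the checkpoint ships custom modeling files.

```python
# Hypothetical loading sketch: the repo id below is a placeholder, not a
# confirmed checkpoint name; substitute the id of an actual RWKV-7 release.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "RWKV/placeholder-rwkv7-goose-3b"     # placeholder id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("RWKV-7 maintains a constant-size state", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```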
Computational Requirements
- Memory Efficiency:
Due to constant memory usage per token, RWKV-7 is particularly advantageous for applications with long sequence lengths or where hardware memory constraints are a concern (a toy illustration of the fixed-size state follows this list).
- Inference Performance:
Constant inference time per token facilitates real-time applications and deployment in environments where predictable latency is crucial.
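To make the constant-memory point concrete, here is a toy generation-style loop in which the only tensor carried between tokens is a fixed-shape state; the update inside `step` is a stand-in rather than the actual RWKV-7 layer, and the point is only that the carried state does not grow with sequence length, unlike a Transformer KV cache.

```python
import numpy as np

d_k = d_v = 64
state = np.zeros((d_v, d_k))            # the only tensor kept between tokens

def step(state, x):
    # Stand-in update (not the real layer): what matters here is that the
    # state shape stays fixed no matter how many tokens have been processed.
    k = np.tanh(x[:d_k])
    v = np.tanh(x[:d_v])
    return 0.99 * state + np.outer(v, k)

for _ in range(10_000):                 # process 10,000 tokens ...
    state = step(state, np.random.randn(max(d_k, d_v)))
print(state.shape, state.nbytes)        # ... state is still (64, 64), ~32 KB
```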
Deployment Strategies
Given its efficiency, RWKV-7 is well-suited for deployment in scenarios requiring on-device processing or cloud-based inference services where resource constraints preclude the use of quadratic-complexity models. Typical deployment pipelines can leverage the published codebase, adopting containerized environments (e.g., Docker) and standard model serving frameworks to integrate RWKV-7 into production systems.
Trade-Offs and Limitations
- Capacity vs. Data Efficiency:
While RWKV-7 achieves competitive performance with fewer training tokens, practitioners must balance model capacity against the diversity and scale of the training dataset, particularly in specialized domains.
- Complexity of State Evolution:
The highly dynamic state evolution mechanism, while powerful, introduces additional hyperparameters (e.g., vector-valued learning rates) that may require careful tuning. Researchers need to monitor convergence and stability, especially when extending the model to very long sequences or highly heterogeneous data distributions.
Conclusion
RWKV-7 "Goose" offers a robust alternative to Transformer architectures, significantly reducing computational overhead while delivering state-of-the-art multilingual performance and competitive English language performance at the 3B scale. Its generalized delta rule with vector-valued gating and in-context learning rates provides advanced state tracking capabilities together with improved numerical stability and efficiency. The publicly available models and training framework further enhance its applicability, making RWKV-7 a compelling choice for both research exploration and production deployments where resource efficiency is paramount.