
RWKV: Reinventing RNNs for the Transformer Era (2305.13048v2)

Published 22 May 2023 in cs.CL and cs.AI

Abstract: Transformers have revolutionized almost all NLP tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.

RWKV: Reinventing RNNs for the Transformer Era

The landscape of NLP has been dramatically reshaped by the advent of Transformer models, with their self-attention mechanism enabling unparalleled advancements in various tasks. Despite their success, Transformers come with intrinsic limitations, most notably their quadratic computational and memory complexities concerning sequence length. On the other hand, Recurrent Neural Networks (RNNs) exhibit linear scaling but falter in performance due to non-parallelizability and scalability issues. This paper introduces a novel architecture termed Receptance Weighted Key Value (RWKV), aiming to meld the strengths of both RNNs and Transformers while mitigating their respective limitations.

Research Motivation and Approach

Transformers' impact on NLP tasks is profound, but their scalability is hindered by the quadratic complexity of self-attention. In contrast, the linear scaling of memory and computation in RNNs presents an alluring alternative if their performance bottleneck can be overcome. RWKV leverages a linear attention mechanism, reformulating the model so that it can be expressed either as a Transformer or as an RNN. This dual formulation allows RWKV to harness parallelizable computation during training while maintaining constant computational and memory complexity per token during inference.
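
To make this dual formulation concrete, the following minimal sketch (not taken from the authors' code; the scalar decay w, the bonus u, and the random keys and values are illustrative assumptions) computes a per-channel WKV quantity twice: once as an exponentially decayed sum over all past tokens, mirroring the parallel Transformer-style view used during training, and once as a two-number recurrence, mirroring the constant-memory RNN-style view used during inference, and checks that the two agree.

import numpy as np

rng = np.random.default_rng(0)
T = 8                        # sequence length
k = rng.normal(size=T)       # keys for one channel (illustrative random values)
v = rng.normal(size=T)       # values for one channel
w, u = 0.5, 0.3              # assumed decay rate and current-token bonus

# Parallel view: each position attends to all earlier ones with exponentially decayed weights.
wkv_parallel = np.empty(T)
for t in range(T):
    decayed = np.exp(-(t - 1 - np.arange(t)) * w + k[:t])   # weights for positions i < t
    num = (decayed * v[:t]).sum() + np.exp(u + k[t]) * v[t]
    den = decayed.sum() + np.exp(u + k[t])
    wkv_parallel[t] = num / den

# Recurrent view: the same quantity from two running scalars, so memory per step is constant.
wkv_recurrent = np.empty(T)
a = b = 0.0                  # decayed numerator and denominator carried across steps
for t in range(T):
    wkv_recurrent[t] = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
    a = np.exp(-w) * a + np.exp(k[t]) * v[t]
    b = np.exp(-w) * b + np.exp(k[t])

assert np.allclose(wkv_parallel, wkv_recurrent)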

The core architecture of RWKV integrates the following components (a code sketch of a time-mixing block follows the list):

  1. Linear Attention: Reformulating attention mechanisms to achieve linear, rather than quadratic, complexity.
  2. Receptance Mechanism: Incorporating channel-directed attention to enable efficient handling of long-range dependencies.
  3. Parallelizable Training: Leveraging Transformer-like parallel training.
  4. Efficient Inference: Utilizing RNN-like constant-complexity inference.
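
The sketch below ties these pieces together: one time-mixing block run token by token in recurrent (inference) mode, with token shift, the WKV recurrence, and the sigmoid receptance gate. It is a minimal illustration under assumed shapes and random parameters, not the authors' implementation, and it omits the numerical-stability rescaling a production kernel would apply to the exponentials.

import numpy as np

d = 16                                    # assumed channel width
rng = np.random.default_rng(1)
Wr, Wk, Wv, Wo = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(4))
mu_r, mu_k, mu_v = (rng.uniform(size=d) for _ in range(3))
w = rng.uniform(0.1, 1.0, size=d)         # per-channel decay (kept positive)
u = rng.normal(size=d)                    # per-channel bonus for the current token

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_mixing_step(x_t, x_prev, a, b):
    # Token shift: interpolate between the current and previous inputs per channel.
    r = Wr @ (mu_r * x_t + (1 - mu_r) * x_prev)
    k = Wk @ (mu_k * x_t + (1 - mu_k) * x_prev)
    v = Wv @ (mu_v * x_t + (1 - mu_v) * x_prev)
    # WKV: exponentially decayed average of past values, plus a bonus for the current token.
    wkv = (a + np.exp(u + k) * v) / (b + np.exp(u + k))
    a = np.exp(-w) * a + np.exp(k) * v
    b = np.exp(-w) * b + np.exp(k)
    # Receptance: a learned sigmoid gate acting channel-wise on the output.
    out = Wo @ (sigmoid(r) * wkv)
    return out, x_t, a, b

# Run a short sequence one token at a time; only O(d) state is carried between steps.
x_prev, a, b = np.zeros(d), np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(5, d)):
    y, x_prev, a, b = time_mixing_step(x_t, x_prev, a, b)

The only state carried across tokens is the previous input and the two WKV accumulators, which is what keeps per-token inference cost and memory independent of sequence length.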

Experimental Validation

RWKV models, scaled up to 14 billion parameters, exhibit performance parity with similarly sized Transformers, demonstrating RWKV's competitiveness without the quadratic scaling drawback. Evaluations across twelve NLP benchmarks, including ARC Challenge and LAMBADA, show RWKV matching Transformer baselines of comparable size, underscoring its potential as a computationally efficient architecture at large parameter counts.

Performance and Complexity

A pivotal advantage of RWKV lies in its computational efficiency (a rough numeric comparison follows the list):

  • Time and Space Complexity: A standard Transformer requires O(T²d) time and O(T² + Td) memory for sequence length T and model width d. RWKV instead scales as O(Td) in time, and its recurrent formulation keeps memory per token constant during inference, significantly reducing the computational overhead.
  • Scalability: The models ranging from 169 million to 14 billion parameters trained on extensive datasets demonstrate effective scaling without prohibitive computational costs.
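
For a rough sense of what these asymptotics mean in practice, the snippet below compares idealized per-layer operation counts at an assumed width of d = 4096; constants, attention heads, and layer counts are ignored, so the figures are illustrative rather than measurements from the paper.

# Back-of-envelope comparison: quadratic self-attention scales as T^2 * d,
# an RWKV-style recurrence as T * d (constants ignored).
d = 4096                                   # assumed model width
for T in (1_024, 8_192, 65_536):
    attn = T * T * d                       # ~ O(T^2 d) pairwise token interactions
    rwkv = T * d                           # ~ O(T d) one state update per token
    print(f"T={T:>6}: attention ~{attn:.1e} ops, RWKV ~{rwkv:.1e} ops, ratio {attn // rwkv}x")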

Future Directions and Implications

RWKV's innovative architecture bridges a crucial gap between computational efficiency and representational capacity, presenting a framework that could potentially redefine AI models' scalability in sequence processing. These sustainable and cost-effective models enable broader deployment in resource-constrained environments, heralding significant implications for both practical applications and theoretical research.

Speculative Developments

Looking forward, enhancements in RWKV could include:

  • Improving Time-Decay Formulations: Refining the mechanisms that dictate the relevance of past information.
  • Cross-Attention Substitution: Replacing traditional cross-attention mechanisms in encoder-decoder architectures with RWKV-style computations.
  • Customizability through Prompt Tuning: Exploring the manipulation of hidden states to refine behavior predictability and model interpretability.
  • Expanded State Memory: Increasing the internal state capacity to enhance long-range dependency modeling.

Conclusions

RWKV represents a significant advancement in neural network design, uniting RNN and Transformer advantages while curtailing their limitations. By reformulating attention and leveraging channel-directed mechanisms, RWKV achieves linear computational complexity, making it a compelling choice for large-scale sequence processing tasks. This contribution lays a foundation for more efficient and sustainable AI models, potentially transforming how we approach and deploy large-scale AI systems.

In conclusion, RWKV paves the way for the next generation of efficient and scalable neural architectures, striking a critical balance between performance and computational feasibility. Its ability to manage extensive parameter counts with constrained resources points to a promising avenue for future developments within the NLP and broader AI communities.

Authors (34)
  1. Bo Peng
  2. Eric Alcaide
  3. Quentin Anthony
  4. Alon Albalak
  5. Samuel Arcadinho
  6. Stella Biderman
  7. Huanqi Cao
  8. Xin Cheng
  9. Michael Chung
  10. Matteo Grella
  11. Kranthi Kiran GV
  12. Xuzheng He
  13. Haowen Hou
  14. Jiaju Lin
  15. Jiaming Kong
  16. Bartlomiej Koptyra
  17. Hayden Lau
  18. Krishna Sri Ipsit Mantri
  19. Ferdinand Mom
  20. Atsushi Saito