HGRN2: Gated Linear RNNs with State Expansion (2404.07904v2)

Published 11 Apr 2024 in cs.CL

Abstract: Hierarchically gated linear RNN (HGRN; Qin et al., 2023) has demonstrated competitive training speed and performance in language modeling while offering efficient inference. However, the recurrent state size of HGRN remains relatively small, limiting its expressiveness. To address this issue, we introduce a simple outer-product-based state expansion mechanism, which significantly enlarges the recurrent state size without introducing any additional parameters. This enhancement also provides a linear attention interpretation for HGRN2, enabling hardware-efficient training. Our extensive experiments verify the advantage of HGRN2 over HGRN consistently across different settings, and its competitiveness with other recurrent models.

References (61)
  1. The hidden attention of mamba models. 2024. URL https://api.semanticscholar.org/CorpusID:268248520.
  2. State space models as foundation models: A control theoretic overview. 2024. URL https://api.semanticscholar.org/CorpusID:268681121.
  3. Zoology: Measuring and improving recall in efficient language models. arXiv:2312.04927, 2023.
  4. Simple linear attention language models balance the recall-throughput tradeoff. CoRR, abs/2402.18668, 2024. doi: 10.48550/ARXIV.2402.18668. URL https://doi.org/10.48550/arXiv.2402.18668.
  5. Hydra attention: Efficient attention with many heads. In ECCV Workshops, 2022. URL https://api.semanticscholar.org/CorpusID:252284084.
  6. Rethinking attention with performers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  7. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. URL http://arxiv.org/abs/1412.3555.
  8. Griffin: Mixing gated linear recurrences with local attention for efficient language models. ArXiv, abs/2402.19427, 2024. URL https://api.semanticscholar.org/CorpusID:268091246.
  9. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  10. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  11. Neural turing machines. ArXiv, abs/1410.5401, 2014. URL https://api.semanticscholar.org/CorpusID:15299054.
  12. Mamba: Linear-time sequence modeling with selective state spaces. 2023.
  13. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a.
  14. Efficiently modeling long sequences with structured state spaces, 2022b.
  15. On the parameterization and initialization of diagonal state space models. ArXiv, abs/2206.11893, 2022c. URL https://api.semanticscholar.org/CorpusID:249953875.
  16. Diagonal state spaces are as effective as structured state spaces. In NeurIPS, 2022a. URL http://papers.nips.cc/paper_files/paper/2022/hash/9156b0f6dfa9bbd18c79cc459ef5d61c-Abstract-Conference.html.
  17. Diagonal state spaces are as effective as structured state spaces, 2022b.
  18. Simplifying and understanding state space models with diagonal linear rnns. ArXiv, abs/2212.00768, 2022c. URL https://api.semanticscholar.org/CorpusID:254125297.
  19. Transformer quality in linear time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  9099–9117. PMLR, 2022.
  20. Polysketchformer: Fast transformers via sketching polynomial kernels, 2023.
  21. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp.  5156–5165. PMLR, 2020.
  22. Tobias Katsch. Gateloop: Fully data-controlled linear recurrence for sequence modeling. ArXiv, abs/2311.01927, 2023.
  23. What makes convolutional models great on long sequence modeling? ArXiv, abs/2210.09298, 2022.
  24. Pay attention to mlps, 2021.
  25. Mega: Moving average equipped gated attention. CoRR, abs/2209.10655, 2022. doi: 10.48550/arXiv.2209.10655. URL https://doi.org/10.48550/arXiv.2209.10655.
  26. Huanru Henry Mao. Fine-tuning pre-trained transformers into decaying fast weights. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  10236–10242, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.697.
  27. Parallelizing linear recurrent neural nets over sequence length. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
  28. Long range language modeling via gated state spaces. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  29. Pyramidal recurrent unit for language modeling. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  4620–4630, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1491. URL https://aclanthology.org/D18-1491.
  30. Delight: Deep and light-weight transformer. In International Conference on Learning Representations, 2021. URL https://api.semanticscholar.org/CorpusID:235613336.
  31. Pointer sentinel mixture models. 5th International Conference on Learning Representations, ICLR, Toulon, France, 2017.
  32. Resurrecting recurrent neural networks for long sequences. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  26670–26698. PMLR, 2023. URL https://proceedings.mlr.press/v202/orvieto23a.html.
  33. RWKV: reinventing rnns for the transformer era. CoRR, abs/2305.13048, 2023. doi: 10.48550/ARXIV.2305.13048.
  34. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. 2024.
  35. Random feature attention. arXiv preprint arXiv:2103.02143, 2021.
  36. ABC: Attention with bounded-memory control. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, May 2022. Association for Computational Linguistics.
  37. Recurrent linear transformers. CoRR, abs/2310.15719, 2023.
  38. The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022a.
  39. cosformer: Rethinking softmax in attention. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022b.
  40. cosformer: Rethinking softmax in attention. In ICLR, 2022c. URL https://openreview.net/forum?id=Bl8CQrx2Up4.
  41. Toeplitz neural network for sequence modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023a. URL https://openreview.net/forum?id=IxmWsm4xrua.
  42. Scaling transnormer to 175 billion parameters. arXiv preprint arXiv:2307.14995, 2023b.
  43. Hierarchically gated recurrent neural network for sequence modeling. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023c. URL http://papers.nips.cc/paper_files/paper/2023/hash/694be3548697e9cc8999d45e8d16fe1e-Abstract-Conference.html.
  44. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. 2024.
  45. Gated fast weights for on-the-fly neural program generation. 2017. URL https://api.semanticscholar.org/CorpusID:216094255.
  46. Linear transformers are secretly fast weight programmers. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp.  9355–9366. PMLR, 2021.
  47. Ordered neurons: Integrating tree structures into recurrent neural networks. ArXiv, abs/1810.09536, 2018. URL https://api.semanticscholar.org/CorpusID:53034786.
  48. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  49. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023a.
  50. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023b.
  51. Synthesizer: Rethinking self-attention in transformer models, 2021a.
  52. Long range arena : A benchmark for efficient transformers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021b. URL https://openreview.net/forum?id=qVyeW-grC2k.
  53. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, volume 139, pp.  10347–10357, July 2021.
  54. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  55. Llama 2: Open foundation and fine-tuned chat models, 2023b.
  56. Jos van der Westhuizen and Joan Lasenby. The unreasonable effectiveness of the forget gate. CoRR, abs/1804.04849, 2018.
  57. Pretraining without attention. CoRR, abs/2212.10544, 2022.
  58. Gated linear attention transformers with hardware-efficient training. CoRR, abs/2312.06635, 2023. doi: 10.48550/ARXIV.2312.06635. URL https://doi.org/10.48550/arXiv.2312.06635.
  59. Linear attention via orthogonal memory, 2023.
  60. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry, 2024.
  61. Learning to update auto-associative memory in recurrent neural networks for improving sequence memorization. ArXiv, abs/1709.06493, 2017. URL https://api.semanticscholar.org/CorpusID:22458497.
Authors (7)
  1. Zhen Qin (105 papers)
  2. Songlin Yang (42 papers)
  3. Weixuan Sun (31 papers)
  4. Xuyang Shen (23 papers)
  5. Dong Li (429 papers)
  6. Weigao Sun (19 papers)
  7. Yiran Zhong (75 papers)
Citations (21)

Summary

Enhancing Linear RNNs with State Expansion: The Introduction of HGRN2

Introduction to HGRN2

The Hierarchically Gated Linear RNN (HGRN) architecture has previously shown promise in language modeling, combining competitive quality with efficient, linear-complexity inference. However, its performance has been constrained by its relatively small recurrent state size. In a recent development, researchers have proposed HGRN2, an advancement over HGRN that significantly increases the recurrent state size without adding extra parameters. This is achieved through an outer-product-based state expansion mechanism inspired by linear attention models, enhancing both the model's expressiveness and efficiency. HGRN2 delivers consistent improvements over its predecessor across several benchmarks, including language modeling, image classification, and the Long Range Arena.
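
To make the mechanism concrete, below is a minimal NumPy sketch (not the authors' implementation) of the two recurrences, under the linear attention reading in which the existing forget gate supplies the per-dimension decay (with its complement playing the key role) and the output gate acts as the query; the head dimensions d_k and d_v and the exact gate/query roles are illustrative assumptions.

```python
import numpy as np

def hgrn_step(h, f, i):
    """HGRN-style update: the recurrent state h keeps the same size as the gates."""
    return f * h + (1.0 - f) * i                     # h, f, i all of shape (d,)

def hgrn2_step(S, f, i, g):
    """HGRN2-style update: the state is expanded to a (d_k, d_v) matrix via an outer
    product, reusing the forget gate f (per-dimension decay, with 1 - f as the key)
    and the output gate g (as the query), so no new parameters are introduced."""
    S = f[:, None] * S + np.outer(1.0 - f, i)        # decay the state, then add outer(1 - f, i)
    o = S.T @ g                                      # read the expanded state with the query
    return S, o

# Toy shapes: the per-head state grows from d_k values (HGRN) to d_k * d_v values (HGRN2).
d_k, d_v = 4, 8
rng = np.random.default_rng(0)
f = 1.0 / (1.0 + np.exp(-rng.normal(size=d_k)))      # data-dependent forget gate in (0, 1)

h = hgrn_step(np.zeros(d_k), f, rng.normal(size=d_k))             # HGRN state: (4,)
S, o = hgrn2_step(np.zeros((d_k, d_v)), f, rng.normal(size=d_v),  # HGRN2 state: (4, 8)
                  rng.normal(size=d_k))
print(h.shape, S.shape, o.shape)                     # (4,) (4, 8) (8,)
```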

Motivation and Background

HGRN2 targets a fundamental limitation of RNNs: the fixed size of the recurrent state. Two strategies are central to making better use of that state: data-dependent decays, which retain information selectively, and a larger recurrent state. HGRN made progress on the first by employing data-dependent decays, but its small, fixed state size limited how far performance could scale. State expansion has emerged as the key technique for overcoming this barrier, as demonstrated by contemporary recurrent models such as Mamba and gated linear attention variants. HGRN2 builds on these insights, focusing on state expansion to raise model performance without compromising efficiency.

HGRN2: Key Innovations

HGRN2 introduces several significant improvements over HGRN1, detailed as follows:

  • State Expansion Through Outer Products: HGRN2 expands the recurrent state via an outer product of existing gate and input vectors, substantially enlarging the state without introducing any additional parameters and thus preserving parameter efficiency.
  • Efficient Training and Inference: By casting its recurrence in a linear attention form, HGRN2 admits a hardware-efficient training algorithm that accelerates computation without compromising scalability or performance (see the sketch after this list).
  • Robust Empirical Evaluation: In extensive experiments across varied benchmarks, HGRN2 not only outperforms HGRN1 but also achieves competitive results against state-of-the-art models, including Mamba and LLaMa-architecture Transformers, in language modeling.
  • Scalability and Efficiency: HGRN2 scales efficiently, as demonstrated in controlled large-scale experiments, and its design leaves room for further scaling to more demanding applications.
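
The linear attention view referenced in the bullets above also implies an equivalent parallel, attention-like computation, which is what enables hardware-efficient training. The toy check below is a sketch under the same assumed gate/query roles as the earlier snippet, not the authors' kernel; it verifies that a quadratic parallel form with cumulative decays reproduces the recurrent outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_k, d_v = 12, 4, 6
F = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d_k))))    # forget gates in (0, 1)
V = rng.normal(size=(T, d_v))                           # inputs i_t (value role)
G = rng.normal(size=(T, d_k))                           # output gates (query role)

# 1) Recurrent form: constant-size state per step, shape (d_k, d_v).
S = np.zeros((d_k, d_v))
out_rec = np.zeros((T, d_v))
for t in range(T):
    S = F[t][:, None] * S + np.outer(1.0 - F[t], V[t])
    out_rec[t] = S.T @ G[t]

# 2) Parallel, attention-like form: the score between positions t >= s uses the
#    cumulative decay prod_{r=s+1..t} F[r] = D[t] / D[s], computed in log space.
logD = np.cumsum(np.log(F), axis=0)                     # (T, d_k)
decay = np.exp(logD[:, None, :] - logD[None, :, :])     # (T, T, d_k): D[t] / D[s]
scores = np.einsum('tk,tsk,sk->ts', G, decay, 1.0 - F)  # attention-style score matrix
scores = np.tril(scores)                                # causal mask: keep only s <= t
out_par = scores @ V                                    # (T, d_v)

assert np.allclose(out_rec, out_par)
print("parallel (attention-like) form matches the recurrence")
```

In practice, hardware-efficient implementations of such recurrences typically use a chunkwise hybrid of the two forms, confining the quadratic part to small chunks while carrying the matrix-valued state across chunk boundaries.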

Practical Implications and Theoretical Contributions

HGRN2’s introduction of state expansion via a simple outer product marks a meaningful shift in how the capacity of RNNs for language modeling and beyond can be enhanced. It underscores the untapped potential of linear RNN architectures to achieve high performance with computational efficiency. The practical implications are substantial in applications where inference speed and model scalability are critical. Moreover, the linear attention interpretation underlying HGRN2 offers a fresh perspective on methodically expanding recurrent state, setting a reference point for subsequent research in this domain.

Conclusion and Future Directions

HGRN2 marks a significant step in the evolution of RNNs, balancing the dual goals of greater expressiveness and sustained efficiency. By addressing the limitations of its predecessor through state expansion, HGRN2 paves the way for more sophisticated and scalable RNN architectures. Future research will likely explore further optimizations of state expansion techniques and apply HGRN2’s principles to a broader range of applications, from natural language processing to complex multimodal tasks.