HGRN2: Gated Linear RNNs with State Expansion (2404.07904v2)
Published 11 Apr 2024 in cs.CL
Abstract: The hierarchically gated linear RNN (HGRN; Qin et al., 2023) has demonstrated competitive training speed and performance in language modeling while offering efficient inference. However, the recurrent state size of HGRN remains relatively small, limiting its expressiveness. To address this issue, we introduce a simple outer-product-based state expansion mechanism, which significantly enlarges the recurrent state size without introducing any additional parameters. This enhancement also provides a linear attention interpretation for HGRN2, enabling hardware-efficient training. Our extensive experiments consistently verify the advantage of HGRN2 over HGRN across different settings and show that HGRN2 is competitive with other recurrent models.
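The outer-product state expansion described in the abstract can be illustrated with a small sketch. The NumPy snippet below is a minimal rendition, under simplifying assumptions, of a gated linear recurrence whose per-channel state is expanded from a scalar to a vector via an outer product and read out with a linear-attention-style query. The projections `Wf`, `Wi`, `Wq`, the dimensions, and the function name are hypothetical placeholders; the paper's full parameterization (hierarchical lower bounds on the forget gates, output gating, normalization, multi-head structure) is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_rnn_with_expansion(x, Wf, Wi, Wq):
    """Sketch: gated linear recurrence with outer-product state expansion.

    x  : (T, d_in) input sequence.
    Wf : (d_in, d) forget-gate projection.
    Wi : (d_in, d) input ("value") projection.
    Wq : (d_in, d) query projection.
    Returns an array of shape (T, d).
    """
    d = Wf.shape[1]
    S = np.zeros((d, d))  # matrix-valued recurrent state (a d-vector in HGRN)
    outputs = []
    for t in range(len(x)):
        f = sigmoid(x[t] @ Wf)      # forget gate in (0, 1)
        i = x[t] @ Wi               # input vector
        q = x[t] @ Wq               # query vector
        # The outer product expands each channel's scalar state into a vector,
        # enlarging the total state size with no extra parameters; (1 - f)
        # continues to play the tied input-gate role from HGRN.
        S = f[:, None] * S + np.outer(1.0 - f, i)
        outputs.append(q @ S)       # linear-attention-style readout
    return np.stack(outputs)

# Toy usage with hypothetical dimensions.
rng = np.random.default_rng(0)
T, d_in, d = 8, 16, 16
x = rng.standard_normal((T, d_in))
Wf, Wi, Wq = (0.1 * rng.standard_normal((d_in, d)) for _ in range(3))
print(gated_linear_rnn_with_expansion(x, Wf, Wi, Wq).shape)  # (8, 16)
```

Because the update accumulates decayed outer products, it can also be rewritten in a chunked, attention-like form, which is the linear attention interpretation that the abstract credits for hardware-efficient training.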
- The hidden attention of Mamba models. 2024. URL https://api.semanticscholar.org/CorpusID:268248520.
- State space models as foundation models: A control theoretic overview. 2024. URL https://api.semanticscholar.org/CorpusID:268681121.
- Zoology: Measuring and improving recall in efficient language models. arXiv:2312.04927, 2023.
- Simple linear attention language models balance the recall-throughput tradeoff. CoRR, abs/2402.18668, 2024. doi: 10.48550/ARXIV.2402.18668. URL https://doi.org/10.48550/arXiv.2402.18668.
- Hydra attention: Efficient attention with many heads. In ECCV Workshops, 2022. URL https://api.semanticscholar.org/CorpusID:252284084.
- Rethinking attention with performers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. URL http://arxiv.org/abs/1412.3555.
- Griffin: Mixing gated linear recurrences with local attention for efficient language models. ArXiv, abs/2402.19427, 2024. URL https://api.semanticscholar.org/CorpusID:268091246.
- Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Neural turing machines. ArXiv, abs/1410.5401, 2014. URL https://api.semanticscholar.org/CorpusID:15299054.
- Mamba: Linear-time sequence modeling with selective state spaces. 2023.
- Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a.
- On the parameterization and initialization of diagonal state space models. ArXiv, abs/2206.11893, 2022c. URL https://api.semanticscholar.org/CorpusID:249953875.
- Diagonal state spaces are as effective as structured state spaces. In NeurIPS, 2022a. URL http://papers.nips.cc/paper_files/paper/2022/hash/9156b0f6dfa9bbd18c79cc459ef5d61c-Abstract-Conference.html.
- Simplifying and understanding state space models with diagonal linear rnns. ArXiv, abs/2212.00768, 2022c. URL https://api.semanticscholar.org/CorpusID:254125297.
- Transformer quality in linear time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 9099–9117. PMLR, 2022.
- Polysketchformer: Fast transformers via sketching polynomial kernels, 2023.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.
- Tobias Katsch. Gateloop: Fully data-controlled linear recurrence for sequence modeling. ArXiv, abs/2311.01927, 2023.
- What makes convolutional models great on long sequence modeling? ArXiv, abs/2210.09298, 2022.
- Pay attention to mlps, 2021.
- Mega: Moving average equipped gated attention. CoRR, abs/2209.10655, 2022. doi: 10.48550/arXiv.2209.10655. URL https://doi.org/10.48550/arXiv.2209.10655.
- Huanru Henry Mao. Fine-tuning pre-trained transformers into decaying fast weights. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 10236–10242, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.697.
- Parallelizing linear recurrent neural nets over sequence length. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
- Long range language modeling via gated state spaces. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- Pyramidal recurrent unit for language modeling. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4620–4630, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1491. URL https://aclanthology.org/D18-1491.
- Delight: Deep and light-weight transformer. In International Conference on Learning Representations, 2021. URL https://api.semanticscholar.org/CorpusID:235613336.
- Pointer sentinel mixture models. 5th International Conference on Learning Representations, ICLR, Toulon, France, 2017.
- Resurrecting recurrent neural networks for long sequences. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 26670–26698. PMLR, 2023. URL https://proceedings.mlr.press/v202/orvieto23a.html.
- RWKV: Reinventing RNNs for the transformer era. CoRR, abs/2305.13048, 2023. doi: 10.48550/ARXIV.2305.13048.
- Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence. 2024.
- Random feature attention. arXiv preprint arXiv:2103.02143, 2021.
- ABC: Attention with bounded-memory control. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, May 2022. Association for Computational Linguistics.
- Recurrent linear transformers. CoRR, abs/2310.15719, 2023.
- The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022a.
- cosformer: Rethinking softmax in attention. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022b.
- Toeplitz neural network for sequence modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023a. URL https://openreview.net/forum?id=IxmWsm4xrua.
- Scaling transnormer to 175 billion parameters. arXiv preprint arXiv:2307.14995, 2023b.
- Hierarchically gated recurrent neural network for sequence modeling. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023c. URL http://papers.nips.cc/paper_files/paper/2023/hash/694be3548697e9cc8999d45e8d16fe1e-Abstract-Conference.html.
- Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. 2024.
- Gated fast weights for on-the-fly neural program generation. 2017. URL https://api.semanticscholar.org/CorpusID:216094255.
- Linear transformers are secretly fast weight programmers. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 9355–9366. PMLR, 2021.
- Ordered neurons: Integrating tree structures into recurrent neural networks. ArXiv, abs/1810.09536, 2018. URL https://api.semanticscholar.org/CorpusID:53034786.
- Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023a.
- Synthesizer: Rethinking self-attention in transformer models, 2021a.
- Long range arena : A benchmark for efficient transformers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021b. URL https://openreview.net/forum?id=qVyeW-grC2k.
- Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, volume 139, pp. 10347–10357, July 2021.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models, 2023b.
- Jos van der Westhuizen and Joan Lasenby. The unreasonable effectiveness of the forget gate. CoRR, abs/1804.04849, 2018.
- Pretraining without attention. CoRR, abs/2212.10544, 2022.
- Gated linear attention transformers with hardware-efficient training. CoRR, abs/2312.06635, 2023. doi: 10.48550/ARXIV.2312.06635. URL https://doi.org/10.48550/arXiv.2312.06635.
- Linear attention via orthogonal memory, 2023.
- The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry, 2024.
- Learning to update auto-associative memory in recurrent neural networks for improving sequence memorization. ArXiv, abs/1709.06493, 2017. URL https://api.semanticscholar.org/CorpusID:22458497.
Authors: Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong