Advancing Regular Language Reasoning in Linear Recurrent Neural Networks (2309.07412v2)
Abstract: In recent studies, linear recurrent neural networks (LRNNs) have achieved Transformer-level performance in natural language and long-range modeling, while offering rapid parallel training and constant inference cost. With the resurgence of interest in LRNNs, we study whether they can learn the hidden rules in training sequences, such as the grammatical structures of regular language. We theoretically analyze some existing LRNNs and discover their limitations in modeling regular language. Motivated by this analysis, we propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix. Experiments suggest that the proposed model is the only LRNN capable of performing length extrapolation on regular language tasks such as Sum, Even Pair, and Modular Arithmetic. The code is released at \url{https://github.com/tinghanf/RegluarLRNN}.
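Below is a minimal sketch (not the authors' implementation) of the core idea the abstract names: a linear recurrence h_t = A(x_t) h_{t-1} + B x_t in which the transition matrix A(x_t) is both block-diagonal and input-dependent. The 2x2 rotation blocks, parameter names, and shapes are illustrative assumptions, not details from the paper.

```python
# Sketch of an LRNN step with a block-diagonal, input-dependent transition
# matrix. Each 2x2 block is an input-conditioned rotation; this choice is
# purely illustrative.
import numpy as np

def block_diag_lrnn(x, W_theta, B, num_blocks):
    """Run the recurrence over a sequence x of shape (T, d_in)."""
    T, d_in = x.shape
    d_h = 2 * num_blocks                      # hidden size: one 2x2 block per pair of states
    h = np.zeros(d_h)
    outputs = []
    for t in range(T):
        # Input-dependent angle per block (assumed parameterization).
        theta = np.tanh(W_theta @ x[t])       # shape (num_blocks,)
        h_new = np.empty_like(h)
        for k in range(num_blocks):
            c, s = np.cos(theta[k]), np.sin(theta[k])
            # Apply the k-th 2x2 block to the corresponding state pair.
            a, b = h[2 * k], h[2 * k + 1]
            h_new[2 * k]     = c * a - s * b
            h_new[2 * k + 1] = s * a + c * b
        h = h_new + B @ x[t]                  # linear input injection
        outputs.append(h.copy())
    return np.stack(outputs)

# Usage with random parameters:
rng = np.random.default_rng(0)
T, d_in, num_blocks = 8, 4, 3
x = rng.normal(size=(T, d_in))
W_theta = rng.normal(size=(num_blocks, d_in))
B = rng.normal(size=(2 * num_blocks, d_in))
print(block_diag_lrnn(x, W_theta, B, num_blocks).shape)  # (8, 6)
```

Because each step is linear in h, a recurrence of this form can also be evaluated with a parallel scan over the sequence, which is what gives LRNNs their fast parallel training.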