Provable Length Generalization in Sequence Prediction via Spectral Filtering (2411.01035v1)
Published 1 Nov 2024 in cs.LG, cs.AI, and cs.CL
Abstract: We consider the problem of length generalization in sequence prediction. We define a new performance metric for this setting -- the Asymmetric-Regret -- which measures regret against a benchmark predictor with a longer context length than is available to the learner. We study this concept through the lens of the spectral filtering algorithm and present a gradient-based learning algorithm that provably achieves length generalization for linear dynamical systems. We conclude with proof-of-concept experiments that are consistent with our theory.
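Informally, the Asymmetric-Regret compares a learner restricted to a short context against a benchmark class allowed a longer one. A plausible formalization, paraphrasing the abstract (the paper's precise definition may differ): with losses $\ell_t$, a learner whose prediction $\hat{y}_t$ uses only the last $L$ inputs, and a benchmark class $\Pi_{L'}$ of predictors with context length $L' > L$,

$$\mathrm{AsymmetricRegret}_T \;=\; \sum_{t=1}^{T} \ell_t(\hat{y}_t) \;-\; \min_{\pi \in \Pi_{L'}} \sum_{t=1}^{T} \ell_t\big(\pi(u_{t-L'}, \dots, u_{t-1})\big).$$

For readers unfamiliar with the spectral filtering method the paper builds on (Hazan et al., 2017), the sketch below shows its core predictor: the input history is projected onto the top eigenvectors of a fixed Hankel matrix, and the prediction is a learned linear function of those filtered features. The function names, the scalar input/output setting, and the omission of autoregressive terms are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def hankel_matrix(T: int) -> np.ndarray:
    """Hankel matrix Z_T with entries Z[i, j] = 2 / ((i + j)^3 - (i + j))
    for 1-indexed i, j; its top eigenvectors are the spectral filters."""
    s = np.add.outer(np.arange(1, T + 1), np.arange(1, T + 1))
    return 2.0 / (s**3 - s)

def top_filters(T: int, K: int):
    """Top-K eigenvalues/eigenvectors of Z_T, in descending order."""
    vals, vecs = np.linalg.eigh(hankel_matrix(T))  # eigh returns ascending order
    return vals[::-1][:K], vecs[:, ::-1][:, :K]

def predict(u_hist: np.ndarray, M: np.ndarray, sigma: np.ndarray,
            phi: np.ndarray) -> float:
    """One-step prediction: y_hat = sum_k M[k] * sigma_k^(1/4) * <phi_k, u_hist>,
    where u_hist holds the last T inputs, most recent first. M is learned
    online (e.g., by gradient descent on the prediction loss). This sketch
    is scalar-in/scalar-out and omits autoregressive terms."""
    feats = (sigma ** 0.25) * (phi.T @ u_hist)  # K filtered features
    return float(M @ feats)

# Usage: build the filters once, then predict from a rolling input history.
T, K = 64, 8
sigma, phi = top_filters(T, K)
M = np.zeros(K)                      # would be learned from data
u_hist = np.random.randn(T)          # placeholder input history
y_hat = predict(u_hist, M, sigma, phi)
```

The filters are fixed in advance and decay smoothly, so the same learned weights can in principle be applied to longer histories than were seen during training; this is the intuition behind connecting spectral filtering to length generalization.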