Functional Interpolation for Relative Positions Improves Long Context Transformers (2310.04418v2)
Abstract: Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture fundamentally places no limit on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose FIRE, a novel functional relative position encoding with progressive interpolation, to improve Transformer generalization to longer contexts. We theoretically prove that FIRE can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We then empirically show that FIRE models generalize better to longer contexts on both zero-shot language modeling and long-text benchmarks.
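To make the idea concrete, here is a minimal NumPy sketch of a FIRE-style attention bias, assuming the form described in the paper: a small MLP maps the relative distance i − j, transformed by ψ(x) = log(cx + 1) and normalized by ψ of the (thresholded) query position, to one bias per attention head, so the MLP's input stays bounded as the context grows (the "progressive interpolation"). The MLP size, the tanh activation, the fixed threshold, and names such as `FIREBias` and `psi` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a FIRE-style relative position bias (illustrative, not official).
import numpy as np

def psi(x, c=1.0):
    """Monotonic transform that compresses large distances (assumed log form)."""
    return np.log(c * x + 1.0)

class FIREBias:
    def __init__(self, hidden_dim=32, num_heads=8, threshold=2.0, seed=0):
        rng = np.random.default_rng(seed)
        # Tiny MLP mapping a scalar normalized distance to one bias per head.
        self.w1 = rng.normal(scale=0.1, size=(1, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(scale=0.1, size=(hidden_dim, num_heads))
        self.b2 = np.zeros(num_heads)
        self.threshold = threshold  # learnable in the paper; fixed here for simplicity

    def __call__(self, seq_len):
        i = np.arange(seq_len)[:, None]           # query positions
        j = np.arange(seq_len)[None, :]           # key positions
        rel = np.maximum(i - j, 0).astype(float)  # causal relative distance
        # Progressive interpolation: normalize by the (thresholded) query position
        # so inputs to the MLP stay in a bounded range at any context length.
        denom = psi(np.maximum(i, self.threshold).astype(float))
        x = psi(rel) / denom                      # (seq_len, seq_len)
        h = np.tanh(x[..., None] @ self.w1 + self.b1)
        bias = h @ self.w2 + self.b2              # (seq_len, seq_len, num_heads)
        return np.transpose(bias, (2, 0, 1))      # (num_heads, seq_len, seq_len)

# Usage: compute the bias for a 16-token context.
bias = FIREBias()(seq_len=16)
print(bias.shape)  # (8, 16, 16)
```

In a full model, the resulting per-head bias matrix would be added to the pre-softmax attention logits.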
- CoLT5: Faster long-range transformers with conditional computation. arXiv preprint arXiv:2303.09752, 2023.
- Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 35:38546–38556, 2022.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Induced natural language rationales and interleaved markup tokens enable extrapolation in large language models. In Proceedings of the 1st Workshop on Mathematical Natural Language Processing (MathNLP), pp. 17–24, 2022.
- A simple and effective positional encoding for transformers. arXiv preprint arXiv:2104.08698, 2021.
- Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- KERPLE: Kernelized relative positional embedding for length extrapolation. Advances in Neural Information Processing Systems, 35:8386–8399, 2022.
- Dissecting transformer length extrapolation via the lens of receptive field analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13522–13537, 2023.
- Learning a Fourier transform for linear relative positional encodings in transformers. arXiv preprint arXiv:2302.01925, 2023.
- PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- Monotonic location attention for length generalization. arXiv preprint arXiv:2305.20019, 2023.
- Conditional positional encodings for vision transformers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3KWnuT-R1bh.
- On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584, 2019.
- Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
- Neural networks and the Chomsky hierarchy. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WbxHAzkeQcn.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
- Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
- Location attention for extrapolation to longer sequences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 403–413, 2020.
- Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654, 2023.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- LongT5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pp. 724–736, 2022.
- G-Mixup: Graph data augmentation for graph classification. In International Conference on Machine Learning, pp. 8230–8248. PMLR, 2022.
- Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 1382–1390, 2022.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp. 694–711. Springer, 2016.
- The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, 2023.
- Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=09-528y2Fgf.
- Continual pre-training of language models. In The Eleventh International Conference on Learning Representations, 2022.
- Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 284–294, 2018.
- Constituency parsing with a self-attentive encoder. arXiv preprint arXiv:1805.01052, 2018.
- The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.
- Can vision transformers perform convolution? arXiv preprint arXiv:2111.01353, 2021.
- Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
- Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
- Relative positional encoding for transformers with linear complexity. In International Conference on Machine Learning, pp. 7067–7079. PMLR, 2021.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
- Stable, fast and accurate: Kernelized attention with relative positional encoding. Advances in Neural Information Processing Systems, 34, 2021.
- Your transformer may not be as powerful as you expect. Advances in Neural Information Processing Systems, 35:4301–4315, 2022.
- Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
- Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0.
- Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 300–325, 2021.
- SCROLLS: Standardized comparison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 12007–12021, 2022.
- Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2074. URL https://aclanthology.org/N18-2074.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010, 2017.
- SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019a.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=rJ4km2R5t7.
- On position embeddings in BERT. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=onxoVA9FxMw.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34, 2021.
- Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2019.
- mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.
- PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pp. 11328–11339. PMLR, 2020a.
- DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 270–278, 2020b.
- Complex reasoning in natural language. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pp. 11–20, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-tutorials.2. URL https://aclanthology.org/2023.acl-tutorials.2.
Authors: Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli