Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation (2401.16421v2)
Abstract: In this work, we leverage the intrinsic segmentation of language sequences and design a new positional encoding method called Bilevel Positional Encoding (BiPE). For each position, our BiPE blends an intra-segment encoding and an inter-segment encoding. The intra-segment encoding identifies the locations within a segment and helps the model capture the semantic information therein via absolute positional encoding. The inter-segment encoding specifies the segment index, models the relationships between segments, and aims to improve extrapolation capabilities via relative positional encoding. Theoretical analysis shows this disentanglement of positional information makes learning more effective. The empirical results also show that our BiPE has superior length extrapolation capabilities across a wide range of tasks in diverse text modalities.
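To make the bilevel idea concrete, below is a minimal PyTorch sketch (not the authors' released implementation): each token gets an intra-segment index that is looked up in an absolute position embedding table, and an inter-segment index that is turned into a relative attention bias based on segment distance. The class name `BiPESketch`, the hyperparameters (`max_intra_len`, `max_segments`), and the T5/ALiBi-style learned scalar bias are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class BiPESketch(nn.Module):
    """Illustrative sketch of bilevel positional encoding (not the paper's code).

    Each token position is split into:
      * an intra-segment index -> looked up in an absolute position embedding table
      * an inter-segment index -> turned into a relative (segment-distance) attention bias
    """

    def __init__(self, hidden_dim: int, max_intra_len: int = 512, max_segments: int = 128):
        super().__init__()
        # Absolute embedding for positions *within* a segment (intra-segment encoding).
        self.intra_embed = nn.Embedding(max_intra_len, hidden_dim)
        # Learned scalar bias per signed segment distance (a T5/ALiBi-style stand-in,
        # assumed here for illustration; the paper may use a different relative scheme).
        self.inter_bias = nn.Embedding(2 * max_segments - 1, 1)
        self.max_segments = max_segments

    @staticmethod
    def split_positions(segment_ids: torch.Tensor):
        """segment_ids: (seq_len,) non-decreasing segment index per token."""
        intra = torch.zeros_like(segment_ids)
        for s in segment_ids.unique():
            mask = segment_ids == s
            intra[mask] = torch.arange(int(mask.sum()), device=segment_ids.device)
        return intra, segment_ids

    def forward(self, token_embeds: torch.Tensor, segment_ids: torch.Tensor):
        """token_embeds: (seq_len, hidden_dim). Returns (embeddings, attention bias)."""
        intra, inter = self.split_positions(segment_ids)
        # Intra-segment: add absolute position embeddings to the token embeddings.
        x = token_embeds + self.intra_embed(intra)
        # Inter-segment: relative bias indexed by signed segment distance.
        dist = (inter[:, None] - inter[None, :]).clamp(
            -self.max_segments + 1, self.max_segments - 1
        )
        attn_bias = self.inter_bias(dist + self.max_segments - 1).squeeze(-1)  # (seq, seq)
        return x, attn_bias


if __name__ == "__main__":
    # Six tokens grouped into three segments (e.g., sentences): [0, 0, 0, 1, 1, 2].
    pe = BiPESketch(hidden_dim=16)
    tokens = torch.randn(6, 16)
    seg = torch.tensor([0, 0, 0, 1, 1, 2])
    x, bias = pe(tokens, seg)
    print(x.shape, bias.shape)  # torch.Size([6, 16]) torch.Size([6, 6])
```

In an attention layer, `attn_bias` would be added to the query-key logits while `x` replaces the usual absolutely encoded token embeddings; the point of the sketch is only the separation of intra-segment and inter-segment positions, which the abstract identifies as the source of the improved extrapolation.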
Authors: Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Di He, Jingjing Xu, Zhi Zhang, Hongxia Yang, Liwei Wang