Focused Transformer: Contrastive Training for Context Scaling (2307.03170v2)
Abstract: Large language models (LLMs) have an exceptional ability to incorporate new information in a contextual manner. However, the full potential of this approach is often restrained by the limited effective context length. One solution is to endow an attention layer with access to an external memory, which comprises (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, in which keys linked to different semantic values may overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows fine-tuning of pre-existing large-scale models to lengthen their effective context, which we demonstrate by fine-tuning $3B$ and $7B$ OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a $256k$ context length for passkey retrieval.
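To make the memory-attention setup described above concrete, here is a minimal NumPy sketch (an illustrative assumption, not the paper's implementation) of a single query attending jointly over its local context and the top-k (key, value) pairs retrieved from an external memory; the function name `memory_attention`, the `top_k` parameter, and all dimensions are hypothetical choices for the example.

```python
# Sketch (assumed, simplified): single-query attention over local keys plus
# top-k (key, value) pairs retrieved from an external memory, as described
# in the abstract. A real system would use an ANN index (e.g. FAISS) for
# retrieval and batched multi-head attention.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def memory_attention(q, local_k, local_v, mem_k, mem_v, top_k=8):
    """Attend over local keys extended with the top_k most similar memory keys."""
    # Inner-product retrieval of the best-matching memory entries.
    scores = mem_k @ q
    idx = np.argsort(scores)[-top_k:]
    # Combine local context with retrieved (key, value) pairs.
    k = np.concatenate([local_k, mem_k[idx]], axis=0)
    v = np.concatenate([local_v, mem_v[idx]], axis=0)
    # Standard scaled dot-product attention over the combined key set.
    attn = softmax((k @ q) / np.sqrt(q.shape[-1]))
    return attn @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64
    q = rng.normal(size=d)
    local_k, local_v = rng.normal(size=(128, d)), rng.normal(size=(128, d))
    mem_k, mem_v = rng.normal(size=(4096, d)), rng.normal(size=(4096, d))
    out = memory_attention(q, local_k, local_v, mem_k, mem_v)
    print(out.shape)  # (64,)
```

The distraction issue arises in this setting because, as the memory grows, keys tied to semantically unrelated values increasingly resemble the relevant ones; FoT's contrastive-style training is intended to shape the (key, value) space so that such keys remain distinguishable at retrieval time.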
- Szymon Tworkowski
- Konrad Staniszewski
- Mikołaj Pacek
- Yuhuai Wu
- Henryk Michalewski
- Piotr Miłoś