Focused Transformer: Contrastive Training for Context Scaling (2307.03170v2)

Published 6 Jul 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which comprises of (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of $3B$ and $7B$ OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a $256 k$ context length for passkey retrieval.

Authors (6)
  1. Szymon Tworkowski (7 papers)
  2. Konrad Staniszewski (6 papers)
  3. Mikołaj Pacek (2 papers)
  4. Yuhuai Wu (49 papers)
  5. Henryk Michalewski (42 papers)
  6. Piotr Miłoś (52 papers)
Citations (108)

Summary

An Analysis of the Focused Transformer: Context Scaling Through Contrastive Training

The research paper presents the Focused Transformer (FoT), a technique aimed at enhancing the context length of LLMs by addressing the distraction issue that arises in multi-document scenarios. The distraction issue refers to the model's difficulty in distinguishing relevant from irrelevant information as the number of documents in the context increases. The proposed method utilizes a contrastive learning-inspired training process to improve the representation of (key, value) pairs, thereby extending the effective context length of transformer models without altering their architecture.
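
To make the distraction issue concrete, the following toy calculation (not from the paper) shows how the softmax weight placed on a single relevant key collapses as more unrelated keys with comparable scores are added to the context:

```python
import numpy as np

def relevant_mass(n_irrelevant, relevant_score=3.0, irrelevant_score=2.0):
    # Scores are made up for illustration: one relevant key scores slightly
    # higher than many irrelevant ones.
    logits = np.array([relevant_score] + [irrelevant_score] * n_irrelevant)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights[0]  # softmax weight on the single relevant key

for n in (1, 10, 100, 1000):
    print(n, round(float(relevant_mass(n)), 4))
# roughly: 1 -> 0.73, 10 -> 0.21, 100 -> 0.03, 1000 -> 0.003
```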

Methodology

FoT introduces memory attention layers that use k-nearest-neighbor (kNN) lookup to retrieve additional (key, value) pairs from an external memory during inference. This lets the model pull relevant information from a large memory and thereby extends its usable context length. Unlike earlier memory-augmented transformers, the retrieved and local entries are handled within a single attention computation, eschewing gating mechanisms in favor of a simpler, and potentially more effective, design.
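
As a rough illustration of this mechanism, the sketch below implements a single-query memory attention step in NumPy: the top-k memory keys are retrieved by exact inner-product kNN and attended jointly with the local keys, with no gate. Function names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def memory_attention(q, local_k, local_v, mem_k, mem_v, k_neighbors=4):
    """q: (d,); local_k, local_v: (L, d); mem_k, mem_v: (M, d)."""
    # Retrieve the k most similar memory keys for this query (exact kNN).
    top = np.argsort(-(mem_k @ q))[:k_neighbors]
    # Attend jointly over local and retrieved entries: one softmax, no gate.
    keys = np.concatenate([local_k, mem_k[top]], axis=0)
    values = np.concatenate([local_v, mem_v[top]], axis=0)
    logits = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values  # (d,) attention output

rng = np.random.default_rng(0)
d = 64
out = memory_attention(
    rng.normal(size=d),
    rng.normal(size=(4, d)), rng.normal(size=(4, d)),        # local context
    rng.normal(size=(1000, d)), rng.normal(size=(1000, d)),  # external memory
)
print(out.shape)  # (64,)
```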

The crossbatch training procedure is the central innovation of FoT. It teaches the model to differentiate relevant from irrelevant keys by exposing the chosen attention layers to (key, value) pairs from both the pertinent context of the current document (positives) and unrelated contexts from other documents (negatives). Because this exposure is fully differentiable, the model gradually shapes its key, value, and query representations so that relevant keys are easy to separate from distractors.
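
The following PyTorch sketch conveys the crossbatch construction under simplified assumptions (single head, causal masking omitted): each batch element attends over its own current and previous local context plus the previous contexts of a few other documents, all inside one differentiable softmax. Names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def crossbatch_attention(q, k_curr, v_curr, k_prev, v_prev, num_negatives=2):
    """All tensors have shape (B, L, d); B is the batch of documents."""
    B, L, d = q.shape
    ext_k, ext_v = [], []
    for b in range(B):
        # Negatives: previous local contexts of other documents in the batch.
        neg = [(b + j) % B for j in range(1, num_negatives + 1)]
        # Positives: this document's current and previous local context.
        ext_k.append(torch.cat([k_curr[b], k_prev[b], *(k_prev[n] for n in neg)]))
        ext_v.append(torch.cat([v_curr[b], v_prev[b], *(v_prev[n] for n in neg)]))
    ext_k, ext_v = torch.stack(ext_k), torch.stack(ext_v)  # (B, (2+neg)*L, d)
    # One differentiable softmax over positives and negatives together
    # (causal masking within the current context is omitted for brevity).
    attn = F.softmax(q @ ext_k.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ ext_v  # (B, L, d)

B, L, d = 4, 8, 32
x = lambda: torch.randn(B, L, d)
out = crossbatch_attention(x(), x(), x(), x(), x())
print(out.shape)  # torch.Size([4, 8, 32])
```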

Results and Discussion

The authors demonstrate the efficacy of FoT through LongLLaMA, a family of models obtained by fine-tuning OpenLLaMA checkpoints. These models show clear gains on tasks requiring extended contexts, such as passkey retrieval, where they remain accurate at context lengths of up to 256k tokens. In few-shot learning on datasets such as TREC and WebQS, LongLLaMA also improves markedly as more demonstration examples are packed into the extended context.
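
For orientation, the toy script below constructs a passkey-retrieval prompt of the kind used in such evaluations: a short needle containing a random passkey is buried in filler text and the model is asked to repeat it. The exact wording and scale of the paper's evaluation may differ; this is only a schematic.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. "

def build_passkey_prompt(n_filler_sentences=1_000):
    passkey = str(random.randint(10_000, 99_999))
    needle = f"The pass key is {passkey}. Remember it."
    sentences = [s.strip() + "." for s in FILLER.split(".") if s.strip()]
    filler = [random.choice(sentences) for _ in range(n_filler_sentences)]
    insert_at = random.randrange(len(filler))  # hide the needle anywhere
    body = " ".join(filler[:insert_at] + [needle] + filler[insert_at:])
    prompt = body + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

prompt, answer = build_passkey_prompt()
print(len(prompt.split()), answer in prompt)  # a few thousand words, True
```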

The paper also highlights that FoT can fine-tune existing models for longer contexts without modifying their architecture. This distinguishes it from methods that require architectural changes: FoT leverages the pretrained model's capabilities and extends them through a relatively inexpensive fine-tuning stage.

Theoretical Implications

FoT addresses a critical challenge in scaling transformers to long contexts, namely the distraction issue. By borrowing elements of contrastive learning, the model develops a more structured key space that is better suited to long-context tasks. The method thus aligns with the existing literature on contrastive learning while extending its application to the training of attention layers, opening avenues for future research on adaptable long-context models.

Future Directions

The scalability of FoT suggests integration with approximate kNN search methods for further efficiency. Moreover, combining FoT with other long-context techniques, such as positional interpolation, could yield additional improvements. The crossbatch procedure itself could be refined with more advanced contrastive learning techniques to better structure the memory's (key, value) space.
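
As a sketch of the first direction, and assuming the FAISS library, the exact kNN lookup over memory keys could be replaced with an approximate inverted-file index; the parameters below (nlist, nprobe) are illustrative only, not values from the paper.

```python
import faiss
import numpy as np

d, n_memory = 64, 100_000
rng = np.random.default_rng(0)
mem_keys = rng.normal(size=(n_memory, d)).astype("float32")

# Inverted-file index: cluster the memory keys, then search only a few
# clusters (nprobe) per query instead of scanning the whole memory.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(mem_keys)
index.add(mem_keys)
index.nprobe = 16

queries = rng.normal(size=(8, d)).astype("float32")
scores, ids = index.search(queries, 16)  # approximate top-16 memory keys per query
print(ids.shape)                         # (8, 16)
```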

In conclusion, the Focused Transformer offers a compelling approach to extending the context length of LLMs through contrastive-inspired training. Its simplicity in implementation and effectiveness in extending context without architectural changes make it a promising addition to the toolbox for scaling transformer models in multi-document environments.
