Structured Packing in LLM Training Improves Long Context Utilization (2312.17296v7)

Published 28 Dec 2023 in cs.CL

Abstract: Recent advancements in long-context LLMs have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. This study investigates structuring training data to enhance semantic interdependence, demonstrating that this approach effectively improves context utilization. To this end, we introduce the Structured Packing for Long Context (SPLiCe) method, which utilizes retrieval to collate mutually relevant documents into long and coherent training examples. We validate SPLiCe empirically across models of varying sizes -- 3B, 7B, and 13B -- achieving improved performance in long-context tasks, such as Qasper and HotpotQA. Remarkably, even brief fine-tuning with SPLiCe is sufficient to realize these benefits. Additionally, SPLiCe effectively mitigates the lost-in-the-middle phenomenon often observed in large models. Our comprehensive analysis of SPLiCe explores its design choices and reveals intriguing transfer effects; for instance, training on programming code enhances performance on natural language tasks.
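
The core idea described in the abstract is to build each long training example from retrieved, mutually relevant documents rather than from randomly concatenated ones. The sketch below illustrates that retrieval-based packing idea only; it is not the authors' SPLiCe implementation. The `pack_examples` helper, the bag-of-words cosine similarity, and the character-based length budget are simplifying assumptions for illustration (a real setup would use a proper retriever and a token budget).

```python
# Illustrative sketch of retrieval-based example packing (not the authors' SPLiCe code).
# Assumption: similarity is a plain bag-of-words cosine; the paper relies on a real retriever.
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pack_examples(docs: list[str], target_len: int) -> list[str]:
    """Greedily group mutually relevant documents into long, coherent training examples."""
    vecs = [Counter(d.lower().split()) for d in docs]
    unused = set(range(len(docs)))
    examples = []
    while unused:
        seed = unused.pop()                       # start a new example from any unused document
        chunk, length = [docs[seed]], len(docs[seed])
        while length < target_len and unused:
            # "retrieve" the unused document most similar to the seed document
            best = max(unused, key=lambda i: cosine(vecs[seed], vecs[i]))
            unused.remove(best)
            chunk.append(docs[best])
            length += len(docs[best])
        examples.append("\n\n".join(chunk))       # concatenate into one packed example
    return examples


if __name__ == "__main__":
    corpus = [
        "sorting algorithms in python",
        "python quicksort implementation",
        "history of jazz music",
        "jazz improvisation basics",
    ]
    for ex in pack_examples(corpus, target_len=60):
        print("---\n" + ex)
```

In the paper's framing, it is this retrieval step that makes a packed example semantically coherent, which is the property hypothesized to improve long-context utilization compared with packing unrelated documents into the same context window.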

Authors (7)
  1. Konrad Staniszewski
  2. Szymon Tworkowski
  3. Sebastian Jaszczur
  4. Henryk Michalewski
  5. Łukasz Kuciński
  6. Yu Zhao
  7. Piotr Miłoś
Citations (12)
