
Fewer Truncations Improve Language Modeling (2404.10830v2)

Published 16 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In LLM training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.

Fewer Truncations Improve Language Modeling: Introducing Best-fit Packing

Introduction to Best-fit Packing and Truncation Issues

The prevalent approach to preparing LLM training data concatenates input documents and then splits the resulting token stream into sequences of fixed length. This is efficient, but it truncates many documents, fragmenting exactly the content a model needs in order to learn coherent, factually consistent generation grounded in complete context. To counteract this, the authors propose Best-fit Packing, which reframes sequence packing as a combinatorial optimization problem: it eliminates unnecessary truncations while remaining efficient and scalable, and it yields better performance and less hallucination across text and code pre-training. A minimal sketch of the conventional baseline and its truncation behavior follows.
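The sketch below is illustrative Python rather than the paper's code: the documents are toy token lists and the helper name concat_and_split is assumed here purely for exposition. It shows how concatenate-then-split inevitably cuts documents that straddle a sequence boundary.

```python
def concat_and_split(docs, max_len):
    """Concatenate tokenized documents, then cut into fixed-length sequences."""
    stream = []
    boundaries = []                     # (start, end) position of each document
    for doc in docs:
        start = len(stream)
        stream.extend(doc)
        boundaries.append((start, len(stream)))
    sequences = [stream[i:i + max_len] for i in range(0, len(stream), max_len)]
    # A document is truncated if it straddles a sequence boundary.
    truncated = sum(1 for s, e in boundaries if s // max_len != (e - 1) // max_len)
    return sequences, truncated


docs = [[1] * 300, [2] * 900, [3] * 1500, [4] * 100]    # toy "tokenized" documents
seqs, n_trunc = concat_and_split(docs, max_len=1024)
print(f"{len(seqs)} sequences, {n_trunc} of {len(docs)} documents truncated")
```

In this toy run, two of the four documents are split across sequence boundaries even though all but one of them would fit comfortably inside a single sequence.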

Best-fit Packing: A Methodological Advancement

Best-fit Packing first segments any document longer than the model's maximum sequence length into chunks that fit within that limit. These chunks are then packed into training sequences without any further splitting, so each sequence preserves as much contiguous context as possible. The packing is cast as a bin-packing problem and solved with an optimized Best-Fit Decreasing algorithm that remains scalable and keeps training efficiency comparable to concatenation; the authors report roughly a 60% runtime improvement at the billion-document scale while achieving compactness on par with traditional techniques. A simplified sketch of the packing step appears below.
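The following is a minimal Best-Fit Decreasing sketch under simplifying assumptions: chunks are represented only by their token counts, and the best-fitting open sequence is found by a linear scan, whereas the paper's optimized variant replaces that scan with a faster lookup to reach billion-document scale. The function name and toy lengths are illustrative, not the paper's implementation.

```python
def best_fit_decreasing(chunk_lengths, max_len):
    """Pack chunks (each <= max_len) into training sequences of capacity max_len."""
    bins = []                           # remaining capacity of each open sequence
    assignment = {}                     # chunk index -> sequence index
    for i in sorted(range(len(chunk_lengths)),
                    key=chunk_lengths.__getitem__, reverse=True):
        length = chunk_lengths[i]
        # Best fit: pick the open sequence whose leftover space would be smallest.
        best, best_left = None, None
        for b, cap in enumerate(bins):
            left = cap - length
            if left >= 0 and (best_left is None or left < best_left):
                best, best_left = b, left
        if best is None:                # nothing fits: open a new sequence
            bins.append(max_len - length)
            best = len(bins) - 1
        else:
            bins[best] -= length
        assignment[i] = best
    return bins, assignment


lengths = [700, 400, 900, 300, 650, 120]        # toy chunk lengths, all <= 1024
bins, assignment = best_fit_decreasing(lengths, max_len=1024)
print(f"{len(bins)} training sequences; leftover capacities: {bins}")
```

Because every chunk already fits within the maximum length, no chunk is ever split again; the only cost of packing is the small amount of unused capacity left in each sequence.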

Empirical Validation and Performance Metrics

The empirical validation involved pre-training models on both text and code datasets and evaluating them across a spectrum of tasks, including reading comprehension, natural language inference, context following, and program synthesis. Key findings are:

  • Performance Improvement: Relative improvements of up to +16.8% in context following tasks and +15.0% in program synthesis, validating that fewer truncations correlate with better model performance.
  • Reduction in Hallucination: Effective reduction in closed-domain hallucination by up to 58.3%, crucial for tasks like program synthesis where factual accuracy is paramount.
  • Scalability and Efficiency: Demonstrated scalability to billions of documents while maintaining compactness and computational efficiency similar to the concatenation approach.

Theoretical Insights and Analytical Validation

The paper also develops a simplified stochastic model of learning from truncated data, showing analytically why truncation degrades model accuracy. This analysis supports the empirical observations: training on truncated documents leads to inferior learning outcomes even when data availability is not a constraint. A toy illustration of that intuition follows.
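As an illustration only (an assumption of this summary, not the paper's exact analytical model), consider a toy task in which a target token simply copies a context bit: a learner that always sees the context is perfect, while one whose context has been truncated away can do no better than guessing the majority class, and more data never closes the gap.

```python
# Toy illustration (not the paper's stochastic model): y copies the context bit x.
import random

random.seed(0)
data = [(x, x) for x in (random.randint(0, 1) for _ in range(100_000))]

# Full-context learner: it sees x and predicts y = x, so it is always correct.
acc_full = sum(x == y for x, y in data) / len(data)

# Truncated learner: x was cut away during training, so the best it can do is
# predict the majority label, which is right only about half the time here.
majority = round(sum(y for _, y in data) / len(data))
acc_truncated = sum(majority == y for _, y in data) / len(data)

print(f"full context: {acc_full:.3f}   truncated: {acc_truncated:.3f}")
```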

Future Directions in LLM Training

Best-fit Packing potentially sets a precedent for future LLM training methodologies that prioritize data integrity without compromising efficiency. It opens avenues for exploring additional data packing strategies and their integration into standard LLM training pipelines. Additionally, this approach could enhance not only base model pre-training but also task-specific fine-tuning phases.

Conclusion: Towards More Coherent, Less Hallucination-Prone LLMs

In summary, Best-fit Packing addresses a critical flaw in the traditional LLM training regimen by mitigating excessive document truncation, thus enhancing logical coherence and factual consistency across model outputs. This method not only supports existing findings regarding the importance of comprehensive context in model training but also pioneers an efficient, scalable solution to a previously overlooked but significant problem.

Authors (7)
  1. Hantian Ding (11 papers)
  2. Zijian Wang (99 papers)
  3. Giovanni Paolini (28 papers)
  4. Varun Kumar (35 papers)
  5. Anoop Deoras (21 papers)
  6. Dan Roth (222 papers)
  7. Stefano Soatto (179 papers)
Citations (8)
