Enhancing LLMs for Extended Context Understanding through Data Engineering Strategies
Introduction to Extended Context Capacity in LLMs
LLMs have steadily evolved, displaying remarkable capabilities in generating coherent and contextually relevant text. Recent advancements have pushed the boundaries further by expanding the context window of these models to 128K tokens. Such an expansion enables applications that were previously infeasible, including multi-document comprehension, in-depth code analysis, and long-running dialogue systems. Central to this progression is not just the advancement in model architecture but, significantly, the meticulous engineering of the data that feeds these models.
Data Engineering: The Core of Scaling Context
The ability of LLMs to locate and use information anywhere within a vastly extended context is fundamental to their success on long-context tasks. The challenge is not merely extending the model's capacity to absorb longer inputs but ensuring the model can effectively utilize this expanded horizon. This hinges on the careful selection, allocation, and engineering of training data, a process that is crucial yet complex given the models' already massive scale.
Quantitative and Qualitative Data Considerations
To scale LLMs to extended context lengths effectively, both the quantity and the quality of the data are pivotal. Quantitatively, the research indicates that continual pretraining on roughly 500 million to 5 billion tokens suffices for these models to harness long contexts effectively. Qualitatively, the balance of domains within the training data and a methodical approach to upsampling long documents emerge as the critical factors. Notably, naively upsampling longer texts, a common practice that skews the data toward the few domains rich in long documents, results in subpar model performance. Instead, the recommended strategy is to keep the domain mixture fixed while upsampling long sequences within each domain. This preserves domain diversity, which is imperative for the model's general applicability.
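The per-domain upsampling idea can be made concrete with a short sketch. The snippet below is a minimal illustration rather than the paper's exact recipe: the record format, length threshold, boost factor, and token budget are all hypothetical. Each domain keeps its original share of the token budget, so the domain mixture is preserved, while sampling within each domain is biased toward long documents.

```python
import random
from collections import defaultdict

# Minimal sketch of per-domain length upsampling. These knobs are
# illustrative assumptions, not values from the paper.
# Example record: {"domain": "code", "tokens": 70_000, "text": "..."}
LONG_THRESHOLD = 32_000        # tokens above which a document counts as "long"
LONG_BOOST = 5.0               # sampling-weight multiplier for long documents
TOTAL_BUDGET = 1_000_000_000   # total training tokens to draw (~1B)

def per_domain_length_upsample(docs, total_budget=TOTAL_BUDGET):
    """Draw a training sample that keeps each domain's original token share
    (preserving the domain mixture) while oversampling long documents
    within every domain."""
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)

    # Each domain's token share in the raw corpus fixes its share of the budget.
    domain_tokens = {dom: sum(d["tokens"] for d in ds) for dom, ds in by_domain.items()}
    corpus_tokens = sum(domain_tokens.values())

    sample = []
    for dom, ds in by_domain.items():
        budget = total_budget * domain_tokens[dom] / corpus_tokens
        # Bias sampling toward long documents, but only within this domain.
        weights = [LONG_BOOST if d["tokens"] >= LONG_THRESHOLD else 1.0 for d in ds]
        drawn = 0
        while drawn < budget:
            doc = random.choices(ds, weights=weights, k=1)[0]
            sample.append(doc)
            drawn += doc["tokens"]
    return sample
```

The key design choice, in contrast to naive upsampling, is that the weights never leak across domains: a domain rich in long documents cannot crowd out the others.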
Experimental Insights and Achievements
The advocated data engineering strategy significantly narrows the performance gap between open-source models and state-of-the-art proprietary models such as GPT-4 128K. By carefully adjusting the training data, specifically the length distribution and domain balance of the long-context data, researchers managed not only to retain but to improve the model's performance on extended-context tasks. These range from synthetic retrieval tests, such as the Needle-in-a-Haystack test, to real-world applications such as BookQA, demonstrating the model's accuracy and versatility across context lengths.
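As a rough illustration of how such a retrieval test is constructed, the sketch below builds a Needle-in-a-Haystack prompt: a single known fact (the needle) is inserted at a chosen depth in long filler text (the haystack), and the model is asked to recall it. The filler, needle, and question strings are placeholders, and the character-based length control is a simplification of the token-based budgeting real evaluations use.

```python
# Minimal sketch of a Needle-in-a-Haystack test case. All strings here
# are placeholders, not the benchmark's actual contents.
FILLER = "The grass is green. The sky is blue. The sun is bright. "
NEEDLE = "The secret passphrase is 'violet meridian'. "
QUESTION = "What is the secret passphrase mentioned in the text above?"

def build_haystack_prompt(context_chars: int, depth: float) -> str:
    """Bury the needle at `depth` (0.0 = start, 1.0 = end) of a haystack of
    roughly `context_chars` characters, then append the question."""
    haystack = (FILLER * (context_chars // len(FILLER) + 1))[:context_chars]
    insert_at = int(len(haystack) * depth)
    return haystack[:insert_at] + NEEDLE + haystack[insert_at:] + "\n\n" + QUESTION

# Sweep the needle across depths to map retrieval accuracy over the context.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack_prompt(context_chars=400_000, depth=depth)
    # send `prompt` to the model under test and score its answer here
```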
The Future of Extended Context in AI
The implications of these findings are far-reaching for both the theory and practice of AI. Extending LLMs' understanding to contexts well beyond traditional limits opens new avenues in research and application, including stronger multi-document comprehension and deeper insight across vast datasets, potentially transforming how information is processed, understood, and generated by AI.
Conclusion
The effort to scale LLMs to contexts of up to 128K tokens underscores the significance of data engineering. Through a combination of quantitative adequacy (enough long-context tokens) and qualitative balance (a preserved domain mixture), this work has not only narrowed the gap to the leading frontier models but also set a precedent for future explorations in AI. As the field continues to progress, refining data engineering techniques will remain at the forefront, paving the way for even more capable and versatile LLMs.