How to Train Long-Context Language Models (Effectively) (2410.02660v1)

Published 3 Oct 2024 in cs.CL and cs.LG

Abstract: We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development -- instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.


Overview of the Paper: How to Train Long-Context Language Models (Effectively)

The paper studies how to train language models (LMs) to make effective use of long contexts, an ability needed for tasks that require understanding entire documents. It focuses on continued training and supervised fine-tuning (SFT), guided by a newly established evaluation protocol that goes beyond perplexity and simple needle-in-a-haystack (NIAH) tests.

Evaluation Protocol

The authors designed a systematic evaluation approach that emphasizes practical applications such as retrieval-augmented generation (RAG), long-document summarization, and many-shot in-context learning (ICL). Models are evaluated after SFT, because this better reveals their long-context abilities. The evaluation relies on the HELMET benchmark, which covers a diverse set of long-context tasks. The value of this broader protocol shows in its ability to expose limitations in models that score perfectly on simpler tests such as needle-in-a-haystack (NIAH).
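To make concrete why an NIAH test is considered simple, here is a minimal sketch of how such a test case can be built and scored. The function names `build_niah_prompt` and `score_niah`, the filler text, and the substring-match scoring are illustrative assumptions, not the paper's or HELMET's implementation.

```python
def build_niah_prompt(filler_sentence, needle, n_sentences, depth):
    """Hide `needle` inside repeated filler text (hypothetical helper).

    depth is the relative insertion position: 0.0 = start, 1.0 = end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    haystack = [filler_sentence] * n_sentences
    insert_at = round(depth * n_sentences)
    haystack.insert(insert_at, needle)
    return " ".join(haystack)


def score_niah(model_answer, expected):
    """Substring scoring (an assumption): 1.0 if the expected value appears."""
    return 1.0 if expected in model_answer else 0.0


if __name__ == "__main__":
    prompt = build_niah_prompt(
        filler_sentence="The sky is blue.",
        needle="The magic number is 7421.",
        n_sentences=10,
        depth=0.5,
    )
    print(prompt)
    print(score_niah("The magic number is 7421", "7421"))
```

The task only asks the model to copy one string out of otherwise irrelevant text, which is why a perfect score here says little about summarization, RAG, or many-shot ICL.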

Data Curation Strategies

The paper emphasizes the importance of the data mix, showing that code repositories and books are rich sources of long-context data. Combining this long data with high-quality short-context data, a mixture the authors call ShortMix, proved crucial for maintaining performance across context lengths. Contrary to some prior assumptions, training purely on long data hurt performance on both long- and short-context tasks, which underscores the need for a balanced mixture.
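As a rough illustration of what balancing a mixture means in practice, the sketch below splits a fixed token budget across sources according to relative weights. The helper `allocate_token_budget`, the source names, and the 3:3:4 weights are placeholders chosen for the example; they are not taken from the text above. Only the 40B-token total comes from the abstract.

```python
def allocate_token_budget(total_tokens, weights):
    """Split `total_tokens` across sources in proportion to integer weights.

    Hypothetical helper: `weights` maps a source name to a relative weight.
    Integer arithmetic keeps the split exact for round numbers.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        name: total_tokens * weight // total_weight
        for name, weight in weights.items()
    }


if __name__ == "__main__":
    # Illustrative weights only: two long-data sources plus a short-data mix.
    mixture = {"code_repos": 3, "books": 3, "short_mix": 4}
    budget = allocate_token_budget(40_000_000_000, mixture)
    for source, tokens in budget.items():
        print(f"{source}: {tokens:,} tokens")
```

Setting the short-data weight to zero reproduces the long-only setting that the paper reports as harmful.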

Scaling and Training Length

ProLong, the final model resulting from the paper's findings, was trained with a sequence length exceeding the evaluation length. The paper finds that training on sequences longer than the evaluation length improves long-context performance at that shorter evaluation length. The authors also examine how tuning the RoPE frequency base affects the model's long-context capabilities.
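For readers unfamiliar with the RoPE frequency base, the sketch below uses the standard RoPE definition, in which dimension pair i rotates with frequency base^(-2i/d), to show that raising the base stretches the longest rotation wavelength. The head dimension of 128 and the two base values are illustrative assumptions, not settings reported in the text above.

```python
import math


def rope_inv_frequencies(head_dim, base):
    """Standard RoPE rotation frequencies: base ** (-2i / head_dim) per pair i."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest = rope_inv_frequencies(head_dim, base)[-1]
    return 2 * math.pi / slowest


if __name__ == "__main__":
    head_dim = 128               # assumed head dimension
    small_base = 500_000.0       # illustrative base
    large_base = 8_000_000.0     # illustrative larger base
    short = longest_wavelength(head_dim, small_base)
    long_ = longest_wavelength(head_dim, large_base)
    print(f"longest wavelength, small base: {short:,.0f} positions")
    print(f"longest wavelength, large base: {long_:,.0f} positions")
    print(f"ratio: {long_ / short:.1f}x")
```

A larger base means the low-frequency dimensions complete fewer rotations over a long sequence, which is the usual motivation for increasing it when extending the context window.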

Supervised Fine-Tuning

The authors challenge the presumption that synthetic long-context datasets are required for effective SFT. Instead, they found that standard short-context instruction datasets suffice: ProLong achieves strong long-context performance when fine-tuned on UltraChat, even though its SFT stage uses only short instruction data.

Implications and Future Directions

The findings contribute a new state-of-the-art model in ProLong and also shift the emphasis toward careful data curation and evaluation methodology in long-context model training. As LMs evolve, these insights could guide the development of models adept at processing extended textual data, with applications in fields that require nuanced document comprehension, such as legal and academic text analysis.

Continued exploration into the interaction of training length and data mix, as well as the development of robust, open benchmarks like HELMET, will be crucial. There is an opportunity for future work to expand on integrating efficient architectures with long-context capabilities, potentially leading to more computationally sustainable solutions. Overall, this research provides a foundational framework for advancing the practical utility of LMs in long-context applications.