Effective Long-Context Scaling of Foundation Models: A Comprehensive Evaluation
The paper "Effective Long-Context Scaling of Foundation Models" provides an in-depth paper of extending LLMs to utilize longer contexts effectively. The authors present a series of LLMs derived through continual pretraining from the Llama 2 base, achieving effective context windows of up to 32,768 tokens. The research is particularly notable for its methodological approach, strong empirical results, and the extensive evaluation conducted across diverse benchmarks.
Methods
The authors adopt continual pretraining as their central strategy, continuing to train Llama 2 models on long-sequence inputs rather than pretraining from scratch. The 7B and 13B variants are continually pretrained on 32,768-token sequences, whereas the 34B and 70B variants use 16,384-token sequences. This choice addresses the computational cost of training with longer sequences, balancing sequence length against training efficiency.
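To make the training setup concrete, the sketch below shows one common way of packing tokenized documents into fixed-length training sequences of the kind used for continual pretraining. The pack_sequences helper and its interface are illustrative assumptions, not code from the paper.

```python
from typing import Iterable, List

def pack_sequences(token_stream: Iterable[List[int]], seq_len: int) -> List[List[int]]:
    """Concatenate tokenized documents and slice the stream into fixed-length
    training sequences (e.g., 32,768 tokens for the 7B/13B variants, 16,384
    for 34B/70B). A trailing remainder shorter than seq_len is dropped here
    for simplicity."""
    buffer: List[int] = []
    packed: List[List[int]] = []
    for doc_tokens in token_stream:
        buffer.extend(doc_tokens)
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed

# Toy example with seq_len=8 instead of 32,768:
docs = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12]]
print(pack_sequences(docs, seq_len=8))  # [[1, 2, 3, 4, 5, 6, 7, 8]]
```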
Key to this approach is a modification of the rotary positional embedding (RoPE) used by Llama 2. The paper shows that the original RoPE parameterization decays attention between distant positions too aggressively, limiting the model's ability to aggregate long-range information. By increasing RoPE's base frequency, the authors mitigate this limitation and improve the model's handling of extended contexts.
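The sketch below illustrates the adjusted-base-frequency idea in a minimal RoPE implementation: raising the base (Llama 2 uses 10,000; the paper raises it substantially, e.g. to 500,000) shrinks the rotation angles of the low-frequency dimensions, so attention between distant positions decays less. The function names and the de-interleaved output layout are choices of this sketch, not the paper's code.

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float) -> torch.Tensor:
    """Per-position rotation angles for RoPE; a larger base yields smaller
    angles in the low-frequency dimensions at the same position."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate each (even, odd) feature pair of x, shape (seq_len, head_dim),
    by the corresponding angle; the two rotated halves are concatenated
    rather than re-interleaved, which is sufficient for a sketch."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Larger base -> slower rotation of the lowest-frequency dimension.
print(rope_angles(4096, 128, base=10_000)[-1, -1].item())
print(rope_angles(4096, 128, base=500_000)[-1, -1].item())
```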
A significant insight from the paper is that the data mix plays a crucial role in achieving long-context capability. The authors show empirically that long-context performance does not depend primarily on having abundant long texts in the pretraining data; instead, the quality of the data and design choices in the training curriculum, such as progressively increasing the sequence length, have a larger impact.
Empirical Evaluation
The evaluation covers both short- and long-context tasks, with the models showing significant improvements over their predecessors. On short-context benchmarks, gains appear particularly in coding, mathematics, and knowledge-intensive domains. On long-context tasks, evaluated with datasets such as NarrativeQA, QuALITY, Qasper, and QMSum, the proposed models outperform existing open-source long-context models such as Focused Transformer and the MPT series.
Notably, the research highlights context length as a scaling axis for LLMs, showing that validation loss continues to improve with longer contexts following an approximate power law, which further supports continual pretraining as an efficient strategy.
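For illustration, such a power-law fit takes a form like the following; the parameterization is an illustrative assumption rather than the paper's exact formula or fitted constants:

$$\mathcal{L}(c) \;\approx\; \alpha\, c^{-\beta} + \gamma, \qquad \alpha, \beta, \gamma > 0,$$

where $c$ is the context length, $\gamma$ is an irreducible loss floor, and $\beta$ governs how quickly additional context yields diminishing returns.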
Instruction Tuning
In addition to pretraining, the authors explore a cost-effective instruction tuning procedure that requires no human-annotated long-context data. The finetuning set is augmented with synthetic examples generated by the Llama 2 Chat model, focusing on question-answering formats over long documents. The resulting chat model outperforms gpt-3.5-turbo-16k on several long-context tasks, such as question answering and summarization.
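A minimal sketch of how such self-instruct finetuning data might be assembled is shown below. The prompt template, the generate_qa placeholder, and the convention of training only on the answer tokens are assumptions of this sketch rather than details taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class FinetuneExample:
    prompt: str   # full long document plus a generated question
    target: str   # generated answer; commonly the only tokens the loss is computed on

def generate_qa(chunk: str) -> Tuple[str, str]:
    """Placeholder for querying a chat model (e.g., Llama 2 Chat) with an
    instruction such as 'Write a question answerable from this passage,
    along with its answer.'"""
    return ("What is the passage about?", "It is a placeholder passage.")

def build_example(long_document: str, chunk: str) -> FinetuneExample:
    """Generate a QA pair from one chunk, then pair the question with the
    entire long document so the model must locate the answer in context."""
    question, answer = generate_qa(chunk)
    prompt = f"{long_document}\n\nQuestion: {question}\nAnswer:"
    return FinetuneExample(prompt=prompt, target=f" {answer}")

example = build_example(long_document="...full long document...", chunk="...one chunk...")
print(example.prompt[-40:], "->", example.target)
```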
Theoretical and Practical Implications
The paper's findings have several implications. Practically, they pave the way for more robust, cost-effective deployment of LLMs in applications requiring extensive context comprehension, such as contract analysis, complex knowledge synthesis, and multi-turn dialogue. Theoretically, they offer an instructive examination of positional-encoding design and of how model performance scales with context length.
Conclusion
"Effective Long-Context Scaling of Foundation Models" provides substantial contributions to the field of natural language processing, especially in enhancing the performance of LLMs in applications needing extensive contextual comprehension. By effectively leveraging continual pretraining and adapting positional encodings, the authors demonstrate significant advances in the ability of LLMs to understand and generate content based on long sequences. Future research directions could further explore optimizing these methods for real-world deployments, improving tokenizer efficiency, and addressing the models’ susceptibility to hallucination in extended contexts. This paper establishes foundational insights for subsequent work on scaling and applying LLMs in diverse and complex domains.