Introduction
LLMs have exhibited impressive capabilities on complex tasks that involve extensive text, such as summarization, question answering, and coding. A crucial step in improving their performance is aligning models with long-context instructions, which typically demands instruction fine-tuning on lengthy input sequences. Current efforts in long-context LLMs center mainly on extending context windows, but a longer window alone is insufficient for instruction alignment over long texts.
Related Work
Long-context scaling methods fall into two primary categories: those requiring fine-tuning and those that do not. Training-free techniques include sliding window attention and neighboring token compression, which address the positional out-of-distribution issue in attention computations over lengthy contexts; however, these approaches still fall short of the performance achieved by fine-tuned models. Fine-tuned approaches, on the other hand, typically extend the position encoding and continually pretrain on longer sequences. As these models scale, aligning them with long instruction-following data during supervised fine-tuning becomes paramount to ensure they can handle diverse user requests in a chat interface.
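To make the training-free route concrete, the sketch below builds a sliding-window attention mask in PyTorch: each query token attends only to keys within a fixed window behind it, so relative distances never exceed what the model saw during training. This is a minimal illustration, not any specific model's implementation; the function name and window size are chosen here for clarity.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks key positions a query may attend to.

    Query token i attends only to keys in [i - window + 1, i], so the
    relative distance between any query/key pair stays within `window` --
    one way to sidestep positional out-of-distribution at long lengths.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (T, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, T)
    causal = j <= i                         # no attending to future tokens
    within = (i - j) < window               # stay inside the local window
    return causal & within

mask = sliding_window_mask(seq_len=8, window=3)  # (8, 8) boolean mask
```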
LongAlign
To effectively train LLMs for extended interactions with users, LongAlign introduces an integrated recipe for data construction, efficient training, and robust evaluation. Its first component is a dataset of 10k long instruction instances, with lengths ranging from 8k to 64k tokens, built from documents collected from nine diverse sources and paired with tasks generated via Self-Instruct. Training efficiency is improved through packing and sorted batching strategies, which reduce idle time in multi-GPU setups (a sketch of the packing idea follows below). Furthermore, to address the bias these strategies introduce, a loss weighting method ensures that sequences of varying lengths contribute in a balanced way to the loss. The evaluation benchmark, LongBench-Chat, brings realism into play with open-ended questions of 10k to 100k tokens, annotated by Ph.D. students.
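As a rough illustration of the packing strategy, the sketch below greedily concatenates tokenized examples into fixed-capacity packs so that short sequences do not waste GPU time as padding. The function and parameter names (`pack_greedily`, `max_len`) are hypothetical, not taken from the LongAlign codebase, and a real implementation would additionally build a block-diagonal attention mask so sequences within a pack cannot attend to one another.

```python
def pack_greedily(examples, max_len=65536):
    """Pack tokenized examples (lists of token ids) into bins of at most
    max_len tokens, longest-first. Each returned pack is a list of examples
    whose lengths sum to <= max_len; an over-long example gets its own pack.
    """
    packs, current, used = [], [], 0
    for ex in sorted(examples, key=len, reverse=True):
        if used + len(ex) > max_len and current:
            packs.append(current)        # close the full pack
            current, used = [], 0
        current.append(ex)
        used += len(ex)
    if current:
        packs.append(current)
    return packs
```

Sorted batching takes the complementary approach: sort examples by length and batch neighbors together, so every sequence in a batch has a similar length and little compute is spent on padding.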
LongAlign outperforms the standard baselines for long-context alignment by up to 30% on long-context tasks, without a concurrent drop in proficiency on short, generic tasks.
Findings and Contributions
LongAlign's efficacy is underlined by its success in scaling beyond the 64k token mark while maintaining performance on shorter contexts. Empirical experiments yield several insights:
- The quantity and diversity of long instruction data crucially influence a model's performance, affecting final long-context results by as much as 30%.
- The packing and sorted batching strategies more than double training speed relative to naive batching, without trading off performance.
- The loss weighting strategy improves long-context performance by 10% by correcting the bias that packing introduces into the loss calculation (see the sketch after this list).
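A minimal sketch of the loss-weighting idea follows, assuming labels use -100 for non-target tokens and a `seq_ids` tensor maps each token in a pack back to its source sequence (both illustrative conventions, not LongAlign's actual API). Averaging cross-entropy over all tokens in a pack weights long sequences more heavily than short ones; averaging within each sequence first, then across the K sequences of the batch, restores the equal per-sequence weighting of unpacked training.

```python
import torch
import torch.nn.functional as F

def weighted_packed_loss(logits, labels, seq_ids, num_sequences):
    """Per-sequence loss weighting for packed training (a sketch).

    logits: (T, V) token logits for one pack
    labels: (T,) target ids, -100 on non-target tokens
    seq_ids: (T,) hypothetical map from token to source sequence
    num_sequences: K, the total number of sequences in the batch
    Computes (1/K) * sum_i (1/N_i) * sum_j CE_ij.
    """
    token_loss = F.cross_entropy(logits, labels,
                                 reduction="none", ignore_index=-100)
    total = logits.new_zeros(())
    for sid in seq_ids.unique():
        mask = (seq_ids == sid) & (labels != -100)
        if mask.any():
            total = total + token_loss[mask].mean()  # (1/N_i) * sum_j CE_ij
    return total / num_sequences                     # average over K sequences
```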
Conclusion
LongAlign stands as a robust and efficient approach for aligning LLMs with long-context tasks. Its contributions span data collection, training methodology, and an evaluation benchmark that measures a model's aptitude for realistic long-context interactions. This work opens avenues for emerging tasks that require LLMs to understand extended texts in depth.