
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization (2502.13922v3)

Published 19 Feb 2025 in cs.CL and cs.LG

Abstract: LLMs have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, that enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales. Our code is available at https://github.com/DAMO-NLP-SG/LongPO.

Summary

LongPO: Enhancing Long-Context Capabilities of LLMs

The paper "LongPO: Long Context Self-Evolution of LLMs through Short-to-Long Preference Optimization" introduces a method for closing the gap between LLMs' short-context and long-context performance. While LLMs acquire strong short-context capabilities through pretraining and alignment, their long-context capabilities often remain under-optimized. LongPO addresses this by letting an LLM internally transfer its short-context proficiency to long-context tasks, without relying on costly human annotation of extended contexts.

Problem Background

A primary challenge highlighted by the authors is that conventional approaches to improving long-context ability, such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), struggle to preserve short-context performance while scaling to long-context tasks. The scarcity of high-quality annotated long-context data compounds the problem, since human annotation of extended contexts is largely impractical and inefficient.

The LongPO Approach

LongPO stands out by operating without any new annotations from human experts. Instead, it leverages self-generated short-to-long preference data: paired responses to the same instruction, one produced from the full long-context input and one from its compressed short-context counterpart. This pairing surfaces capabilities cultivated during short-context alignment that are diminished when the model operates on under-aligned long contexts.
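
A minimal sketch of how such a preference pair could be assembled is shown below. The `generate` and `compress` callables, the `PreferencePair` container, and the chosen/rejected labeling are illustrative assumptions, not the paper's exact data pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    """One self-generated short-to-long preference example (hypothetical container)."""
    long_context: str   # full long document used as the training-time input
    instruction: str    # identical instruction shared by both responses
    chosen: str         # response produced from the compressed short context
    rejected: str       # response produced from the full long context


def build_preference_pair(
    generate: Callable[[str, str], str],  # generate(context, instruction) -> response
    compress: Callable[[str], str],       # maps the long document to a short, relevant excerpt
    long_context: str,
    instruction: str,
) -> PreferencePair:
    """Answer the same instruction twice with the same short-context model:
    once from a compressed excerpt (where the model is well aligned) and once
    from the full long input (where it is under-aligned). The short-context
    response serves as the preferred ("chosen") answer."""
    short_context = compress(long_context)
    chosen = generate(short_context, instruction)
    rejected = generate(long_context, instruction)
    return PreferencePair(long_context, instruction, chosen, rejected)
```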

The LongPO framework also introduces a short-to-long constraint based on Kullback-Leibler (KL) divergence, which keeps the evolving policy close to the model's short-context behavior and thereby limits degradation of short-context performance during long-context alignment.
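
The sketch below illustrates one plausible form of this objective: a DPO-style preference loss on the short-to-long pairs plus a sequence-level penalty that discourages the policy's short-context log-likelihoods from drifting below those of a frozen reference model. The tensor layout, the one-sided penalty, and the weights `beta` and `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def longpo_style_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(chosen | long context), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(rejected | long context), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(chosen | long context), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(rejected | long context), shape (B,)
    policy_short_logps: torch.Tensor,     # log pi_theta(chosen | short context), shape (B,)
    ref_short_logps: torch.Tensor,        # log pi_ref(chosen | short context), shape (B,)
    beta: float = 0.1,                    # preference temperature (DPO-style)
    lam: float = 0.1,                     # weight of the short-to-long constraint (assumed)
) -> torch.Tensor:
    # DPO-style logistic loss on the short-to-long preference pair,
    # conditioned on the long-context input.
    margin = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    preference_loss = -F.logsigmoid(margin)

    # One-sided, sequence-level proxy for the short-to-long KL constraint:
    # penalize the policy only when its short-context log-likelihood of the
    # chosen response falls below that of the frozen reference model.
    short_penalty = (ref_short_logps - policy_short_logps).clamp(min=0)

    return (preference_loss + lam * short_penalty).mean()
```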

Empirical Results and Implications

When applied to Mistral-7B-Instruct-v0.2 at context lengths extended from 128K to 512K, LongPO proved notably robust. It preserved short-context competence while significantly outperforming models fine-tuned with naïve SFT or DPO. Impressively, LongPO-trained models achieved results on long-context benchmarks on par with, or exceeding, those of larger models trained with extensive long-context annotation, such as GPT-4-128K.

The findings underscore the efficiency of LongPO in aligning LLMs across context lengths without resource-intensive human annotation. This points to promising scalability, which is increasingly important as LLMs are asked to perform complex tasks over extended contexts.

Future Prospects in AI

The approach articulated in this paper paves the way for more scalable, efficient methodologies in the development of LLMs. By emphasizing internal optimization mechanisms, LongPO provides a template for future models to self-adapt across varying lengths and complexities of data. This inherent flexibility broadens the potential applications of LLMs in real-world scenarios where data is not neatly annotated or easily categorized into predefined contexts.

As AI systems continue to evolve, methodologies like LongPO will likely play an essential role in bridging the gap between functional capability and practical application, ensuring that LLMs can seamlessly transition across diverse input lengths while retaining high fidelity and accuracy in their outputs.
