
Language Models can Self-Lengthen to Generate Long Texts (2410.23933v1)

Published 31 Oct 2024 in cs.CL

Abstract: Recent advancements in LLMs have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation stems from a training gap where pre-training lacks effective instructions for long-text generation, and post-training data primarily consists of short query-response pairs. Current approaches, such as instruction backtranslation and behavior imitation, face challenges including data quality, copyright issues, and constraints on proprietary model usage. In this paper, we introduce an innovative iterative training framework called Self-Lengthen that leverages only the intrinsic knowledge and skills of LLMs without the need for auxiliary data or proprietary models. The framework consists of two roles: the Generator and the Extender. The Generator produces the initial response, which is then split and expanded by the Extender. This process results in a new, longer response, which is used to train both the Generator and the Extender iteratively. Through this process, the models are progressively trained to handle increasingly longer responses. Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation, when applied to top open-source LLMs such as Qwen2 and LLaMA3. Our code is publicly available at https://github.com/QwenLM/Self-Lengthen.

Summary

  • The paper introduces Self-Lengthen, an iterative framework that enables LLMs to autonomously extend their text outputs.
  • It employs a dual-role methodology with a Generator and an Extender to iteratively refine and lengthen narratives, boosting output lengths up to eight times.
  • Experimental validations on open-source models like Qwen2 and LLaMA3 show enhanced text quality and adherence to length constraints without compromising general performance.

LLMs and Self-Lengthening: Enhancing Long Text Generation

Recent developments in LLMs have highlighted their proficiency in understanding long contexts, yet generating extended, coherent outputs remains a challenge. The gap is rooted in how these models are trained: pre-training covers diverse long texts but includes few explicit long-generation instructions, while post-training relies mostly on short query-response exchanges. This paper introduces "Self-Lengthen," an iterative training framework that exploits the inherent capabilities of LLMs to extend their long-text generation ability without relying on auxiliary data or proprietary models.

Overview of the Self-Lengthen Framework

The Self-Lengthen method is structured around two complementary roles: the Generator and the Extender. The Generator first drafts a response to a given query; the Extender then splits that draft and expands it into a longer response. Each iteration uses the lengthened responses to retrain both roles, progressively equipping them to handle longer narratives. This self-contained training paradigm distinguishes itself from conventional techniques such as instruction backtranslation and behavior imitation, which suffer from data-quality concerns, copyright issues, and restrictions on the use of proprietary LLMs.
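
To make the loop concrete, here is a minimal, self-contained Python sketch of one Self-Lengthen round. The model calls are replaced by toy string functions (`toy_generate`, `toy_extend`), and the prompt formats, prefix-splitting heuristic, and training-pair layout are illustrative assumptions rather than the authors' implementation; the official repository linked above contains the real code.

```python
def toy_generate(query: str) -> str:
    """Stand-in for the Generator: produce an initial, relatively short response."""
    return f"Initial response to '{query}' covering the main points briefly."


def toy_extend(query: str, prefix: str) -> str:
    """Stand-in for the Extender: continue a given prefix into a longer response."""
    return prefix + " ...an expanded continuation that adds detail and depth."


def self_lengthen_round(queries: list[str]) -> tuple[list[dict], list[dict]]:
    """One iteration: draft, split, extend, and collect training pairs for both roles.

    Generator pairs map query -> lengthened response; Extender pairs map
    (query + draft) -> lengthened response, so that each role can handle
    progressively longer outputs on the next iteration.
    """
    generator_pairs: list[dict] = []
    extender_pairs: list[dict] = []
    for query in queries:
        draft = toy_generate(query)              # Generator writes the initial response
        seed = draft[: len(draft) // 2]          # keep a prefix as the extension seed
        longer = toy_extend(query, seed)         # Extender lengthens it
        generator_pairs.append({"prompt": query, "response": longer})
        extender_pairs.append({"prompt": f"{query}\n\n{draft}", "response": longer})
    return generator_pairs, extender_pairs       # both sets would feed a fine-tuning step


if __name__ == "__main__":
    gen_data, ext_data = self_lengthen_round(["Write a travel essay about Kyoto."])
    print(gen_data[0]["response"])
```

Repeating this round with retrained models is what lets the achievable output length grow from one iteration to the next.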

Experimental Validation and Results

The paper reports comprehensive experiments on open-source LLMs, notably Qwen2 and LLaMA3, benchmarking Self-Lengthen against existing approaches. Both automatic benchmarks and human evaluations show that Self-Lengthen surpasses these methods in producing long-form text. Notably, models trained with Self-Lengthen maintain content quality while producing outputs up to eight times longer than their original limit.

These gains are supported quantitatively across several metrics. Length-following and quality scores indicate closer adherence to length constraints while maintaining relevance, coherence, and engagement in the text. Moreover, the improvements in long-form output do not come at the cost of general ability, as validated on standard benchmarks such as MMLU and AlignBench.
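
As an illustration of what a length-following score measures, the short sketch below rewards outputs whose word count lands near a requested target and penalizes deviation linearly. The exact formula and tokenization used in the paper's evaluation are not reproduced here; this is an assumed, simplified variant.

```python
def length_following_score(text: str, target_words: int) -> float:
    """Score in [0, 1]: 1.0 when the word count matches the target,
    decaying linearly as the output deviates from the requested length."""
    actual = len(text.split())
    deviation = abs(actual - target_words) / target_words
    return max(0.0, 1.0 - deviation)


if __name__ == "__main__":
    print(length_following_score("word " * 7500, 8000))  # 0.9375: close to the target
    print(length_following_score("word " * 1000, 8000))  # 0.125: far below the target
```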

Theoretical and Practical Implications

The introduction of Self-Lengthen opens new avenues in LLM applications where long-text generation is crucial, such as automated report generation, story writing, and documentation. Theoretically, this advancement emphasizes the potential of harnessing self-alignment within LLMs, underscoring a paradigm shift towards endogenous model development free from external augmentation.

Applications beyond current capabilities could include more advanced adaptations where LLMs autonomously adjust and optimize for varying task-specific constraints, a progression from simply extending text to ensuring contextually rich and varied outputs across disciplines.

Future Directions

Looking ahead, further exploration lies in optimizing the Generator and Extender modules, specifically tailoring them for different kinds of long-text domains such as legal texts or scientific documentation. Additionally, investigating the adaptability of the Self-Lengthen framework to other LLM architectures can provide insights into universal long-text generation enhancements across diverse model types.

In conclusion, Self-Lengthen presents an innovative, scalable approach to long-text generation by leveraging existing model knowledge and extending their capabilities intrinsically. It not only propels LLMs towards achieving greater autonomy in output generation but also sets a precedent for future developments focused on enhancing model functionalities without external dependencies.
