- The paper identifies that supervised fine-tuning (SFT) data caps a model's maximum output length, and that extending the output lengths in that data directly extends what the model can generate.
- The AgentWrite pipeline decomposes ultra-long text generation into sequential, coherent subtasks that existing LLMs can handle.
- The LongWriter-6k dataset and LongBench-Write benchmark together enable, and demonstrate, state-of-the-art performance in extended, high-quality text generation.
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
The paper "LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs" by Yushi Bai et al. addresses a significant limitation in the capabilities of current long context LLMs—their inability to generate extended outputs of more than 2,000 words despite being able to process inputs up to 100,000 tokens in length. This essay summarizes the key contributions, methodologies, and results of the paper, underscoring its implications for AI research and LLM applications.
Key Contributions
- Identification of Generation Length Constraints: The paper identifies that the limitation in generation length primarily stems from the characteristics of the Supervised Fine-Tuning (SFT) datasets used during training. It establishes that the maximum generation length is closely tied to the longest outputs present in these datasets.
- AgentWrite Pipeline: To tackle the constraint of output length in current models, the authors propose "AgentWrite," an agent-based pipeline designed to decompose ultra-long generation tasks into manageable subtasks. This enables existing LLMs to produce coherent outputs of over 20,000 words by sequentially crafting and integrating smaller sections of text.
- LongWriter-6k Dataset: Utilizing AgentWrite, the authors construct LongWriter-6k, a dataset of 6,000 supervised fine-tuning entries whose output lengths range from 2,000 to 32,000 words.
- Scaling Output Length: Integrating the LongWriter-6k dataset into model training extended the output length capability of models to over 10,000 words without sacrificing quality.
- LongBench-Write Benchmark: The authors develop LongBench-Write, a comprehensive benchmark suite for evaluating ultra-long generation capabilities, on which their 9B parameter model achieves state-of-the-art performance; a sketch of length-adherence scoring follows this list.
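To make the length-adherence dimension of such a benchmark concrete, here is a minimal sketch of a relative-deviation scoring rule. It is a hypothetical stand-in: LongBench-Write's actual scoring function is defined in the paper and may differ from this simple form.

```python
def length_score(required_words: int, generated_words: int) -> float:
    """Hypothetical length-adherence score in [0, 100].

    Scores 100 when the output matches the required length exactly and
    decays linearly with relative deviation. LongBench-Write's actual
    scoring rule may differ from this sketch.
    """
    deviation = abs(generated_words - required_words) / required_words
    return 100.0 * max(0.0, 1.0 - deviation)

# Example: a 9,000-word response to a 10,000-word request scores 90.
print(length_score(10_000, 9_000))  # 90.0
```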
Methodology
Controlled Experiments
The initial experiments showed that a model's maximum output length closely tracks the longest outputs present in its SFT data. By capping or extending the output lengths in otherwise identical SFT sets and measuring the resulting models' maximum outputs, the authors confirmed that the length of training outputs directly determines the ceiling on what the model can generate.
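A minimal sketch of this kind of controlled experiment appears below. The `finetune` and `generate` helpers, along with the data variables, are hypothetical placeholders for whatever training and inference stack is in use, not an API from the paper.

```python
# Cap SFT output lengths at different thresholds, fine-tune on each
# variant, and measure the longest outputs each model produces.
# `finetune`, `generate`, `base_model`, `sft_data`, and
# `long_writing_prompts` are hypothetical placeholders.

def truncate_outputs(sft_data, max_words):
    """Return a copy of the SFT data with every output capped at max_words."""
    return [
        {"prompt": ex["prompt"],
         "output": " ".join(ex["output"].split()[:max_words])}
        for ex in sft_data
    ]

for cap in [500, 1_000, 2_000, 4_000]:
    model = finetune(base_model, truncate_outputs(sft_data, cap))
    lengths = [len(generate(model, p).split()) for p in long_writing_prompts]
    print(f"SFT cap {cap:>5} words -> longest generation {max(lengths)} words")
```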
AgentWrite Pipeline
AgentWrite operates in two steps:
- Planning: Given a writing instruction, the model first drafts a detailed writing plan that divides the task into smaller subtasks, each specifying a paragraph's main point and target word count.
- Sequential Generation: The model generates content sequentially for each subtask, ensuring coherence by integrating previously generated paragraphs into the context for subsequent sections.
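The following sketch shows this plan-then-write loop in minimal form. It assumes a generic `llm(prompt) -> str` completion function; the prompts and plan parsing are illustrative, not the paper's actual implementation.

```python
def agent_write(instruction: str, llm) -> str:
    """Sketch of an AgentWrite-style plan-then-write loop.

    `llm` is a hypothetical prompt-to-completion callable; the paper's
    prompts and plan parsing are more elaborate than shown here.
    """
    # Step 1: plan -- one line per paragraph, each stating the
    # paragraph's main point and a target word count.
    plan = llm(
        "Break the following writing task into numbered paragraphs, "
        "each with a main point and a word count:\n" + instruction
    )

    # Step 2: write paragraphs sequentially, feeding previously written
    # text back into the context so each section stays coherent.
    written = []
    for step in plan.splitlines():
        if not step.strip():
            continue
        paragraph = llm(
            f"Task: {instruction}\nPlan: {plan}\n"
            f"Text written so far:\n{''.join(written)}\n"
            f"Now write only the paragraph for this step: {step}"
        )
        written.append(paragraph + "\n\n")
    return "".join(written)
```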
Model Training
Leveraging the LongWriter-6k dataset alongside general SFT data, the authors trained two models, LongWriter-8B and LongWriter-9B (built on Llama-3.1-8B and GLM-4-9B, respectively). They further refined LongWriter-9B with Direct Preference Optimization (DPO) to improve output quality and adherence to length requirements.
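For reference, DPO optimizes the standard preference objective of Rafailov et al. over paired responses. The sketch below shows that generic loss computed from per-sequence log-probabilities; it is the textbook objective, not LongWriter's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from per-sequence log-probabilities.

    Each argument is a tensor of summed token log-probs for the preferred
    ("chosen") or dispreferred ("rejected") response under the trainable
    policy or the frozen reference model. Generic objective only; not
    LongWriter's actual training code.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy call with fabricated log-probs, just to show the interface.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-10.5]), torch.tensor([-11.5]))
```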
Results
The paper presents several strong numerical results:
- Output Length Capability: Models trained with LongWriter-6k generate coherent texts exceeding 10,000 words, reaching up to approximately 20,000 words.
- Benchmark Performance: LongWriter models achieve higher scores in both length adherence and content quality on LongBench-Write compared to prior models, substantiating the advantage of incorporating long-output datasets.
- Likelihood Analysis: A cumulative average negative log-likelihood analysis showed that LongWriter models maintain low loss deep into their outputs, reinforcing the quality and coherence of the extended texts; a computation sketch follows this list.
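A minimal sketch of how such a cumulative average NLL curve can be computed is shown below, assuming a Hugging Face-style causal LM interface; the paper's exact evaluation setup (tokenization, context handling, aggregation) may differ.

```python
import torch

def cumulative_avg_nll(model, tokenizer, text, device="cpu"):
    """Cumulative mean negative log-likelihood at each output position.

    A flat or slowly rising curve suggests the model stays confident deep
    into a long text. Sketch only, assuming a Hugging Face-style causal
    LM; the paper's evaluation setup may differ.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab)
    # Token-level NLL: position t predicts token t+1.
    nll = torch.nn.functional.cross_entropy(
        logits[0, :-1], ids[0, 1:], reduction="none"
    )
    # Running mean of the NLL up to each position.
    return nll.cumsum(0) / torch.arange(1, nll.numel() + 1, device=nll.device)
```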
Implications and Future Directions
The research demonstrates an effective strategy to harness existing LLMs' potential for longer outputs by merely altering the nature of supervised fine-tuning data. This innovation holds several practical and theoretical implications:
- Practical Applications: The ability to generate extended, coherent texts opens up new possibilities in domains requiring detailed content generation, such as academic research, creative writing, and comprehensive reporting.
- Theoretical Insights: The findings underscore the importance of output data characteristics in training datasets, suggesting further exploration into diverse and high-quality long-output datasets.
- Future Developments: Expanding the dataset creation methods to achieve even longer outputs (beyond 20,000 words) and refining the AgentWrite pipeline could further enhance the models' capabilities. Additionally, improving inference efficiency while maintaining quality in ultra-long text generation remains a crucial area for future research.
Conclusion
In summary, the paper "LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs" presents an innovative approach to overcoming the output-length limitations of existing long context LLMs through novel data construction and supervised fine-tuning strategies. The results show marked improvements in both the length and quality of generated content, opening avenues for future advances in long-form text generation.