LongWriter-Zero: RL for Ultra-Long Text Generation
- LongWriter-Zero is a reinforcement learning-based framework that generates ultra-long, coherent texts by directly incentivizing global writing quality.
- It utilizes Group Relative Policy Optimization and specialized reward models to optimize text planning, factual accuracy, and structural formatting.
- Performance benchmarks show it outperforms SFT methods, achieving state-of-the-art results on extensive long-form generation tasks using a Qwen2.5-32B base model.
LongWriter-Zero is a reinforcement learning–based framework developed to master ultra-long, high-quality text generation using LLMs, addressing the inherent limitations of supervised fine-tuning approaches that rely on synthetic demonstration data. It represents a paradigm shift in training LLMs for tasks that demand extended, coherent, and structurally robust outputs—such as comprehensive reports, long-form stories, or educational materials—by directly incentivizing desired global writing properties through learned reward models, rather than maximum likelihood estimation on reference outputs.
1. Motivation and Theoretical Foundations
The central challenge in ultra-long text generation is maintaining high output quality (coherence, factual accuracy, global consistency) as the output length increases, often into tens of thousands of tokens. Existing methods predominantly employ supervised fine-tuning (SFT) over synthetic long-form datasets generated by agentic pipelines or teacher LLMs, exemplified by works such as LongWriter and Suri. However, SFT-based methods face two critical problems:
- Quality Ceiling: Synthetic data quality is inherently capped by the teacher models, and generated outputs tend toward monotony and lack diversity.
- Objective Alignment: The SFT objective (maximum likelihood) does not provide any incentive for planning, reasoning, or optimizing the global document structure.
LongWriter-Zero adopts an RL-only paradigm, completely eliminating dependence on annotated or synthetic demonstration data. Instead, it leverages reward models (RMs) to directly incentivize the emergence of desirable properties—output length control, structural formatting, planning, and overall writing quality—thus overcoming the fundamental bottlenecks of SFT.
2. Reinforcement Learning Framework
Group Relative Policy Optimization (GRPO)
LongWriter-Zero employs Group Relative Policy Optimization (GRPO), a PPO-derived RL algorithm that replaces the learned value baseline of standard PPO with a group-relative baseline, computing each sample's advantage against the other samples drawn for the same prompt:
- For a group of $G$ outputs $\{o_1, \dots, o_G\}$ sampled for a prompt $q$, each output $o_i$ is scored by the reward models to obtain a scalar reward $r_i$.
- The group-relative advantage for the $i$-th output is computed as
  $$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}.$$
- The clipped policy-gradient objective for a batch is
  $$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$
  where $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio (a minimal sketch of this update follows the list).
- In practice, the KL penalty coefficient can be set to zero (as in DAPO), relying on continual pretraining and reward models to maintain output diversity and reference policy alignment.
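The following minimal PyTorch sketch illustrates the group-relative update described above. Tensor shapes, the helper name `grpo_loss`, and the clip-range default are illustrative assumptions rather than the paper's exact implementation; the KL term is omitted, matching the zero-coefficient setting noted above.

```python
# Minimal sketch of a GRPO-style update for one prompt, assuming per-token
# log-probs have already been gathered for G sampled outputs. Shapes, the
# helper name, and eps are assumptions, not the paper's exact settings.
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, eps=0.2):
    """
    logp_new, logp_old: (G, T) per-token log-probs under current / rollout policy
    rewards:            (G,)   scalar reward per sampled output (from the RMs)
    mask:               (G, T) 1 for real tokens, 0 for padding
    """
    # Group-relative advantage: normalize each reward against its own group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    adv = adv.unsqueeze(-1)                                     # broadcast over tokens

    # PPO-style clipped importance ratio, applied token-wise.
    ratio = torch.exp(logp_new - logp_old)                      # (G, T)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv) * mask

    # Mean over valid tokens, then over the group; negate for gradient descent.
    # No KL penalty term, mirroring the zero-coefficient setting described above.
    return -(surrogate.sum(-1) / mask.sum(-1).clamp(min=1)).mean()

# Toy usage with random tensors (G = 4 outputs, T = 8 tokens each).
G, T = 4, 8
logp_old = torch.randn(G, T)
logp_new = (logp_old + 0.01 * torch.randn(G, T)).requires_grad_()
loss = grpo_loss(logp_new, logp_old, rewards=torch.rand(G), mask=torch.ones(G, T))
loss.backward()
```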
Reward Model Design
Three specialized RMs are trained to provide reward signals on key dimensions:
- Length RM ($\mathrm{RM}_{\mathrm{length}}$): Enforces adherence to a desired output-length range $[L_{\min}, L_{\max}]$ predicted for each prompt. Samples outside the interval are penalized proportionally to their deviation, e.g.
  $$R_{\mathrm{length}}(y) = 1 - \lambda \cdot \operatorname{dist}\!\big(|y|,\,[L_{\min}, L_{\max}]\big),$$
  where $|y|$ is the output length and $\operatorname{dist}(\cdot)$ is the distance from the target interval.
- Writing Quality RM ($\mathrm{RM}_{\mathrm{quality}}$): Evaluates overall writing quality. Trained via the Bradley-Terry loss on human preference data:
  $$\mathcal{L}_{\mathrm{BT}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right],$$
  where $y_w$ is the preferred and $y_l$ the dispreferred output for prompt $x$.
- Format RM: Rewards outputs that adhere to the <think>…</think><answer>…</answer> structure, penalizing lapses and repetition.
The normalized advantages from these RMs are combined by averaging the length, quality, and format advantages ($\hat{A}^{\mathrm{len}}$, $\hat{A}^{\mathrm{qual}}$, $\hat{A}^{\mathrm{fmt}}$) into a balanced composite reward signal, as sketched below.
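As a concrete illustration of this composite signal, the sketch below scores a group of samples with a length reward, a quality RM, and a format check, group-normalizes each, and averages them. The linear length-penalty shape, the regex, and the `quality_rm` stand-in scorer are assumptions; only the overall recipe (score, normalize, average) follows the description above.

```python
# Hedged sketch of combining the three reward signals into per-sample advantages.
import re
import numpy as np

def length_reward(n_tokens, lo, hi):
    # Full reward inside [lo, hi]; linear penalty proportional to the deviation outside it.
    if n_tokens < lo:
        return max(0.0, 1.0 - (lo - n_tokens) / lo)
    if n_tokens > hi:
        return max(0.0, 1.0 - (n_tokens - hi) / hi)
    return 1.0

def format_reward(text):
    # 1.0 if the output follows the <think>...</think><answer>...</answer> template.
    pattern = r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text) else 0.0

def composite_advantages(samples, quality_rm, lo, hi):
    """samples: list of generated texts for one prompt; quality_rm: text -> float."""
    def normalize(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-6)

    a_len = normalize([length_reward(len(s.split()), lo, hi) for s in samples])  # crude token proxy
    a_qual = normalize([quality_rm(s) for s in samples])
    a_fmt = normalize([format_reward(s) for s in samples])
    return (a_len + a_qual + a_fmt) / 3.0   # balanced average of the three advantages
```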
3. Model Architecture and Pretraining
The system is built upon the Qwen2.5-32B base model, which natively supports long-context inference (up to 32,000 tokens). Before reinforcement learning, a continual-pretraining phase is performed on 30B tokens of diverse, high-quality long-form text (books, reports, fiction, and nonfiction), of which roughly 1% are "long chain-of-thought" samples. This continual pretraining strengthens planning ability and general writing skill, which is critical both for stabilizing RL and for enabling the "explicit thinking" strategy: outputs are split into a reasoning segment and an answer segment, which encourages global coherence and topic adherence.
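A minimal sketch of this explicit-thinking convention is shown below; the instruction wording and the helper name `split_think_answer` are assumptions, and only the <think>/<answer> split itself comes from the paper.

```python
# Minimal sketch of the "explicit thinking" output convention: the model first
# plans inside <think>...</think>, then writes the deliverable inside <answer>...</answer>.
import re

THINK_INSTRUCTION = (
    "First draft an outline of the piece inside <think>...</think>, "
    "then write the full text inside <answer>...</answer>."
)  # illustrative wording, not the paper's exact prompt

def split_think_answer(generation: str):
    """Return (reasoning, answer) segments, or (None, None) if the format is violated."""
    m = re.search(r"(?s)<think>(.*?)</think>\s*<answer>(.*?)</answer>", generation)
    if m is None:
        return None, None
    return m.group(1).strip(), m.group(2).strip()

reasoning, answer = split_think_answer(
    "<think>1. Hook  2. Three body sections  3. Conclusion</think>"
    "<answer>Chapter 1 ...</answer>"
)
```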
No architectural modifications (e.g., changes to attention or transformer blocks) are made; performance gains arise from data composition, planning-centric outputs, and incentive structures.
4. Experimental Setup and Evaluation
Training Regimen
- Hardware: Distributed training on 8 nodes × 8 H800 GPUs with Megatron, utilizing 32 concurrent prompts per batch.
- RL Sampling: Stochastic sampling (top-p = 1.0, temperature = 0.8) supports response diversity.
- Optimization Hyperparameters: PPO-style ratio clipping with clip parameter $\epsilon$; no explicit KL penalty (an illustrative rollout-configuration sketch follows this list).
- Queries: Prompts are drawn from large, real-world datasets (WildChat-1M and LMSYS-Chat-1M), filtered to focus on writing-centric, long-form scenarios via classifier LLMs.
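For orientation, the sampling setup above corresponds roughly to the following rollout configuration; the field names follow common RLHF-tooling conventions and are assumptions, while the numeric values come from the text.

```python
# Illustrative rollout configuration for RL sampling (names are assumptions).
rollout_config = {
    "prompts_per_step": 32,   # concurrent prompts per RL batch
    "temperature": 0.8,       # stochastic sampling for response diversity
    "top_p": 1.0,             # no nucleus truncation
    "kl_penalty_coef": 0.0,   # no explicit KL term, as in DAPO
}
```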
Benchmarks and Metrics
- WritingBench: 1,200 real-world, long-form prompts spanning six domains and three requirement types, scored by an LLM critic (Qwen2.5-7B) on style, format, length, and other writing criteria.
- Arena-Write: 100 prompts with pairwise LLM-based (Qwen2.5-72B) Elo ranking across top SFT and RL baselines (including DeepSeek-R1 and Qwen3-235B); a minimal Elo-update sketch follows this list.
- Human-in-the-loop Evaluation: 200 prompts with both human and LLM preference voting.
- Ablations: Assessed the contribution of explicit “think” prompts and continual pretraining; both are found essential to the model’s performance.
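As background on the Arena-Write metric, the sketch below shows how an Elo ranking can be derived from pairwise judge verdicts; the K-factor, starting rating, and verdict encoding are common defaults, not values from the paper.

```python
# Hedged sketch of Elo updates from pairwise LLM-judge verdicts.
def update_elo(r_a, r_b, winner, k=32):
    """Update ratings for models A and B after one judged comparison ('A', 'B', or 'tie')."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Toy usage: run a few judged comparisons between two models.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
for verdict in ["A", "A", "tie", "B"]:          # judge verdicts for model_x vs model_y
    ratings["model_x"], ratings["model_y"] = update_elo(
        ratings["model_x"], ratings["model_y"], verdict
    )
```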
5. Results and Analysis
- Benchmark Performance: LongWriter-Zero achieves an average critic score of 8.69 on WritingBench and 1447 Elo on Arena-Write, surpassing DeepSeek-R1 (1343 Elo), Qwen3-235B (1343 Elo), and other SFT/RLHF baselines. Human and LLM preference win rates reach up to 98% against competitive 100B+ models.
- Qualitative Gains: RL-trained models exhibit superior document planning, topic adherence, logical progression, and markedly less repetition across the document. The explicit "<think>…</think>" structure improves document organization and factual consistency in ultra-long outputs.
- Ablation Findings: Removing explicit “think” steps or continual pretraining reduces Elo by hundreds of points, confirming their critical contributions.
- SFT vs RL: SFT models plateau at roughly 1,000 Elo, whereas pure RL training enables the model to reach 1,447 Elo, demonstrating superior ability to globally optimize for coherence, structure, and length control.
6. Implications, Limitations, and Future Work
LongWriter-Zero establishes the first RL-only approach—eschewing SFT—capable of achieving state-of-the-art ultra-long text generation. Major implications include:
- RL as a Viable Alternative for Ultra-Long Tasks: Explicit reward modeling can guide LLMs to master skills that are challenging to incentivize through maximum likelihood learning alone.
- Overcoming SFT Data Bottlenecks: The approach does not require expensive synthetic long-form datasets or agentic data production pipelines.
- Emergence of Planning: RL can foster planning and multi-step reasoning without explicit demonstration, especially when paired with prompt engineering (the <think>… structure).
Notable limitations include susceptibility to reward model exploits such as superficial length padding or keyword stuffing; the need for robust, discourse-aware reward models remains. Future directions suggested in the paper include adversarial RM training, human-in-the-loop RL, and refinement of evaluation protocols to further align RL incentives with actual human writing preferences.
7. Open Source Release and Community Impact
The complete LongWriter-Zero-32B model, reward models, training code, and major components of the underlying RL data are released on HuggingFace under a permissive license.
This enables the research and development community to directly reproduce results, further experiment with RL-based long-form generation, and apply these methods to new domains and model architectures without dependency on large, curated synthetic datasets.
| Dimension | LongWriter-Zero |
| --- | --- |
| Approach | RL-only (no SFT, no synthetic data), multi-RM incentivization |
| Base Model | Qwen2.5-32B, with continual pretraining |
| Explicit Structure | <think> reasoning + <answer> output; planning-centric |
| Benchmarks (Best Score) | 8.69 (WritingBench), 1447 Elo (Arena-Write) |
| Outperforms | DeepSeek-R1 (100B+), Qwen3-235B, strong SFT/RLHF baselines |
| Released Artifacts | Model, reward models, data, code on HuggingFace |
LongWriter-Zero demonstrates that RL, equipped with explicit, compositionally balanced reward models and planning-facilitating interventions, provides a powerful toolkit for advancing the quality, structure, and controllability of ultra-long LLM generation systems.