LongWriter-Zero Model: RL for Ultra-Long Text

Updated 30 June 2025
  • LongWriter-Zero is an RL-only LLM that generates ultra-long outputs without relying on supervised fine-tuning or synthetic datasets.
  • It leverages Group Relative Policy Optimization and composite reward models to balance text length, quality, and structural coherence.
  • Benchmark results on WritingBench and Arena-Write Elo highlight its superiority in producing coherent, extended texts for technical and academic applications.

The LongWriter-Zero model is an LLM framework designed to master ultra-long text generation, producing outputs that span thousands to tens of thousands of words, entirely through reinforcement learning (RL), with no reliance on supervised fine-tuning (SFT) or synthetic long-form datasets. Developed atop Qwen2.5-32B, LongWriter-Zero innovates in both methodology and reward modeling to overcome the persistent challenges of long-context output, including quality degradation, length adherence, and structural coherence as sequences scale. The model sets new benchmarks on ultra-long writing tasks, establishing the RL-only approach as a compelling alternative to SFT-based and synthetic-data-driven paradigms.

1. Model Architecture and RL Training Paradigm

LongWriter-Zero is based on the Qwen2.5-32B architecture, a transformer LLM, which first undergoes continual pretraining on 30 billion writing-intensive tokens (books, reports, academic texts, and distilled chain-of-thought data) in English and Chinese. Distinctly, LongWriter-Zero eschews any SFT or curated synthetic SFT data for ultra-long output, employing RL as the sole driver for the emergence of long-form capabilities. This approach is motivated by the limitations of SFT-based long writing: synthetic samples tend to be monotonous, artificial, and expensive to produce at sufficient scale.

The RL framework is built upon Group Relative Policy Optimization (GRPO), an extension of Proximal Policy Optimization (PPO) tailored for large-scale, open-ended text generation. For each prompt, multiple candidate completions (group size G) are sampled. A composite reward model, addressing length, holistic writing quality, and structure, scores each candidate. The GRPO advantage for each candidate is normalized within the group:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$

The policy is then updated to maximize the expected normalized advantage across prompts and candidate completions, subject to PPO-style clipping and (optionally) KL regularization:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q, \{o_i\}}\left[ \frac{1}{G}\sum_{i=1}^G \min\left( r_i^{\text{ratio}} A_i,\ \mathrm{clip}\left(r_i^{\text{ratio}}, 1-\varepsilon, 1+\varepsilon\right) A_i \right) - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$

with $r_i^{\text{ratio}} = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$.
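To make the group-relative update concrete, here is a minimal sketch (illustrative, not the authors' implementation) of the two steps above for a single prompt's G samples: group-normalizing the composite rewards and evaluating the clipped surrogate. The KL penalty is omitted for brevity, and the log-probability values are placeholders.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize composite rewards within one group of G sampled completions."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped GRPO surrogate for one group (KL regularization omitted here).

    logp_new / logp_old: per-completion sequence log-probabilities under the
    current and behavior policies; advantages: group-normalized advantages.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return np.minimum(unclipped, clipped).mean()

# Example: G = 4 completions for one prompt, scored by the composite reward.
rewards = [0.9, 0.4, 0.7, 0.2]
adv = group_normalized_advantages(rewards)
objective = grpo_surrogate(
    logp_new=[-120.3, -98.7, -110.2, -101.5],   # placeholder values
    logp_old=[-121.0, -99.0, -110.0, -102.0],
    advantages=adv,
)
print(adv, objective)
```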

Instructional prompts for RL are harvested from real-world chat corpora (WildChat-1M, LMSYS-Chat-1M) and filtered for complexity and length, with explicit support for chain-of-thought planning via special <think> ... </think><answer> ... </answer> structures.
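As an illustration only (the exact template and checking logic are assumptions, not taken verbatim from the paper), a completion with explicit planning might look like the following, and a format check of this kind is the sort of signal the format reward can build on:

```python
import re

# Assumed output template: a planning span followed by the final answer.
completion = (
    "<think>Outline: 1) market context, 2) competitor analysis, 3) forecast.</think>"
    "<answer>Section 1: Market context...\n\nSection 2: Competitor analysis...</answer>"
)

# Simple structural check: one <think> block, then one <answer> block.
pattern = re.compile(r"^<think>.*?</think><answer>.*?</answer>$", re.DOTALL)
print(bool(pattern.match(completion)))  # True if the template is respected
```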

2. Reward Modeling and Ultra-Long Output Control

The reward signal for RL is a composite of three independently trained reward models (RMs):

  1. Length Reward Model: Penalizes outputs falling outside a prompt-appropriate length band, as predicted by a length estimator (QwQ-32B).

$$r_{\mathrm{length}}(o) = \begin{cases} 1, & L_{\text{lower}} \leq \mathrm{len}(o) \leq L_{\text{upper}} \\ \dfrac{\mathrm{len}(o)}{L_{\text{lower}}}, & \mathrm{len}(o) < L_{\text{lower}} \\ \dfrac{L_{\max}-\mathrm{len}(o)}{L_{\max} - L_{\text{upper}}}, & \mathrm{len}(o) > L_{\text{upper}} \end{cases}$$

  2. Writing Reward Model: Implements a preference-based reward (Bradley-Terry loss) on pairwise comparisons judged for fluency, coherence, and relevance, using Qwen2.5-72B as the backbone (a sketch of this pairwise loss follows the list).

$$\mathcal{L} = -\mathbb{E}_{(x, y_w, y_l)\sim D} \left[ \log\sigma\left( r_{\text{write}}(x, y_w) - r_{\text{write}}(x, y_l) \right) \right]$$

  3. Format Reward Model: Enforces required output structure (e.g., clear separation of plan and answer, <think>...</think><answer>...</answer>), and penalizes semantic redundancy to prevent length inflation via repetition.
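As referenced in item 2, a minimal PyTorch sketch of the Bradley-Terry pairwise objective for the writing reward model follows; the scalar scores would come from a reward head on the Qwen2.5-72B backbone, which is abstracted away here, and all names and values are illustrative.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_chosen: torch.Tensor,
                       score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    score_chosen / score_rejected: scalar reward-model scores for the preferred
    and dispreferred completions of the same prompt, shape (batch,).
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example with dummy scores standing in for reward-head outputs.
chosen = torch.tensor([1.3, 0.2, 0.9])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(bradley_terry_loss(chosen, rejected))  # smaller when chosen > rejected
```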

These are combined not by naively averaging raw scores but by normalizing each reward model's advantage within the group and then averaging the normalized advantages (as in the sketch below):

$$A_{\text{final}} = \frac{1}{3}\left(A_{\text{length}} + A_{\text{write}} + A_{\text{format}}\right)$$

This ensures stability and balances the tradeoff between length precision and writing quality during RL.
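The following numpy sketch mirrors the piecewise length reward and the averaging of per-RM normalized advantages for one group of G completions; the length thresholds and reward values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def length_reward(length, l_lower, l_upper, l_max):
    """Piecewise length reward: 1 inside the target band, decaying outside it."""
    if l_lower <= length <= l_upper:
        return 1.0
    if length < l_lower:
        return length / l_lower
    # Over-length case; clamp at 0 once the output exceeds l_max.
    return max((l_max - length) / (l_max - l_upper), 0.0)

def normalize(r, eps=1e-8):
    """Group-normalize one reward model's scores."""
    r = np.asarray(r, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One group of G = 4 completions: per-RM raw rewards (illustrative values).
r_length = [length_reward(n, 3000, 5000, 8000) for n in (4200, 1500, 5200, 7600)]
r_write  = [0.71, 0.35, 0.66, 0.12]   # writing RM scores
r_format = [1.0, 1.0, 0.5, 0.0]       # format / redundancy RM scores

# Normalize each RM's rewards within the group, then average the advantages.
a_final = (normalize(r_length) + normalize(r_write) + normalize(r_format)) / 3.0
print(np.round(a_final, 3))
```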

3. Performance Benchmarks and Comparative Evaluation

LongWriter-Zero exhibits state-of-the-art results on both human and automated long-form writing benchmarks:

  • WritingBench (comprehensive 1,200-prompt benchmark, scored by a specialized critic model):
    • LongWriter-Zero: 8.69, DeepSeek-R1: 8.55, Qwen3-235B-A22B: 8.68, GPT-4o: 8.16, LongWriter (SFT): 7.91.
  • Arena-Write Elo (comparison with models of up to 235B parameters):
    • LongWriter-Zero: 1447 Elo (highest), DeepSeek-R1: 1343, Qwen3-235B-A22B: 1343, Qwen2.5-Max: 1029, LongWriter (SFT): 457.
  • Human and LLM-Judged Win-Rates: On head-to-head evaluation against Qwen3-235B-A22B, DeepSeek-V3, Llama-4, Claude, and Gemini, LongWriter-Zero achieves over 98% win-rate (LLM-critic), and surpasses 62% human win-rate in direct pairwise comparisons.

Further, the RL-based training paradigm consistently outperforms SFT (even when SFT is enriched with continual pretraining), indicating that RL unlocks a higher performance ceiling for both quality and controllable length.

4. Structural and Methodological Innovations

Several innovations distinguish LongWriter-Zero in the landscape of long text generation:

  • RL-Only Paradigm: Uniquely, LongWriter-Zero achieves emergent ultra-long output—producing globally coherent, structurally diverse, and high-quality writing—entirely via reinforcement learning from unannotated instructions, addressing data bottlenecks and removing dependency on expensive, synthetic SFT corpora.
  • Composite, Balanced Reward Design: The aggregation of length, writing quality, and structure as separate RMs, with advantage normalization, enables RL to target nuanced, multi-faceted objectives critical for ultra-long coherence and adherence to specification.
  • Chain-of-Thought Planning Support: Prompts and format rewards explicitly encourage planning (<think>...</think>) and structured answering, strengthening logical progression and reducing topic drift across extended outputs.
  • Continual Pretraining Synergy: Writing-heavy continual pretraining prior to RL positions the model for greater "structure sense," foundational for RL to effectively learn open-ended, long-form compositionality.

5. Addressing Challenges in Ultra-Long Text Generation

LongWriter-Zero’s technical design directly addresses typical pathologies in extended output:

  • Coherence and Consistency: Rewarding both format and planning ensures outputs do not degrade into repetitiveness or incoherence as sequence length increases, with explicit penalties for redundancy at the semantic level.
  • Bias and Reward Hacking: Composite rewards mitigate known RL pathologies such as "length hacking" (padding with repetition) and gaming of writing-quality metrics by keyword injection. Nonetheless, the model's creators note that reward hacking may still occur, motivating future work on adversarially robust evaluation.
  • Data and Model Scalability: RL-only training removes the need for scaling synthetic long-form SFT sets, allowing generalization to diverse domains and complex real-world instruction distributions.

6. Applications, Limitations, and Outlook

LongWriter-Zero is suited for automated generation of technical reports, academic papers, legal documentation, business proposals, multi-section educational materials, and agent-driven collaborative analyses, especially where length and structure control are paramount. It is applicable to both English and Chinese, having been trained on bilingual data.

A plausible implication is that, as composite RL reward models become more discourse-aware and robust to exploitation, and as agentic planning capabilities improve, RL-only LLMs like LongWriter-Zero could displace SFT- and synthetic-data-driven systems for open-ended, ultra-long content production in both enterprise and research applications.

Limitations include susceptibility to sophisticated reward hacking, potential over-optimization of specific reward model features, and the need for further research into discourse-level evaluators. Future work may center on:

  • Adversarial or uncertainty-aware reward models,
  • Expanded, diverse instruction prompts,
  • Deeper integration with agent-environment interaction for dynamic multi-stage composition.

7. Summary Table: Core Components and Benchmarking

| Component | Role/Function | Implementation |
| --- | --- | --- |
| Base Model | Transformer LLM (Qwen2.5-32B) | Continual pretraining on 30B tokens |
| RL Algorithm | Group Relative Policy Optimization (GRPO) | Batch-grouped, normalized advantage |
| Length RM | Length adherence, per-prompt window | QwQ-32B length predictor |
| Writing Quality RM | Holistic preference, pairwise comparison | Qwen2.5-72B, Bradley-Terry loss |
| Format RM | Structure enforcement, redundancy penalization | Semantic overlap, token structure checks |
| Output Performance | WritingBench: 8.69 / Arena-Write: 1447 Elo | Exceeds SFT and 100B+ model baselines |

LongWriter-Zero is open-sourced at https://huggingface.co/THU-KEG/LongWriter-Zero-32B, providing resources for further research and practical application in ultra-long text generation.