
Ultra-Long Output RL: Advancing Text Generation

Updated 10 January 2026
  • UloRL is a family of reinforcement learning methods designed to extend language model outputs to tens or hundreds of thousands of tokens using segmentation and reward model innovations.
  • Key techniques include segmented rollouts, dynamic curriculum scheduling, and entropy management to overcome sequence bottlenecks and quality degradation.
  • Empirical results demonstrate enhanced throughput, reasoning accuracy, and structural fidelity, making UloRL a promising approach for long-form writing and verifiable reasoning tasks.

Ultra-Long Output Reinforcement Learning (UloRL) encompasses a family of reinforcement learning (RL) methodologies engineered to extend the effective output length of LLMs, with controllable structure and superior quality at scales far surpassing conventional maximum generation limits. UloRL addresses acute limitations in sequence generation, including length boundaries, entropy collapse, and quality degradation—enabling models to generate and reason over ultra-long outputs ranging from tens to hundreds of thousands of tokens. Contrasted with supervised fine-tuning (SFT) on synthetic data, UloRL frameworks leverage intricate reward models, dynamic training curricula, and architectural innovations such as segmented rollouts and entropy regularization to optimize policy learning for text generation and reasoning tasks. State-of-the-art implementations—LongWriter-Zero (Wu et al., 23 Jun 2025), Writing-RL (Lei et al., 6 Jun 2025), and UloRL (Du et al., 26 Jul 2025)—demonstrate advantages on long-form writing benchmarks, verifiable reasoning, and throughput, outperforming even larger proprietary baselines.

1. Formal Problem Definition and Challenges

UloRL formulates ultra-long text generation as a single-episode Markov Decision Process (MDP) with a fixed-length horizon $T$ representing the output sequence. Given a prompt $w$ sampled from a distribution $\mathcal{D}$, the policy $\pi_\theta$ generates a sequence $\tau = (a_1, \ldots, a_T)$, with state $s_t$ comprising $(w, a_1, \ldots, a_{t-1})$ and actions $a_t$ drawn from the vocabulary $\mathcal{V}$ (Lei et al., 6 Jun 2025). The terminal scalar reward $R(\tau; w)$ is assigned after complete generation. Critical obstacles in UloRL settings include:

  • Long-Tail Sequence Bottlenecks: RL pipelines become inefficient as ultra-long target generations ($L_{\mathrm{max}} \geq 128\mathrm{K}$ tokens) induce batch stalls due to a small fraction of very long outputs (Du et al., 26 Jul 2025).
  • Entropy Collapse: Continued training on highly mastered tokens induces policy entropy freeze, restricting exploration and harming diversity (Du et al., 26 Jul 2025).

These constraints render vanilla RL impractical for ultra-long output settings, demanding specialized sampling strategies and entropy management.
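
As a concrete illustration of the terminal-reward formulation above, the following minimal sketch rolls out a single episode and assigns one scalar reward after generation completes; `policy_fn` and `reward_fn` are hypothetical placeholder callables, not interfaces from the cited papers.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    prompt: str                                   # w ~ D
    actions: list = field(default_factory=list)   # tau = (a_1, ..., a_T)
    reward: float = 0.0                           # terminal scalar R(tau; w)

def rollout(policy_fn, reward_fn, prompt, max_tokens, eos="</s>"):
    """Single-episode rollout: the state s_t is the prompt plus all tokens
    generated so far, and the only reward arrives after generation ends."""
    ep = Episode(prompt=prompt)
    state = prompt
    for _ in range(max_tokens):            # fixed-length horizon T
        token = policy_fn(state)           # a_t ~ pi_theta(. | s_t)
        ep.actions.append(token)
        state += token                     # s_{t+1} = (w, a_1, ..., a_t)
        if token == eos:
            break
    ep.reward = reward_fn(prompt, "".join(ep.actions))   # R(tau; w)
    return ep
```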

2. Core Methodologies: Segmented Rollouts and Curriculum Design

UloRL frameworks introduce segmentation and adaptive curricula to optimize RL throughput and curricular progress:

  • Segmented Rollouts (Editor's term): Output sequences are partitioned into $M$ equal-length segments (e.g., $\ell = L_{\mathrm{max}} / M$, with $L_{\mathrm{max}} = 128\mathrm{K}$ and $M = 8$), avoiding batch stalls and allowing synchronous processing. Unfinished samples are advanced progressively through segments, with completed episodes added to an experience pool (Du et al., 26 Jul 2025); see the sketch after this list. Empirical results show a 2× speedup over conventional full-length rollouts.
  • Dynamic Curriculum Scheduling: In Writing-RL, difficulty adapts dynamically per sample via reference scheduling—the model is advanced to stronger references only upon “beating” the current one (Lei et al., 6 Jun 2025). Margin-aware data selection prioritizes training samples with high learning headroom; a sketch of the scheduling rule appears at the end of this section.
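
A minimal sketch of the segmented-rollout idea, assuming a generic `generate_fn(prefix, max_new_tokens)` placeholder that returns newly generated text plus a completion flag; the names and interface are illustrative, not the cited paper's API.

```python
def segmented_rollout_step(generate_fn, pending, seg_len, experience_pool):
    """Advance every pending sample by at most `seg_len` tokens so that no
    single ultra-long generation stalls the batch; finished episodes move to
    the experience pool, unfinished ones carry over to the next segment."""
    still_pending = []
    for sample in pending:
        new_text, done = generate_fn(sample["prefix"], max_new_tokens=seg_len)
        sample["prefix"] += new_text
        if done:
            experience_pool.append(sample)
        else:
            still_pending.append(sample)
    return still_pending

# Usage sketch: L_max = 128K split into M = 8 segments of 16K tokens each.
# pending, pool = [{"prefix": p} for p in prompts], []
# for _ in range(8):
#     pending = segmented_rollout_step(generate_fn, pending, 16_384, pool)
```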

These mechanisms collectively maintain training efficiency and facilitate progressive mastery across diverse output lengths and difficulty spectra.
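
For the reference-scheduling curriculum, a hedged sketch of the promotion rule follows, assuming references ordered from weakest to strongest and a placeholder pairwise comparator `judge_fn`; these assumptions are illustrative and not Writing-RL's exact implementation.

```python
def schedule_reference(judge_fn, model_output, references, level):
    """Dynamic reference scheduling: a sample advances to a stronger reference
    only after the model output beats its current reference in a pairwise
    comparison; otherwise it keeps training against the same target."""
    beat_current = judge_fn(model_output, references[level])   # True if the model wins
    if beat_current and level + 1 < len(references):
        level += 1                                             # promote to a harder reference
    return level, references[level]
```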

3. Reinforcement Learning Objectives and Importance Sampling

UloRL frameworks build upon PPO and GRPO variants with modifications crucial for stability and diversity. For a group of $G$ responses sampled per prompt with scalar rewards $r_{1:G}$, the group-relative advantage is

$$A_i = \frac{r_i - \mathrm{mean}(r_{1:G})}{\mathrm{std}(r_{1:G})}$$

The surrogate objective is

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\} \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G} \sum_{i=1}^G \min\!\Big( r_i^{\mathrm{ratio}} A_i,\ \mathrm{clip}\big(r_i^{\mathrm{ratio}}, 1-\varepsilon, 1+\varepsilon\big) A_i \Big) \right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

where $\varepsilon$ controls the clipping range, $r_i^{\mathrm{ratio}}$ denotes the importance ratio between the current and old policies for response $o_i$, and $\beta$ weights the KL penalty toward the reference policy $\pi_{\mathrm{ref}}$ (Wu et al., 23 Jun 2025).
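
A minimal sketch of this objective, assuming sequence-level log-probabilities for each of the $G$ responses and a caller-supplied KL estimate; the function and argument names are illustrative, not taken from the cited papers.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2, beta=0.01, kl=None):
    """Clipped group-relative surrogate: `logp_new`/`logp_old` are (G,) summed
    log-probs of each sampled response under the current and old policies,
    and `rewards` holds the (G,) scalar rewards for the group."""
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratio per response.
    ratio = torch.exp(logp_new - logp_old)
    # PPO-style clipped surrogate, averaged over the group.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    objective = torch.min(unclipped, clipped).mean()
    if kl is not None:                      # optional KL penalty toward pi_ref
        objective = objective - beta * kl
    return -objective                       # negate: optimizers minimize
```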

  • Importance Sampling Under Segmentation: Segment-Aware Importance Sampling (SAIS) computes ratios per segment, while Pseudo On-Policy Importance Sampling (POIS) assigns all tokens the ratio of the latest segment checkpoint; empirical results favor POIS for ultra-long outputs (Du et al., 26 Jul 2025). A sketch of the two ratio assignments follows below.
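
The following sketch encodes one possible reading of that distinction: under segmented rollouts, each token's old-policy log-probability can come either from the checkpoint that generated its segment (SAIS) or uniformly from the latest checkpoint (POIS). The tensor layout and the `mode` switch are assumptions for illustration only.

```python
import torch

def importance_ratios(logp_current, logp_segment_ckpt, logp_latest_ckpt, mode="POIS"):
    """Token-level importance ratios under segmented rollouts.

    logp_current:      (T,) token log-probs under the policy being updated
    logp_segment_ckpt: (T,) token log-probs under the checkpoint that actually
                       generated each token's segment
    logp_latest_ckpt:  (T,) token log-probs under the latest checkpoint
    """
    if mode == "SAIS":
        logp_old = logp_segment_ckpt   # segment-aware: per-segment checkpoints
    else:
        logp_old = logp_latest_ckpt    # pseudo on-policy: latest checkpoint for all tokens
    return torch.exp(logp_current - logp_old)
```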

PPO variants are similarly adapted, with generalized advantage estimation (GAE) for advantage calculation and KL penalty coefficients tuned for stability (Lei et al., 6 Jun 2025).
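
For completeness, a generic GAE sketch; the γ and λ values here are conventional defaults, not figures reported by the cited papers, and in the terminal-reward setting above `rewards` would be zero everywhere except the final step.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: `rewards` is a (T,) tensor of per-step
    rewards and `values` is a (T+1,) tensor of value estimates, including a
    bootstrap value for the state after the final step."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # TD residual at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```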

4. Reward Modeling and Planning Architectures

Reward designs in UloRL combine multiple specialized models and verification schemes:

  • Composite Reward Aggregation: LongWriter-Zero employs three orthogonal reward models—length ($r_{\text{length}}$), writing quality ($r_{\text{write}}$), and formatting ($r_{\text{format}}$)—each mapped to a normalized group advantage and combined into $A_{\text{final}}$ (Wu et al., 23 Jun 2025); a sketch of this aggregation appears at the end of this section.
  • Pairwise Comparison Rewards: Writing-RL uses a judge's pairwise verdict between generated and reference outputs, assigned $\{1.0, 0.5, 0.0\}$ and incorporated via the RL surrogate loss (Lei et al., 6 Jun 2025).
  • Verifiable Generative Reward: UloRL introduces a binary generative verifier, assessing semantic equivalence for reasoning tasks such as mathematics and question answering (Du et al., 26 Jul 2025).
  • Planning & Structure via “Think” Prompts: LongWriter-Zero leverages explicit planning stages, scoring intermediate steps and structuring responses with <think> and <answer> tags, yielding improved coherence and structure.

Reward model architectures are a key determinant of learning signals in ultra-long generation, driving progression in output length, quality, and logical structure.
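
As referenced above, here is a hedged sketch of composite reward aggregation in the spirit of LongWriter-Zero: each reward channel is normalized within the sampled group, and the normalized advantages are then combined. The equal-weight sum and the helper names are assumptions, not the paper's exact recipe.

```python
import torch

def combined_advantage(r_length, r_write, r_format, weights=(1.0, 1.0, 1.0)):
    """Each argument is a (G,) tensor of raw scores for a group of G responses;
    the return value plays the role of A_final in the GRPO surrogate."""
    def group_norm(r):
        # Map raw rewards to a group-relative advantage.
        return (r - r.mean()) / (r.std() + 1e-8)

    a_len, a_write, a_fmt = map(group_norm, (r_length, r_write, r_format))
    w_len, w_write, w_fmt = weights
    return w_len * a_len + w_write * a_write + w_fmt * a_fmt
```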

5. Dynamic Masking and Entropy Management

To combat output entropy collapse, UloRL develops dynamic masking schemes:

  • Definition of Well-Mastered Positive Tokens (MPTs): Tokens with policy probability $p(t_i) \geq \tau$ (e.g., $\tau = 0.99$) in positively rewarded sequences (Du et al., 26 Jul 2025).

  • Dynamic Masking of MPTs (DMMPTs): Masking is activated only when the current per-sequence entropy $\bar H$ drops below the target $\sigma$ (e.g., $\sigma = 0.2$):

$$\mathbb{I}_{\mathrm{msk}}^{(i,t)} = \begin{cases} 1 & \bar H_i < \sigma \ \wedge \ o_i^t \in \mathrm{MPTs} \\ 0 & \text{otherwise} \end{cases}$$

The surrogate objective is modified accordingly, allowing entropy to stabilize without unbounded drift or collapse.

Empirical analysis demonstrates that unconditional masking drives entropy upward, while training without masking causes collapse; DMMPTs hold entropy near the target without auxiliary entropy-bonus loss terms and preserve exploration over diverse reasoning paths.
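
A hedged sketch of the masking rule, using the τ and σ values quoted above; the exact definition of a "positively rewarded" token and the way the mask enters the loss are assumptions that may differ from the paper in detail.

```python
import torch

def dmmpt_mask(token_probs, advantages, seq_entropy, tau=0.99, sigma=0.2):
    """Dynamic masking of well-mastered positive tokens (DMMPTs).

    token_probs: (T,) policy probabilities of the sampled tokens
    advantages:  (T,) per-token advantages (positive => positively rewarded)
    seq_entropy: scalar mean entropy of the current sequence
    Returns a boolean mask that is True where a token should be EXCLUDED
    from the surrogate loss.
    """
    well_mastered = token_probs >= tau     # tokens the policy already "knows"
    positive = advantages > 0              # tokens receiving positive credit
    if seq_entropy < sigma:                # mask only once entropy is too low
        return well_mastered & positive
    return torch.zeros_like(well_mastered, dtype=torch.bool)
```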

6. Empirical Results and Ablation Analyses

The principal UloRL variants yield state-of-the-art results across long-form writing, open-ended reasoning, and automated benchmarks:

Model                    Length Limit    Benchmark            Score / Elo    Win Rate
LongWriter-Zero          14K             WritingBench         8.69           ≥62%
                                         Arena-Write (Elo)    1447           98.2%
Writing-RL Qwen2.5-7B    10K             WritingBench         87.23
                                         LongBench v2         31.8
UloRL-Qwen3-30B-A3B      128K            AIME-2025            85.1
                                         BeyondAIME           61.9

Ablation studies underscore:

  • Segmented rollouts yield a >2× throughput gain versus non-segmented approaches (Du et al., 26 Jul 2025).
  • DMMPTs contribute up to 3.6 average points (AIME/BeyondAIME) versus unmasked objectives.
  • Ultra-long output length correlates with steady accuracy gains, with longer outputs bringing roughly 10 average points of improvement.
  • Planning prompts and continual pretraining substantially boost coherence, structure, and Elo.

These results reflect robust generalization, with UloRL-trained models outperforming baselines of up to 235B parameters.

7. Future Directions and Insights

A plausible implication is that UloRL approaches could generalize to hybrid verifier architectures linking generative and symbolic checking for scalable reward functions (Du et al., 26 Jul 2025). Adaptive segment sizing and learned entropy thermostats could further optimize sample efficiency and stability. The finding that RL-trained long-output models generalize well to long-input reasoning tasks suggests a promising avenue for rethinking long-context training protocols (Lei et al., 6 Jun 2025). UloRL thus stands as a paradigm for RL-driven ultra-long generation, transcending SFT’s limitations and establishing benchmarks for reasoning accuracy, structural fidelity, and efficiency in LLM outputs.
