
Ultra-Long Output RL: Concepts & Advances

Updated 28 December 2025
  • Ultra-Long Output RL is a reinforcement learning paradigm enabling LLMs and multimodal agents to generate, reason about, and process ultra-long sequences while maintaining long-horizon coherence.
  • Key techniques include segment rollouts, adaptive stabilization methods, and memory-augmented architectures that mitigate efficiency bottlenecks in processing ultra-long outputs.
  • Empirical benchmarks demonstrate significant performance gains in tasks like video QA and long-context reasoning, highlighting UloRL's scalable and robust approach.

Ultra-Long Output Reinforcement Learning (UloRL) comprises a family of reinforcement learning paradigms and techniques designed to enable large models—primarily LLMs and multimodal agents—to generate, reason about, or process outputs of extreme length (typically tens of thousands to millions of tokens or frames). UloRL is characterized by outcome-driven, high-stability reinforcement learning objectives, architectures and memory mechanisms for maintaining long-horizon coherence, and innovations in training protocols that circumvent the efficiency bottlenecks of classical RL frameworks when handling ultra-long sequences. This article surveys the principal UloRL approaches across generative, reasoning, and agentic tasks, highlighting their mathematical formulations, algorithmic strategies, and empirical advancements.

1. Formal RL Objectives and Group-Relative Policy Optimization

UloRL restructures the classical RL problem for ultra-long sequence generation and multi-stage reasoning. The environment is often formalized as a finite-horizon Markov Decision Process (MDP) $(\mathcal S, \mathcal A, T, R, \gamma)$, where (a minimal sketch of this setup appears after the list):

  • State space $\mathcal S$: represents a partial sequence or structured history, typically including the prompt or query, past actions (e.g., tool calls, generated tokens), and internal model states (hidden activations or thoughts) (Du et al., 26 Jul 2025, Tian et al., 16 Jun 2025, Wu et al., 23 Jun 2025).
  • Action space $\mathcal A$: token generation over the vocabulary $V$, tool invocations, or planner outputs, depending on the setting.
  • Transitions $T$: deterministic, as new tokens or tool actions are appended to the history.
  • Reward $R$: sparse, computed only at trajectory termination, based on output quality, answer correctness, or composite task-specific metrics.
  • Discount $\gamma$: usually set to $1$, since only the terminal outcome matters.
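
The following minimal sketch illustrates this formulation; the `Trajectory` container and the `step` and `terminal_reward` names are illustrative assumptions, not constructs from the cited papers.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Trajectory:
    """One ultra-long rollout: a prompt plus appended actions (tokens or tool calls)."""
    prompt: str
    actions: List[str] = field(default_factory=list)
    done: bool = False

def step(traj: Trajectory, action: str, eos: str = "<eos>") -> Trajectory:
    # Deterministic transition T: the new action is simply appended to the history.
    traj.actions.append(action)
    traj.done = action == eos
    return traj

def trajectory_return(traj: Trajectory,
                      terminal_reward: Callable[[Trajectory], float]) -> float:
    # Sparse reward R with gamma = 1: the return is a single score assigned only
    # once the trajectory terminates.
    return terminal_reward(traj) if traj.done else 0.0
```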

To address variance and instability in traditional policy gradients (e.g., PPO), UloRL employs a group-based surrogate, most commonly Group Relative Policy Optimization (GRPO) (Du et al., 26 Jul 2025, Tian et al., 16 Jun 2025, Wu et al., 23 Jun 2025, Wang et al., 22 Oct 2025, Shen et al., 15 Dec 2025):

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_k\})}{\mathrm{std}(\{r_k\})}$$

$$\mathcal J_\mathrm{GRPO}(\theta) = \mathbb E_{q,\{o_i\}\sim\pi_{\mathrm{old}}} \left[ \frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(r_{i,t}\,\hat{A}_i,\ \mathrm{clip}(r_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\right) \right]$$

where each sample $q$ spawns $G$ candidate rollouts, $r_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t})/\pi_{\mathrm{old}}(o_{i,t}\mid q, o_{i,<t})$ is the per-token importance ratio, and group-normalized advantages mitigate reward sparsity and task heterogeneity. Task-specific normalization is adopted for multi-task scenarios (Shen et al., 15 Dec 2025).
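
As a concrete illustration, the group-normalized advantage and clipped surrogate can be computed as follows. This is a minimal PyTorch sketch assuming padded log-probability tensors; the function names and the small `eps` stabilizer are assumptions, not the implementation of any cited system.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) terminal rewards for the G rollouts of one prompt q.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor,   # (G, T) log-probs under pi_theta (padded)
              logp_old: torch.Tensor,   # (G, T) log-probs under pi_old
              mask: torch.Tensor,       # (G, T) 1 for real tokens, 0 for padding
              rewards: torch.Tensor,    # (G,)   terminal rewards
              clip_eps: float = 0.2) -> torch.Tensor:
    adv = group_advantages(rewards).unsqueeze(1)             # (G, 1), shared across tokens
    ratio = torch.exp(logp_new - logp_old)                   # per-token importance ratio r_{i,t}
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = torch.min(unclipped, clipped) * mask
    # Length-normalize each rollout (1/|o_i|), then average over the group (1/G).
    per_rollout = per_token.sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_rollout.mean()                               # negate: we maximize the objective
```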

2. Segment Rollout and Efficient Ultra-Long Decoding

Standard RL rollouts with ultra-long outputs (e.g., 64–128K tokens) are bottlenecked by the "long-tail" phenomenon: samples with vastly differing decode lengths leave resources idle and delay updates. UloRL introduces segment rollout, dividing the output into $M$ fixed-length segments (Du et al., 26 Jul 2025):

  • Generation proceeds chunk-wise, adding each segment to an unfinished pool if the sample is incomplete.
  • Once a sample reaches EOS or the global maximum, the assembled output is moved to an experience pool for immediate policy update.
  • This enables gradient updates as soon as any sample finishes, eliminating the need to wait for the slowest rollout.

Empirically, segment rollout achieves up to a $2.06\times$ training speedup (e.g., for Qwen3-30B-A3B at 64K output tokens: roughly 1240 s/step with 1 segment versus 601 s/step with 4 segments) (Du et al., 26 Jul 2025).

Experience pools must manage partial and complete sequences, and importance sampling is adapted (pseudo on-policy or segment-aware) so that per-token policy ratios remain stable despite inter-segment checkpointing.
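
The scheduler logic can be sketched as follows. The pool structures, the `decode_segment` interface, and the segment length are illustrative assumptions; production systems interleave this loop with asynchronous inference engines.

```python
from collections import deque

SEG_LEN = 8192        # tokens per segment (illustrative)
MAX_SEGMENTS = 4      # e.g., 4 segments within a fixed total output budget

def segment_rollout_step(prompts, unfinished, experience, decode_segment, eos_id):
    """One scheduler pass: extend every pending sample by at most one segment."""
    # New prompts enter the unfinished pool with an empty partial output.
    for p in prompts:
        unfinished.append({"prompt": p, "tokens": [], "segments": 0})

    still_pending = deque()
    while unfinished:
        sample = unfinished.popleft()
        new_tokens = decode_segment(sample["prompt"], sample["tokens"], SEG_LEN)
        sample["tokens"].extend(new_tokens)
        sample["segments"] += 1
        finished = (new_tokens and new_tokens[-1] == eos_id) \
            or sample["segments"] >= MAX_SEGMENTS
        if finished:
            experience.append(sample)      # ready for an immediate policy update
        else:
            still_pending.append(sample)   # carried over to the next rollout phase
    unfinished.extend(still_pending)
    # The caller trains on `experience` right away instead of waiting for the slowest sample.
```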

3. Stabilization and Exploration Strategies in Ultra-Long RL

Model collapse and overfitting (entropy collapse) are exacerbated in outcome-driven RL for ultra-long sequences. Multiple stabilization strategies are deployed:

a) Dynamic Masking of Well-Mastered Positive Tokens

Tokens occurring in high-reward sequences with very high model confidence (e.g., $p_\theta(t\mid\cdot) \geq \tau = 0.99$) are dynamically masked from further gradient updates when the average entropy of the sample falls below a certain threshold (e.g., $\sigma = 0.2$) (Du et al., 26 Jul 2025). This mechanism (DMMPTs) stabilizes entropy and preserves diversity in the policy's exploration space.
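
A minimal sketch of this masking rule under the stated thresholds follows; the positive-advantage gating and the tensor layout are assumptions about how the rule would typically be wired in.

```python
import torch

def dmmpt_mask(token_probs: torch.Tensor,    # (T,) pi_theta probability of each sampled token
               token_entropy: torch.Tensor,  # (T,) per-token predictive entropy
               advantage: float,             # group-relative advantage of this rollout
               tau: float = 0.99,
               sigma: float = 0.2) -> torch.Tensor:
    """Return 1 where a token keeps receiving gradient, 0 where it is masked out."""
    keep = torch.ones_like(token_probs)
    # Only positive (high-reward) rollouts whose mean entropy has fallen below sigma
    # have their already well-mastered tokens (prob >= tau) removed from the update.
    if advantage > 0 and token_entropy.mean().item() < sigma:
        keep[token_probs >= tau] = 0.0
    return keep
```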

b) Adaptive Entropy-Controlled Policy Optimization (AEPO) and Gradient Clipping

For highly heterogeneous and multi-task UloRL settings, AEPO dynamically masks negative-advantage (exploratory but low-reward) rollouts if the batch entropy exceeds a threshold, but reintroduces them when entropy falls too low, preventing both collapse and stagnation (Shen et al., 15 Dec 2025).

Task-balanced sampling and task-specific advantage normalization are further used to prevent reward sparsity and instability when combining disparate ultra-long datasets (Shen et al., 15 Dec 2025).
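
An illustrative gating rule in the spirit of AEPO is sketched below; the band thresholds `h_high` and `h_low` and the batch-level granularity are assumptions, not values from the cited work.

```python
import torch

def aepo_rollout_mask(advantages: torch.Tensor,  # (N,) group-relative advantages in the batch
                      batch_entropy: float,      # mean policy entropy over the batch
                      h_high: float = 0.8,       # assumed upper entropy bound
                      h_low: float = 0.3) -> torch.Tensor:  # assumed lower entropy bound
    """Return 1 for rollouts that contribute gradient, 0 for rollouts masked out."""
    keep = torch.ones_like(advantages)
    if batch_entropy > h_high:
        # Entropy above the band: mask negative-advantage (exploratory but low-reward) rollouts.
        keep[advantages < 0] = 0.0
    elif batch_entropy < h_low:
        # Entropy below the band: keep every rollout so exploration is not starved.
        keep[:] = 1.0
    return keep
```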

c) Truncated Rollouts and Curriculum Scheduling

Dynamic reference scheduling, as in Writing-RL, maintains difficulty-adaptive per-sample curricula, promoting only those samples for which the policy consistently outperforms a reference, thus gradually increasing the challenge level in an asynchronous and robust manner (Lei et al., 6 Jun 2025).
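
A minimal per-sample scheduler in this spirit might look as follows; the win-rate window, promotion threshold, and ordering of references by difficulty are assumptions rather than the exact Writing-RL recipe.

```python
from collections import defaultdict, deque

WIN_WINDOW = 4      # recent policy-vs-reference comparisons tracked per sample
PROMOTE_AT = 0.75   # promote once the policy wins at least 75% of recent comparisons

class ReferenceScheduler:
    """Track, per training sample, whether the policy beats its current reference."""
    def __init__(self, references):
        self.references = references                     # references[sample_id]: list, easy -> hard
        self.level = defaultdict(int)                    # current difficulty level per sample
        self.history = defaultdict(lambda: deque(maxlen=WIN_WINDOW))

    def current_reference(self, sample_id):
        refs = self.references[sample_id]
        return refs[min(self.level[sample_id], len(refs) - 1)]

    def record(self, sample_id, policy_won: bool):
        hist = self.history[sample_id]
        hist.append(policy_won)
        # Promote only samples the policy consistently outperforms, raising difficulty gradually.
        if len(hist) == WIN_WINDOW and sum(hist) / WIN_WINDOW >= PROMOTE_AT:
            self.level[sample_id] += 1
            hist.clear()
```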

4. Planner and Modular Reasoning Architectures for Ultra-Long Tasks

Ultra-long reasoning tasks often require explicit planning and modular, interpretable execution:

  • Chain-of-Tool-Thought (CoTT): CoTT mechanisms for agentic settings, notably in Ego-R1, interleave explicit planning steps with tool calls (<tool .../>) and information assimilation (<information>...</information>) (Tian et al., 16 Jun 2025). The agent maintains only essential intermediate outputs, using hierarchical retrieval (h-rag) to access the relevant time window in massive temporal data (week→day→hour→10min).
  • Think-Answer Blocks for Ultra-Long Writing: LongWriter-Zero and related models encourage explicit <think> blocks, allowing the model to allocate and plan content across subsequent ultra-long <answer> outputs, improving both global structure and local coherence (Wu et al., 23 Jun 2025).
  • Plan–Retrieve–Reason–Recheck Emergence: RL on ultra-long inputs (e.g., in LoongRL's KeyChain tasks) induces multi-phase generative traces: explicit planning, symbolic retrieval, sequential reasoning, and output verification ("recheck") (Wang et al., 22 Oct 2025).

These architectures address the critical bottleneck of fitting arbitrarily long context or output within the finite window of transformer models.

5. Memory-Augmented and Hierarchical Architectures

When context windows are vastly exceeded (inputs/outputs of 1–4M tokens), memory-augmented agents are introduced (Shen et al., 15 Dec 2025). The agent operates over input splits $\{x_t\}_{t=1}^K$, updating an explicit memory state $m_t$ and a pointer $p_t$ that tracks relevant locations:

$$(m_t, p_t) \sim \pi_\theta(m_{t-1}, p_{t-1}, x_t, q_\mathrm{core})$$

Only the final output is scored, but the agent must iteratively process the massive context, summarizing and propagating essential information across recurrent steps (multi-stage fusion). This paradigm enables near-linear scaling with context length, as opposed to the quadratic cost of standard self-attention.
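
A schematic of this recurrent memory-agent loop is sketched below; the `policy_step` and `answer_step` interfaces and the textual memory representation are assumptions about one plausible realization.

```python
def memory_agent_rollout(chunks, core_query, policy_step, answer_step):
    """Recurrent processing of K input splits, answering from the final memory state.

    policy_step(memory, pointer, chunk, core_query) -> (new_memory, new_pointer)
    answer_step(memory, pointer, core_query)        -> final output (the only scored artifact)
    """
    memory, pointer = "", 0
    for chunk in chunks:  # x_1 ... x_K, each small enough to fit the context window
        # Each step attends only to one chunk plus the compact memory, so total cost
        # grows roughly linearly in the number of chunks rather than quadratically.
        memory, pointer = policy_step(memory, pointer, chunk, core_query)
    return answer_step(memory, pointer, core_query)
```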

6. Empirical Benchmarks and Gains

UloRL methods consistently outperform both standard SFT and baseline PPO/GRPO across a range of ultra-long output and long-context benchmarks:

| Model/config | Task | Baseline | UloRL score | Gain |
|---|---|---|---|---|
| Ego-R1 (3B) | Week-long video QA | 32–36% | 46.0% | +10–14 |
| Writing-RL (7B/8B) | WritingBench/EQBench | ~82–84 | ~84–87 | +2–4 |
| LongWriter-Zero (32B) | WritingBench | 8.55–8.68 | 8.69 | +0.01–0.14 |
| UloRL-A3B-128K (30B) | AIME-2025 | 70.9% | 82.8% | +11.9 |
| QwenLong-L1.5 (30B) | LongBench-V2 | 61.92 | 71.82 | +9.90 |

Critically, as output or input length grows, especially beyond 32K tokens, UloRL models retain or improve performance, whereas SFT and classical RL models degrade. Memory-agent frameworks yield up to +18.3 points of improvement at 1M-token scales over full-context-only baselines (Shen et al., 15 Dec 2025).

7. Limitations, Open Challenges, and Prospects

Despite these advances, UloRL methods face open challenges and avenues for further research:

  • Verifier dependency: Faithful ultra-long evaluation and reward assignment are crucial; generative or LLM-based verifiers are a present bottleneck (Du et al., 26 Jul 2025, Wang et al., 22 Oct 2025).
  • Segment and curriculum selection: Static segmentation or curriculum heuristics may not optimally match the distribution of sample lengths and task difficulty.
  • Computational scaling: While segment rollout and memory agents mitigate quadratic costs, practical limitations on memory and update efficiency remain, especially for 100B+ models or multi-million-token tasks.

Future directions include adaptive segmentation, holistic long-input/long-output RL curricula, joint optimization of architecture and training schedules, integration of long-context mechanisms (linear/sparse attention), per-segment or hierarchical rewards, and broader application domains such as code synthesis, multi-document reasoning, and scientific report generation (Shen et al., 15 Dec 2025, Du et al., 26 Jul 2025, Lei et al., 6 Jun 2025).

Key references: (Du et al., 26 Jul 2025, Shen et al., 15 Dec 2025, Tian et al., 16 Jun 2025, Wu et al., 23 Jun 2025, Lei et al., 6 Jun 2025, Wang et al., 22 Oct 2025).
