Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Published 15 Dec 2025 in cs.CL, cs.AI, and cs.LG | (2512.13607v1)

Abstract: Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

Abstract PDF Upgrade to Chat

Authors (12)

Summary

The paper introduces Nemotron-Cascade, a framework that uses cascaded, domain-specific reinforcement learning to scale general-purpose reasoning models efficiently.
It employs a sequential RL pipeline with explicit mode control, enhancing performance in tasks like alignment, math reasoning, code generation, and software engineering.
Empirical results demonstrate robust benchmark performance while mitigating catastrophic forgetting and optimizing token efficiency through dynamic curriculum tuning.

Formal Summary of "Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models" (2512.13607)

Introduction and Motivation

The paper introduces Nemotron-Cascade, a training framework designed to develop scalable, general-purpose reasoning LLMs via cascaded, domain-wise reinforcement learning (Cascade RL). The central challenge addressed is the heterogeneity of cross-domain tasks—especially the variability in response lengths and reward mechanisms (e.g., symbolic verification for math, execution-based for code, and human preference models for alignment)—which complicates RL infrastructure, slows training, and hinders efficient curriculum and hyperparameter selection in unified LLM post-training.

Nemotron-Cascade resolves these issues by sequentially applying domain-specific RL, rather than joint RL over mixed prompts, minimizing engineering complexity and improving transparency. Models produced within this framework operate both in "thinking" (deep CoT generation) and "instruct" (direct answer) modes, aiming for robust performance on a spectrum of benchmarks, covering alignment, math and code reasoning, competitive programming, and agentless software engineering (SWE).

Cascade RL Pipeline and Training Methodology

Cascade RL is structured as a strictly sequential RL pipeline—RLHF for alignment and helpfulness, followed by instruction following RL (IF-RL), math RL, code RL, and finally SWE RL. This ordering reduces catastrophic forgetting and allows fine-tuning hyperparameters and curricula to each individual domain. RLHF boosts reasoning abilities beyond mere subjective preference optimization, while subsequent RLVR (reward via symbolic or execution-based verification) rarely degrades prior-stage performance and often leads to additional improvement.

The policy model is trained strictly on-policy using Group Relative Policy Optimization (GRPO), avoiding KL regularization, employing a token-level loss, and leveraging dynamic filtering to maintain reward relevance and training stability. Rewards for alignment tasks are sourced from a scalar output reward model (RM) trained using Bradley-Terry objectives on curated human preference datasets (HelpSteer2/3, WorldPM, and custom synthetic negative samples), with ablations substantiating the importance of RM capacity and data diversity. For math and code RL, deterministic verifiers supply binary rewards, while an LLM-based, execution-free reward model guides SWE RL.

Catastrophic forgetting is mitigated in Cascade RL by (i) policy-dependent data sampling, (ii) the cumulative reward focus of RL, (iii) overlapping reward structures across domains, and (iv) prompt decontamination and careful ordering (from general to specialized). Notably, all RL datasets are disjoint from their SFT analogs.

Data, SFT, and Mode Control

Multi-stage supervised fine-tuning (SFT) initializes the models across general, math, code, science, tool-use, and software engineering tasks. For general-domain SFT, prompts produce parallel responses for both thinking and non-thinking modes using state-of-the-art teacher models (DeepSeek-R1-0528, DeepSeek-V3-0324), ensuring stylistic and qualitative consistency. Math, code, and science domains use DeepSeek-R1-based generations with extended token budgets up to 32K for elaborate CoT responses.

Unified models incorporate explicit control flags (/think and /no_think), appended per user prompt and enabling both local and global generation mode control, in contrast to prior system-prompt driven mode-switching. This schema supports dynamic mode switching in multi-turn dialogue, and the flag-based approach acts as a minimal reliable mechanism for controlling behavior, avoiding template-based ambiguity seen in alternative designs (e.g., Qwen3).

Empirical Results and Benchmark Analysis

Nemotron-Cascade achieves high performance across multiple domains, validated on comprehensive benchmarks:

LiveCodeBench v5/v6/Pro: 14B-Thinking surpasses DeepSeek-R1-0528 (the SFT teacher), Gemini-2.5-Pro, and all recent open LLMs, with 77.5%/74.6% pass@1 on LiveCodeBench v5/v6.
IOI 2025: 14B-Thinking attains silver medal, scoring 343.37 with feedback-driven, self-improving test-time scaling, surpassing OpenAI's internal gold model and other competitive baselines.
SWE-bench Verified: Unified 8B and 14B models obtain 37.2% and 43.1% resolve rates, rivaling dedicated 32B SWE agents (DeepSWE-32B).
Alignment and Reasoning: On IFEval strict, ArenaHard, MMLU, and GPQA-Diamond, both dedicated and unified models are competitive, with RLHF/IF-RL stages producing substantial improvements (ArenaHard, IFEval, IFBench).
Math Reasoning: AIME24/25 scores consistently reach 90% after Math RL, independent of initial model state due to robust curriculum and dynamic filtering.

Stage-wise ablation confirms that Cascade RL preserves or even enhances prior domain performance (i.e., reward structures do not induce negative interference), with fine-grained control over reasoning token-efficiency, entropy, and response length as desired for specific benchmarks.

Engineering Innovations and Implications

Nemotron-Cascade advances practical and theoretical state-of-the-art in LLM RL post-training:

Transparent, Modular Training and Datasets: Full recipes and curated data releases facilitate reproducibility and cross-institutional comparison.
Unified Reasoning Control: Even compact 8B models, if properly staged, can learn to integrate both instruct and deep chain-of-thought competencies, challenging typical assumptions of small LLM limitations for "unified" operation.
Execution-Free Reward Modeling for SWE: Enables scalable RL training for repair tasks, bypassing Docker/agent execution bottlenecks.
Entropy Management and Token Efficiency: RLHF and IF-RL are empirically shown to optimize reasoning trace lengths and temperature/entropy settings are critical for sample-efficient code RL.

Cascade RL’s structural insulation from catastrophic forgetting—due to policy-dependent long-term reward maximization and prompt decontamination—enables continual addition of new domains without performance regression.

Future Directions and Theoretical Considerations

Nemotron-Cascade establishes a robust framework for unified reasoning model development, with significant implications:

Generalizability to New Domains: The pipeline is agnostic to specific reward function modality (preference, verification, execution-free semantic similarity, tool call), suggesting straightforward extensibility to emergent tasks.
Reward Model Quality and Scaling Law: Larger reward model backbones correlatively improve stability, alignment, and reward signal informativeness, with 72B RM yielding best RLHF performance. Further scaling and diversity in preference data will likely drive continued policy improvements.
Interplay of RLHF and RLVR: There is clear complementarity; RLHF increases general alignment and response quality, while RLVR stages enhance accuracy on verifiable reasoning tasks, with minimal conflict.
Long-Context Capabilities and Test-Time Scaling: Input length expansion, token scaling, and self-improving test-time pipelines (with feedback) are instrumental in maximizing real-world performance (e.g., IOI, SWE tasks), even in constrained inference budgets.

Unified models that offer explicit user-mode control, reduced engineering complexity, and post-training transparency may drive next-generation AGI-capable LLMs, as further integration of thinking/instruct capabilities and multi-domain RL become standard. Open RL recipes and empirical scaling in Cascade RL will facilitate convergence toward robust, general-purpose reasoning agents.

Conclusion

Nemotron-Cascade exemplifies how strictly cascaded, domain-wise reinforcement learning can scale general-purpose reasoning LLMs across diverse domains, maintaining or enhancing prior capabilities without catastrophic forgetting. By combining explicit mode control, high-quality transparent data, robust on-policy RL recipes, and execution-free reward models, the framework achieves strong empirical results and offers clear practical and theoretical implications for the future of LLM alignment and reasoning. Further research into large-scale reward modeling, multi-stage RLVR orchestration, and feedback-driven test-time scaling is poised to inform future advances in unified AI reasoning agents.

Markdown Report Issue