Nemotron-Cascade: Scalable Domain-wise RL
- Nemotron-Cascade is a scalable domain-wise reinforcement learning approach that sequentially optimizes tasks across different domains.
- It uses a modular cascade RL workflow to mitigate cross-domain interference and infrastructure bottlenecks in large language models.
- The methodology achieves state-of-the-art results on math, code generation, and software engineering benchmarks.
Nemotron-Cascade refers to a scalable methodology for domain-wise reinforcement learning (RL) in the context of large general-purpose LLMs, with an explicit focus on robust reasoning and task specialization across mathematical proof, instruction following, code generation, and software engineering. The Nemotron-Cascade workflow emphasizes sequential, domain-specific RL stages—rather than joint multi-domain optimization—addressing infrastructure bottlenecks, training curriculum complexity, and cross-domain interference frequently encountered in large-scale RLHF and RLVR pipelines. The architecture supports both "instruct" and "deep thinking" modes under a unified schema, and achieves state-of-the-art results on multiple academic benchmarks (Wang et al., 15 Dec 2025).
1. Architectural Principles and Motivation
Nemotron-Cascade is designed to address severe cross-domain heterogeneity intrinsic to general-purpose reasoning models. The primary motivation arises from the observation that reasoning tasks—including Q&A, instruction following, proof generation, competitive coding, and software patching—present domain-specific verification costs and divergent response-length requirements. For example, mathematical proof verification can leverage fast, rule-based verifiers, while code generation necessitates slow execution-based testing, and alignment via RLHF relies on high-latency reward models.
Mixing domains within a single RL stage results in curriculum inefficiencies, infrastructure bottlenecks (waiting on slow verifiers), and brittle hyperparameter tuning. Nemotron-Cascade circumvents these limitations with sequential, domain-wise RL, enabling each stage to utilize custom verifiers, curricula, and reward shaping without cross-domain deadlock.
2. Cascade RL Workflow and Algorithmic Details
The core methodology is Cascade RL—a fixed-order, sequential RL pipeline where each domain is optimized with its own reward function and verifier:
- RLHF (Reinforcement Learning from Human Feedback): General human-preference alignment using a 72B reward model trained on ~82K preference pairs.
- Instruction-Following RL: Enforcement of explicit instruction satisfaction using deterministic verifiers. Rewards combine rule-based IF checks and a scaled RLHF signal.
- Math RL: Symbolic math reasoning with dynamic token-budget curricula and rule-based answer checkers.
- Code RL: Execution-based code verification (unit tests), employing higher sampling temperature for exploration.
- SWE (Software Engineering) RL: Software patching (GitHub repair) tasks with execution-free, hybrid lexical-semantic reward computed by a 72B LLM.
Each stage employs an on-policy, token-level REINFORCE-style objective without a KL penalty, with rollout-level rewards normalized per group. The Group Relative Policy Optimization (GRPO) loss is

$$
\mathcal{L}(\theta) = -\frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \hat{A}_i \, \log \pi_\theta\!\left(o_{i,t} \mid q,\, o_{i,<t}\right),
\qquad
\hat{A}_i = \frac{R(q, o_i) - \operatorname{mean}\!\left(\{R(q, o_j)\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{R(q, o_j)\}_{j=1}^{G}\right)},
$$

where $q$ is the prompt, $o_{i,t}$ is the token at timestep $t$ of rollout $i$, $G$ is the number of rollouts per prompt, and $R(q, o_i)$ is the stage-specific rollout-level reward.
Pseudocode for one RL stage:
```python
for step in range(N_steps):
    q_batch = sample_prompts(D)                                      # domain-specific prompt pool
    rollouts = [generate_rollouts(q, G, model=θ) for q in q_batch]   # G on-policy rollouts per prompt
    rewards = [R(q, o)                                               # stage-specific verifier / reward model
               for q, group in zip(q_batch, rollouts) for o in group]
    norm_rewards = normalize(rewards, groupby=q_batch)               # group-relative (per-prompt) normalization
    θ = update_weights(θ, policy_gradient(norm_rewards, rollouts))   # token-level REINFORCE update, no KL term
```
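The group-wise normalization in the pseudocode can be made concrete with a minimal NumPy sketch (the flat reward ordering, the `groupby` signature, and the epsilon guard are illustrative assumptions; the paper only specifies that rollout-level rewards are normalized per group):

```python
import numpy as np

def normalize(rewards, groupby, eps=1e-6):
    """Group-relative advantages: standardize each rollout's scalar reward
    against the mean/std of its own prompt's G rollouts. Assumes `rewards` is
    a flat list of len(groupby) * G values with each prompt's rollouts stored
    contiguously; eps is an assumed guard against zero variance."""
    G = len(rewards) // len(groupby)
    advantages = []
    for i in range(len(groupby)):
        group = np.asarray(rewards[i * G:(i + 1) * G], dtype=np.float64)
        advantages.extend((group - group.mean()) / (group.std() + eps))
    return advantages

# Two prompts, G = 4 rollouts each; every token of rollout i later shares the
# same advantage in the token-level REINFORCE-style loss (no KL term).
print(normalize([1.0, 0.0, 0.0, 1.0, 0.2, 0.9, 0.4, 0.9], groupby=["q1", "q2"]))
```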
Prompt sets are disjoint across domains (e.g., math prompts are removed from the RLHF pool), reducing destructive interference and catastrophic forgetting. The RL process resists forgetting because on-policy sampling keeps exercising previously acquired capabilities as long as they remain rewarded [(Wang et al., 15 Dec 2025), §4.1.1].
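The overall cascade can be summarized as a short driver loop, sketched below under stated assumptions: `StageConfig`, `run_rl_stage`, and the stub reward callables are hypothetical names, the token budgets and rollout counts follow Section 4.2 where given (otherwise marked as assumed), and each stage reads from its own disjoint prompt pool.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StageConfig:
    name: str
    prompts: List[str]                       # disjoint, domain-specific prompt pool
    reward_fn: Callable[[str, str], float]   # stage-specific verifier or reward model
    max_tokens: int                          # final response-length budget of the stage
    rollouts: int                            # rollouts per prompt (G)

def run_rl_stage(policy, cfg: StageConfig):
    """Placeholder for the GRPO-style inner loop shown in the pseudocode above."""
    print(f"{cfg.name}: {cfg.rollouts} rollouts/prompt, {cfg.max_tokens}-token budget")
    return policy

# Toy stand-ins for the verifiers and reward models described above.
reward_model = if_checker = math_checker = unit_tests = patch_judge = lambda q, o: 0.0

stages = [
    StageConfig("RLHF",    ["..."], reward_model, max_tokens=12_000, rollouts=8),
    StageConfig("IF-RL",   ["..."], if_checker,   max_tokens=16_000, rollouts=8),
    StageConfig("Math RL", ["..."], math_checker, max_tokens=40_000, rollouts=8),   # rollout count assumed
    StageConfig("Code RL", ["..."], unit_tests,   max_tokens=48_000, rollouts=8),   # rollout count assumed
    StageConfig("SWE RL",  ["..."], patch_judge,  max_tokens=24_000, rollouts=16),
]

policy = "sft_checkpoint"   # training starts from the SFT model
for cfg in stages:          # fixed order; no joint multi-domain mixing
    policy = run_rl_stage(policy, cfg)
```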
3. Unified Generation Modes and ChatML Templates
Nemotron-Cascade models operate in both instruct and deep thinking modes, selectable at generation time via ChatML template flags:
- “/no_think”: Instruct mode (instant answer, no chain-of-thought).
- “/think”: Deep thinking mode (the model emits a `<think>…</think>` block with explicit reasoning steps prior to the final answer).
Batches are split (e.g., RLHF "Half-Half" training) to assign roughly equal probability mass to both prompt types, driving robust cross-mode transfer.
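A minimal sketch of how mode-flagged ChatML prompts and a Half-Half batch could be assembled follows (the exact flag placement and template rendering are assumptions; only the “/think”/“/no_think” flags and the Half-Half split are specified in the source):

```python
import random

def chatml_prompt(user_msg: str, think: bool) -> str:
    """Render a ChatML-style prompt; placing the mode flag in the system turn
    is an illustrative assumption."""
    mode_flag = "/think" if think else "/no_think"
    return (
        "<|im_start|>system\n"
        f"{mode_flag}<|im_end|>\n"
        "<|im_start|>user\n"
        f"{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def half_half_batch(user_msgs):
    """RLHF 'Half-Half' batching: roughly equal probability mass assigned to
    thinking and non-thinking prompts within every batch."""
    modes = [i % 2 == 0 for i in range(len(user_msgs))]
    random.shuffle(modes)
    return [chatml_prompt(m, think=t) for m, t in zip(user_msgs, modes)]

print(half_half_batch(["Prove that sqrt(2) is irrational.",
                       "Summarize this paragraph in one sentence."])[0])
```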
4. Training Regimes, Data, and Curricula
4.1 Supervised Fine-Tuning (SFT)
A two-stage SFT process establishes a broad base:
- Stage 1 (16K tokens): General domain data with math, code, science in thinking mode.
- Stage 2 (32K tokens): Addition of tool-calling and software-engineering data, primarily in thinking mode, but with parallel non-thinking general responses.
Extensive cleaning (n-gram decontamination, cross-validation) and balancing yield approximately 2.8M general, 2.8M math, 1.4M code, and 634K science samples.
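As an illustration of the decontamination step, a hedged sketch of word-level n-gram filtering is shown below (the window size n = 13 and the exact matching rule are assumptions; the source only lists n-gram decontamination among the cleaning steps):

```python
def ngram_set(text: str, n: int = 13):
    """All word-level n-grams of a string (n = 13 is an assumed window size)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_samples, benchmark_prompts, n: int = 13):
    """Drop any training sample that shares an n-gram with a benchmark prompt."""
    bench_ngrams = set().union(*(ngram_set(p, n) for p in benchmark_prompts))
    return [s for s in train_samples if ngram_set(s, n).isdisjoint(bench_ngrams)]
```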
4.2 RL Stages
Domain-specific data and hyperparameters:
| RL Stage | Data | Key Hyperparams |
|---|---|---|
| RLHF | 82K preference pairs | 8 rollouts/prompt, max 12K, LR 2e-6 |
| IF-RL | ~100K constraints | 8 rollouts, 2 stages (8K/16K), 2K+1K steps |
| Math RL | 14K Olympiad problems | 3 stages (24K/32K/40K tokens), T=0.8–1.2 |
| Code RL | 9.8K code prompts, unit test suite | Single stage (44K/48K), T=1.0, LR 4e-6 |
| SWE RL | 127K repair tasks | 2 context stages (16K/24K), 16 rollouts |
Dynamic curriculum (e.g., longer token budgets, filtering solved/unsolved items) is employed in Math RL and SWE RL.
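A minimal sketch of this curriculum logic, under assumptions (the 0/1 pass-rate cutoffs, `measure_pass_rates`, and `run_math_rl` are hypothetical; the 24K/32K/40K budgets follow the Math RL row of the table above):

```python
def measure_pass_rates(prompts):
    """Placeholder: roll out the current policy and score with the domain verifier."""
    return [0.5 for _ in prompts]

def curriculum_filter(prompts, pass_rates):
    """Keep items the policy neither always solves nor never solves
    (strict 0/1 cutoffs are an illustrative assumption)."""
    return [p for p, r in zip(prompts, pass_rates) if 0.0 < r < 1.0]

# Math-RL-style schedule: re-filter the pool, then train with a growing token budget.
prompts = ["p1", "p2", "p3"]
for budget in [24_000, 32_000, 40_000]:
    prompts = curriculum_filter(prompts, measure_pass_rates(prompts))
    # run_math_rl(prompts, max_tokens=budget)   # one RL sub-stage at this budget
```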
5. Empirical Performance and Benchmark Outcomes
Nemotron-Cascade models achieve the following representative results (all on unified instruct/thinking models):
- MMLU/MMLU-Pro EM: Up to 85.1/77.0 (14B thinking model), an improvement over the SFT teacher.
- ArenaHard alignment: Up to 95.7% (8B).
- IFEval: 90.2% (14B).
- Math (AIME24/25): 90.4%/83.3% (14B).
- Code (LCB v6): 71.1% (8B), 74.6% (14B-think), exceeding DeepSeek-R1-0528.
- SWE-bench Verified: 43.1% (14B), outperforming specialty models >32B.
- Competitive Coding (IOI 2025): Silver-medal performance (score 343.37, below the 438.30 gold-medal threshold) using a feedback-driven test-time scaling mechanism that spends up to 128K total tokens with 20 candidates per round, iteratively embedding judge verdicts back into the prompt (sketched below).
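A hedged sketch of such a feedback loop follows (the `generate` and `judge` callables, the word-count token proxy, and the [0, 1] scoring convention are assumptions; the 20-candidates-per-round and 128K-token figures come from the result above):

```python
def test_time_scaling(problem: str, generate, judge,
                      candidates_per_round: int = 20,
                      token_budget: int = 128_000):
    """Iterative candidate generation with judge-verdict feedback embedded in
    the next round's prompt (stopping rule and prompt format are illustrative)."""
    prompt, tokens_used, best = problem, 0, None
    while tokens_used < token_budget:
        candidates = [generate(prompt) for _ in range(candidates_per_round)]
        tokens_used += sum(len(c.split()) for c in candidates)   # crude token proxy
        verdicts = [judge(problem, c) for c in candidates]       # assumed scores in [0, 1]
        best = max(zip(candidates, verdicts), key=lambda cv: cv[1])[0]
        if max(verdicts) >= 1.0:                                 # full score reached
            break
        feedback = "\n".join(f"Attempt {i}: score {v:.2f}" for i, v in enumerate(verdicts))
        prompt = f"{problem}\n\nPrevious attempts and judge verdicts:\n{feedback}"
    return best
```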
Ablation reveals each RL stage rarely degrades prior-stage benchmarks and frequently confers small cross-domain improvements.
6. Stability, Scalability, and Limitations
Cascade RL’s modular pipeline permits domain-level verifier and curriculum selection, enabling scalable infrastructure: each stage customizes its own batch sizes, token budgets, curriculum schedules, and reward shaping. On-policy GRPO with a token-level loss and no KL penalty against a reference model yields consistently stable training across all stages.
- Resistance to forgetting: Sequential order (RLHF → IF → Math → Code → SWE) with disjoint prompt pools and on-policy RL results in minimal catastrophic forgetting.
- Unified model advantages: Cross-mode transfer is effective; unified 8B models perform comparably to dedicated "thinking" versions while supporting both generation modes.
- Known limitations: IFEval sometimes degrades after RLHF (prompt overlap), long-context capacity beyond 32K is underutilized (Math/SWE rely on up to 40K), and further improvements require stronger reward models and advanced attention mechanisms.
7. Data Availability and Future Directions
All model weights, SFT and RL-stage recipes, and benchmark scripts are publicly released at https://huggingface.co/collections/nvidia/nemotron-cascade to facilitate community benchmarking and extensibility (Wang et al., 15 Dec 2025). Plans include scaling up reward-model capacity, enhancing long-context attention, and developing joint multi-stage RLHF+RLVR optimization. A plausible implication is that modular Cascade RL architectures will enable further advances as task heterogeneity and evaluation complexity increase in next-generation reasoning models.