
Nemotron-Cascade: Scalable Domain-wise RL

Updated 16 December 2025
  • Nemotron-Cascade is a scalable domain-wise reinforcement learning approach that sequentially optimizes tasks across different domains.
  • It uses a modular cascade RL workflow to mitigate cross-domain interference and infrastructure bottlenecks in large language models.
  • The methodology achieves state-of-the-art results on benchmarks for math proofs, code generation, and software engineering evaluations.

Nemotron-Cascade refers to a scalable methodology for domain-wise reinforcement learning (RL) in the context of large general-purpose LLMs, with an explicit focus on robust reasoning and task specialization across mathematical proof, instruction following, code generation, and software engineering. The Nemotron-Cascade workflow emphasizes sequential, domain-specific RL stages—rather than joint multi-domain optimization—addressing infrastructure bottlenecks, training curriculum complexity, and cross-domain interference frequently encountered in large-scale RLHF and RLVR pipelines. The architecture supports both "instruct" and "deep thinking" modes under a unified schema, and achieves state-of-the-art results on multiple academic benchmarks (Wang et al., 15 Dec 2025).

1. Architectural Principles and Motivation

Nemotron-Cascade is designed to address severe cross-domain heterogeneity intrinsic to general-purpose reasoning models. The primary motivation arises from the observation that reasoning tasks—including Q&A, instruction following, proof generation, competitive coding, and software patching—present domain-specific verification costs and divergent response-length requirements. For example, mathematical proof verification can leverage fast, rule-based verifiers, while code generation necessitates slow execution-based testing, and alignment via RLHF relies on high-latency reward models.

Mixing domains within a single RL stage results in curriculum inefficiencies, infrastructure bottlenecks (waiting on slow verifiers), and brittle hyperparameter tuning. Nemotron-Cascade circumvents these limitations with sequential, domain-wise RL, enabling each stage to utilize custom verifiers, curricula, and reward shaping without cross-domain deadlock.

2. Cascade RL Workflow and Algorithmic Details

The core methodology is Cascade RL, a fixed-order sequential pipeline in which each domain is optimized with its own reward function and verifier; a configuration sketch follows the list:

  1. RLHF (Reinforcement Learning from Human Feedback): General human-preference alignment using a 72B reward model trained on ~82K preference pairs.
  2. Instruction-Following RL: Enforcement of explicit instruction satisfaction using deterministic verifiers. Rewards combine rule-based IF checks and a scaled RLHF signal.
  3. Math RL: Symbolic math reasoning with dynamic token-budget curricula and rule-based answer checkers.
  4. Code RL: Execution-based code verification (unit tests), employing higher sampling temperature for exploration.
  5. SWE (Software Engineering) RL: Software patching (GitHub repair) tasks with execution-free, hybrid lexical-semantic reward computed by a 72B LLM.
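
To make the fixed-order, per-domain structure concrete, the sketch below expresses the cascade as a list of stage configurations. The stage names, verifiers, and reward sources follow the list above; the StageConfig dataclass, its field names, and the rl_stage trainer are illustrative assumptions, not the released implementation.

from dataclasses import dataclass

@dataclass
class StageConfig:
    """Illustrative per-stage settings; field names are assumptions."""
    name: str
    verifier: str      # how rollouts are scored in this domain
    reward: str        # reward signal used for policy updates
    prompt_pool: str   # disjoint prompt set for this stage

# Fixed cascade order; each stage starts from the previous stage's weights.
CASCADE = [
    StageConfig("RLHF",    "72B reward model",        "scalar preference score",  "general chat"),
    StageConfig("IF-RL",   "deterministic IF checks", "rule-based + scaled RLHF", "instruction constraints"),
    StageConfig("Math RL", "rule-based answer check", "binary correctness",       "olympiad math"),
    StageConfig("Code RL", "unit-test execution",     "pass/fail on tests",       "competitive coding"),
    StageConfig("SWE RL",  "72B LLM judge, no exec",  "lexical-semantic match",   "GitHub repair tasks"),
]

def rl_stage(model, stage):
    # Placeholder for one domain-wise RL stage (see the pseudocode below).
    print(f"Running {stage.name} with verifier: {stage.verifier}")
    return model

def run_cascade(model):
    for stage in CASCADE:   # strictly sequential, never joint
        model = rl_stage(model, stage)
    return model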

Each stage employs an on-policy, token-level REINFORCE-style objective without a KL penalty. Rollout-level rewards $r_i$ are normalized per group:

$$\tilde r_i = \frac{r_i - \mathrm{mean}(\{r_j\})}{\mathrm{std}(\{r_j\})}$$

The group relative policy optimization (GRPO) loss is:

$$J_{\mathrm{GRPO}}(\theta) = -\sum_{i=1}^{G} \sum_{t=1}^{T_i} \tilde r_i \,\log \pi_\theta(o_{i,t} \mid q,\, o_{i,<t})$$

where $q$ is the prompt, $G$ is the number of rollouts per prompt, $T_i$ is the length of rollout $i$, and $o_{i,t}$ is the token at timestep $t$ of rollout $i$.

Pseudocode for one RL stage:

# One Cascade RL stage: D = domain prompt pool, G = rollouts per prompt,
# R = the domain's verifier/reward function, θ = current policy weights.
for step in range(N_steps):
    q_batch = sample_prompts(D)                                     # draw a prompt batch from this domain's pool
    rollouts = [generate_rollouts(q, G, model=θ) for q in q_batch]  # G on-policy rollouts per prompt
    rewards = [[R(q, o) for o in group]                             # score every rollout in each
               for q, group in zip(q_batch, rollouts)]              # group with the domain verifier
    norm_rewards = normalize(rewards, groupby=q_batch)              # group-wise mean/std normalization
    θ = update_weights(θ, policy_gradient(norm_rewards, rollouts))  # token-level REINFORCE update, no KL penalty
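
As a concrete reading of the normalize and policy_gradient steps above, the NumPy sketch below implements the group-wise reward normalization and the per-group token-level GRPO objective from the formulas in this section. It is a minimal illustration, not the released training code; the epsilon guard against zero variance is an added assumption.

import numpy as np

def normalize(rewards_per_prompt, eps=1e-6):
    """Group-wise normalization: each prompt's G rewards -> zero mean, unit std."""
    out = []
    for group in rewards_per_prompt:              # G rollout rewards for one prompt
        g = np.asarray(group, dtype=np.float64)
        out.append((g - g.mean()) / (g.std() + eps))
    return out

def grpo_loss(group_norm_rewards, group_token_logprobs):
    """Token-level GRPO loss for one prompt's group of G rollouts:
    -sum_i sum_t  r~_i * log pi(o_{i,t} | q, o_{i,<t})."""
    loss = 0.0
    for r_tilde, logps in zip(group_norm_rewards, group_token_logprobs):
        loss -= r_tilde * np.sum(logps)   # r~_i weights every token of rollout i
    return loss

# Toy usage: 3 rollouts for one prompt with raw rewards 1, 0, 1.
r_tilde = normalize([[1.0, 0.0, 1.0]])[0]
logps = [np.log([0.9, 0.8]), np.log([0.5]), np.log([0.7, 0.6, 0.9])]
print(grpo_loss(r_tilde, logps))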

Prompt sets are disjoint across domains (e.g., math prompts are removed from the RLHF pool), reducing destructive interference and catastrophic forgetting. The RL process resists forgetting because on-policy training keeps previously acquired behaviors in the sampling distribution; they degrade only if they stop being rewarded [(Wang et al., 15 Dec 2025), §4.1.1].

3. Unified Generation Modes and ChatML Templates

Nemotron-Cascade models operate in both instruct and deep thinking modes, selectable at generation time via ChatML template flags:

  • “/no_think”: Instruct mode (direct answer, no chain-of-thought).
  • “/think”: Deep thinking mode (the model emits a <think> ... </think> block with explicit reasoning steps before the final answer).

Batches are split (e.g., RLHF "Half-Half" training) to assign roughly equal probability mass to both prompt types, driving robust cross-mode transfer.
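
A minimal sketch of how the mode flag might appear in a ChatML-style request. The placement of the flag in the system turn is an assumption based on common conventions, not a verified copy of the released template.

# Hypothetical ChatML-style request toggling the generation mode via a flag.
messages_think = [
    {"role": "system", "content": "/think"},     # deep thinking mode
    {"role": "user",   "content": "Prove that the sum of two even numbers is even."},
]
messages_instruct = [
    {"role": "system", "content": "/no_think"},  # instruct mode: direct answer
    {"role": "user",   "content": "Prove that the sum of two even numbers is even."},
]
# In thinking mode, the response opens with a <think> ... </think> block of
# explicit reasoning, followed by the final answer; in instruct mode the
# model answers directly.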

4. Training Regimes, Data, and Curricula

4.1 Supervised Fine-Tuning (SFT)

A two-stage SFT process establishes a broad base:

  • Stage 1 (16K tokens): General domain data with math, code, science in thinking mode.
  • Stage 2 (32K tokens): Addition of tool-calling and software-engineering data, primarily in thinking mode, but with parallel non-thinking general responses.

Extensive cleaning (n-gram decontamination, cross-validation) and balancing yield approximately 2.8M general, 2.8M math, 1.4M code, and 634K science samples.
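
The n-gram decontamination step admits a simple sketch: drop any training sample that shares a long n-gram with a benchmark item. The n-gram length of 13 and whitespace tokenization below are illustrative assumptions; the paper's exact procedure may differ.

def ngrams(text, n=13):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample, benchmark_texts, n=13):
    """Flag a training sample sharing any long n-gram with a benchmark item."""
    grams = ngrams(sample, n)
    return any(grams & ngrams(b, n) for b in benchmark_texts)

# Usage: samples = [s for s in samples if not is_contaminated(s, benchmarks)]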

4.2 RL Stages

Domain-specific data and hyperparameters:

| RL Stage | Data | Key Hyperparameters |
|---|---|---|
| RLHF | 82K preference pairs | 8 rollouts/prompt, max 12K tokens, LR 2e-6 |
| IF-RL | ~100K constraint prompts | 8 rollouts, 2 stages (8K/16K tokens), 2K+1K steps |
| Math RL | 14K Olympiad problems | 3 stages (24K/32K/40K tokens), T=0.8–1.2 |
| Code RL | 9.8K code prompts + unit-test suite | single stage (44K/48K tokens), T=1.0, LR 4e-6 |
| SWE RL | 127K repair tasks | 2 context stages (16K/24K tokens), 16 rollouts |

Dynamic curricula (e.g., progressively longer token budgets, filtering out fully solved or unsolved items) are employed in Math RL and SWE RL; a sketch of the filtering step follows.
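
A minimal sketch of the solved/unsolved filtering: under group-normalized rewards, a prompt whose G rollouts are all correct or all incorrect yields zero advantage and hence no gradient, so the curriculum can drop it. The thresholds and helper names below are illustrative assumptions.

def filter_prompts(prompt_pass_rates, lo=0.0, hi=1.0):
    """Keep prompts whose measured pass rate is strictly between lo and hi;
    fully solved (rate == 1.0) and fully unsolved (rate == 0.0) prompts
    produce zero group-normalized advantage and contribute no gradient."""
    return [p for p, rate in prompt_pass_rates.items() if lo < rate < hi]

# Toy usage: pass rates measured over G rollouts per prompt.
rates = {"p1": 0.0, "p2": 0.25, "p3": 1.0, "p4": 0.75}
print(filter_prompts(rates))   # ['p2', 'p4']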

5. Empirical Performance and Benchmark Outcomes

Nemotron-Cascade models achieve the following representative results (all on unified instruct/thinking models):

  • MMLU/Pro EM: Up to 85.1/77.0 (14B think model), +10–11 points over the SFT teacher.
  • ArenaHard alignment: Up to 95.7% (8B).
  • IFEval: 90.2% (14B).
  • Math (AIME24/25): 90.4%/83.3% (14B).
  • Code (LCB v6): 71.1% (8B), 74.6% (14B-think), exceeding DeepSeek-R1-0528.
  • SWE-bench Verified: 43.1% (14B), outperforming specialty models >32B.
  • Competitive Coding (IOI 2025): "Silver-medal" performance (343.37 points vs. the 438.30 gold-medal threshold) using a feedback-driven test-time scaling mechanism that incorporates up to 128K total tokens, 20 candidates per round, and iterative embedding of judge verdicts in the prompt.
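
The sketch below illustrates one way such a feedback-driven test-time scaling loop can be organized: sample candidates, collect judge verdicts, and embed the verdicts in the next round's prompt. The round count and the generate/judge interfaces are hypothetical; only the 20-candidates-per-round figure comes from the source.

from collections import namedtuple

Verdict = namedtuple("Verdict", "score summary")

def solve_with_feedback(problem, generate, judge, rounds=4, n=20):
    """Feedback-driven test-time scaling: sample n candidates per round,
    grade them, and fold the verdicts back into the next round's prompt."""
    prompt, best, best_score = problem, None, float("-inf")
    for _ in range(rounds):
        cands = generate(prompt, n)                    # e.g., sample the model
        verdicts = [judge(problem, c) for c in cands]  # grader feedback per candidate
        for c, v in zip(cands, verdicts):
            if v.score > best_score:
                best, best_score = c, v.score
        feedback = "\n".join(v.summary for v in verdicts)
        prompt = f"{problem}\n\nPrevious verdicts:\n{feedback}"
    return best

# Toy stand-ins; a real run would sample the model and call the official grader.
gen = lambda prompt, n: [f"candidate_{i}" for i in range(n)]
jdg = lambda prob, c: Verdict(score=len(c), summary=f"{c}: wrong answer on test 3")
print(solve_with_feedback("IOI task", gen, jdg, rounds=2, n=3))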

Ablations reveal that each RL stage rarely degrades prior-stage benchmarks and frequently confers small cross-domain improvements.

6. Stability, Scalability, and Limitations

Cascade RL’s modular pipeline permits domain-level verifier and curriculum selection, enabling scalable infrastructure: each stage customizes its own batch sizes, token budgets, curriculum schedules, and reward shaping. On-policy GRPO with a token-level loss and no KL penalty against a reference model yields consistently stable training across all stages.

  • Resistance to forgetting: Sequential order (RLHF → IF → Math → Code → SWE) with disjoint prompt pools and on-policy RL results in minimal catastrophic forgetting.
  • Unified model advantages: Cross-mode transfer is effective; unified 8B models perform comparably to dedicated "thinking" versions while supporting both generation modes.
  • Known limitations: IFEval sometimes degrades after RLHF (prompt overlap), long-context capacity beyond 32K is underutilized (Math/SWE rely on up to 40K), and further improvements require stronger reward models and advanced attention mechanisms.

7. Data Availability and Future Directions

All model weights, SFT and RL-stage recipes, and benchmark scripts are publicly released at https://huggingface.co/collections/nvidia/nemotron-cascade to facilitate community benchmarking and extensibility (Wang et al., 15 Dec 2025). Plans include scaling up reward-model capacity, enhancing long-context attention, and developing joint multi-stage RLHF+RLVR optimization. A plausible implication is that modular Cascade RL architectures will enable further advances as task heterogeneity and evaluation complexity increase in next-generation reasoning models.
