
AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy (2506.13284v1)

Published 16 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B

Summary

  • The paper introduces a 7B model combining SFT and RL, achieving notable improvements on math and coding benchmarks.
  • It demonstrates that scaling prompt diversity in SFT yields greater performance gains than simply increasing responses per prompt.
  • The study shows that a stage-wise RL curriculum, in which early math-only phases compress reasoning, efficiently boosts long-chain problem solving.

This paper, "AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy" (2506.13284), investigates the interplay between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to enhance the mathematical and coding reasoning capabilities of LLMs. The work culminates in the AceReason-Nemotron-1.1 7B model, which achieves state-of-the-art performance among Qwen2.5-7B-based models on math and code benchmarks.

The core of the research focuses on optimizing both SFT and RL stages and understanding their synergistic effects.

Supervised Fine-Tuning (SFT) Strategy

The SFT process begins with meticulous data curation.

  • Prompt Collection and Filtering: Math prompts are sourced from datasets like AceMath, NuminaMath, and OpenMathReasoning. Code prompts come from TACO, APPS, OpenCoder-Stage2, and OpenCodeReasoning. After deduplication and decontamination (filtering samples with 9-gram overlap against test sets; a minimal sketch of such a filter appears after this list), DeepSeek-R1 is used to generate responses. To balance difficulty, simpler prompts (responses < 2000 tokens) are partially filtered out, resulting in 247K math and 136K code prompts.
  • Scaling SFT Data: The impact of scaling SFT data is explored along two axes:

    1. Increasing the number of unique prompts.
    2. Increasing the number of generated responses per prompt.

    Seven SFT datasets (v1 to v7) were created, scaling from 36K to 2.2M total samples. All SFT runs are initialized from Qwen2.5-Math-7B, with rope_theta increased from 10,000 to 1,000,000 to support a 128K context length.
  • SFT Findings:

    • Scaling both the number of prompts and the number of responses per prompt significantly improves reasoning performance; however, increasing the number of unique prompts generally yields more substantial gains. A multiple linear regression $z = a \cdot \log_2 x + b \cdot \log_2 y + c$ (where $x$ is the prompt count and $y$ is the number of responses per prompt) gives coefficients $a = 4.831$ and $b = 2.635$, suggesting that prompt diversity is the more impactful axis; a fit of this form is sketched after this list. When collecting diverse prompts becomes difficult, increasing responses per prompt is a practical alternative.
    • Performance consistently improves up to the 5th SFT epoch, plateauing between the 5th and 6th epochs. This suggests a degree of "overfitting" can be beneficial for long chain-of-thought (CoT) generation, possibly due to exposure bias in autoregressive models.
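As a concrete illustration of the 9-gram decontamination step mentioned above, here is a minimal Python sketch; the function names and the word-level tokenization are assumptions for illustration, not the released data pipeline.

```python
# Minimal sketch of 9-gram overlap decontamination (illustrative, not the released pipeline).
from typing import Iterable, List, Set

def ngrams(text: str, n: int = 9) -> Set[tuple]:
    """Return the set of word-level n-grams in a lower-cased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_test_index(test_prompts: Iterable[str], n: int = 9) -> Set[tuple]:
    """Collect every n-gram appearing in any benchmark/test prompt."""
    index: Set[tuple] = set()
    for prompt in test_prompts:
        index |= ngrams(prompt, n)
    return index

def decontaminate(train_prompts: List[str], test_index: Set[tuple], n: int = 9) -> List[str]:
    """Drop training prompts that share at least one n-gram with the test set."""
    return [p for p in train_prompts if ngrams(p, n).isdisjoint(test_index)]
```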
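The log-linear relationship used in the SFT-scaling analysis can be fit with ordinary least squares, as sketched below; the data points are placeholders rather than the paper's measurements, and only the functional form $z = a \log_2 x + b \log_2 y + c$ and the reported coefficients come from the paper.

```python
# Least-squares fit of z = a*log2(x) + b*log2(y) + c, where x = #prompts,
# y = #responses per prompt, z = benchmark score. The data below is illustrative only.
import numpy as np

x = np.array([20_000, 50_000, 100_000, 200_000, 200_000, 200_000])  # prompt counts (placeholder)
y = np.array([1, 1, 1, 1, 4, 8])                                    # responses per prompt (placeholder)
z = np.array([40.0, 44.5, 49.0, 54.0, 57.5, 60.0])                  # benchmark scores (placeholder)

# Design matrix [log2(x), log2(y), 1] and OLS solution for (a, b, c).
A = np.column_stack([np.log2(x), np.log2(y), np.ones_like(z)])
(a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}")  # the paper reports a≈4.831, b≈2.635
```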

Reinforcement Learning (RL) Strategy

The paper employs a stage-wise RL approach using the GRPO (Group Relative Policy Optimization) algorithm.

  • RL Algorithm (GRPO): Strict on-policy training is used, generating $G = 8$ or $16$ rollouts per question in a batch of 128 prompts, followed by a single policy gradient update. The token-level policy gradient loss is used, which rewards longer correct samples more and penalizes longer incorrect samples more harshly. The KL divergence term is removed. The GRPO objective is:

    $$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(q,a)\sim D,\ \{o_i\}_{i=1}^{G}\sim \pi_{\theta}(\cdot\mid q)} \left[ \frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} A_{i,t} \right]$$

    where $A_{i,t}$ is the token-level advantage, estimated as $\hat{A}_{i,t} = \frac{S_i - \mathrm{mean}(\{S_i\}_{i=1}^{G})}{\mathrm{std}(\{S_i\}_{i=1}^{G})}$, with $S_i$ being the reward for rollout $o_i$. A minimal sketch of this advantage computation appears after the training-pipeline description below.

  • RL Data Curation: High-quality math and code RL data from AceReason-Nemotron-1.0 is used. Prompt difficulty (not too easy, not too hard) and answer accuracy are crucial. For code RL, test case quality and coverage are vital.
  • Training Pipeline: A sequential, stage-wise RL approach is adopted, illustrated in the training pipeline figure below.

    Figure: Training Pipeline of AceReason-Nemotron 1.1.

1. Math-only Stage-1 (8K response length): Warm-up phase with simpler questions. Performance may initially dip and then recover. This stage is crucial for the model to learn to compress its reasoning paths.
2. Math-only Stage-2 (16K response length): The proportion of harder questions increases; substantial performance improvement is observed.
3. Math-only Stage-3 (24K response length): Filters out most simple questions, focusing on ~2,500 hard ones. Significant math benchmark improvement.
4. Code-only Stage-I (24K response length): Initiated after math RL for stability. Math RL pre-training helps the model generalize to code and handle "noisier" code rewards.
5. Code-only Stage-II (32K response length): An epoch-wise filtering strategy removes easy problems solved by the previous epoch's checkpoint.
6. Math-only Stage-4 (32K response length): Final math tuning with challenging questions.
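To make the GRPO objective above concrete, here is a minimal sketch of the group-normalized advantage and a token-level surrogate loss whose gradient corresponds to that objective under strict on-policy sampling; this is illustrative only, not the authors' training code, and the function signature is hypothetical.

```python
# Sketch of GRPO-style group-normalized advantages with a token-level surrogate loss
# (strict on-policy, no KL term). Names and signature are illustrative.
import torch

def grpo_loss(token_logprobs: list[torch.Tensor], rewards: torch.Tensor) -> torch.Tensor:
    """
    token_logprobs: list of G 1-D tensors; entry i holds log pi_theta(o_{i,t} | q, o_{i,<t}) per token.
    rewards: shape (G,), one scalar reward S_i per rollout.
    """
    # Group-normalized advantage A_i = (S_i - mean) / std, shared by every token of rollout i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Token-level aggregation: each token contributes equally, so longer correct rollouts
    # are rewarded more and longer incorrect rollouts are penalized more.
    total_tokens = sum(lp.numel() for lp in token_logprobs)
    loss = -sum((adv[i] * lp).sum() for i, lp in enumerate(token_logprobs)) / total_tokens
    return loss
```

Because training is strictly on-policy with a single policy-gradient update per rollout batch, no importance ratio, clipping, or KL penalty is needed in this sketch.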

  • RL Findings & Synergy with SFT:
    • SFT Initialization: Stronger SFT models generally lead to better final RL performance, though the performance gap narrows during RL.
    • Sampling Temperature: Crucial for balancing exploration and exploitation. A rule of thumb is to set the sampling temperature so that the temperature-adjusted entropy stays around 0.3 (a measurement sketch appears after this list). Too low (e.g., 0.6) leads to over-exploitation; too high (e.g., 1.0) leads to excessive exploration with low initial rewards. A temperature of 0.85 was found to be effective for their SFT model.
    • Importance of Math-only Stage-1 (8K): Even if it initially lowers performance, this stage is vital. It forces the model to compress reasoning, likely because SFT responses from larger teacher models (DeepSeek-R1-671B) can be too verbose for smaller models. Skipping this stage leads to suboptimal outcomes in later stages. Training this stage until benchmark accuracies fully recover isn't necessary; early transition to Stage-2 can be more beneficial.
    • Overlong Filtering: This strategy masks out samples that exceed the response-length budget instead of assigning them a negative reward (a sketch appears after this list).
      • It is beneficial with short token limits (e.g., 8K, 16K), where many samples would otherwise be truncated and penalized, introducing noise.
      • The advantage diminishes at 24K.
      • At 32K, not using overlong filtering (i.e., penalizing overlong responses) can be better, as it encourages token efficiency. The model trained without overlong filtering at Stage-4 (32K) performed better, even when the inference length was extended to 64K.
    • Math-to-Code Generalization: Math-only RL significantly boosts code benchmark performance, even when starting from strong SFT models. Most coding improvement comes from Math-only Stage-2.
    • Pass@K Improvements: RL consistently improves pass@K scores (for K from 8 to 128) over strong SFT models, even on models much stronger than the DeepSeek-R1-Distill-Qwen used in prior work. This indicates RL helps the model find correct solutions more reliably across multiple generation attempts.
    • Solving Hard Problems: RL enables the model to solve a "long tail" of hard problems that the SFT model fails on, even with many attempts.
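One way to operationalize the ~0.3 temperature-adjusted entropy rule of thumb is to measure the mean per-token entropy of the temperature-scaled policy on a few rollouts and sweep candidate temperatures; the helper below is a hypothetical sketch, not the authors' tooling.

```python
# Sketch: estimate the temperature-adjusted per-token entropy of the policy on rollout
# logits, to help choose a sampling temperature (target entropy ~0.3). Illustrative only.
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, temperature: float) -> float:
    """logits: (num_tokens, vocab_size) pre-softmax scores gathered from rollouts."""
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-token entropy in nats
    return entropy.mean().item()

# Example usage: sweep candidate temperatures and keep the one closest to the 0.3 target.
# logits = ...  # collect from a few rollouts of the SFT model
# best_T = min([0.6, 0.7, 0.85, 1.0], key=lambda T: abs(mean_token_entropy(logits, T) - 0.3))
```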
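Overlong filtering, as described above, can be implemented by zeroing the loss contribution of rollouts truncated at the length budget instead of penalizing them; the sketch below is a hypothetical illustration (function and variable names are assumptions, not the paper's code).

```python
# Sketch of overlong filtering: rollouts truncated at the length budget are masked out of
# the loss instead of receiving a negative reward. Names are hypothetical.
import torch

def overlong_mask(rewards: torch.Tensor, lengths: torch.Tensor, budget: int,
                  use_filtering: bool) -> torch.Tensor:
    """
    rewards: (G,) per-rollout rewards; lengths: (G,) generated token counts.
    Returns a (G,) loss mask: with filtering on, truncated rollouts are dropped from the
    update; with filtering off, they stay in and their low reward discourages verbosity.
    """
    truncated = lengths >= budget
    if use_filtering:
        return (~truncated).float()
    return torch.ones_like(rewards)
```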

Evaluation and Main Results

  • Benchmarks: AIME24/25, MATH500, HMMT2025, BRUMO2025 (math); EvalPlus, LiveCodeBench v5/v6 (code). Inference uses temperature 0.6, top_p 0.95, and a maximum length of 32K. Results are reported as pass@1 averaged over n runs (avg@n); a small estimator sketch follows this list.
  • Performance: AceReason-Nemotron-1.1-7B significantly outperforms its SFT starting point (Our SFT-7B) and other 7B-scale models.
    • AceReason-Nemotron-1.1-7B achieves 72.6% on AIME24 (avg@64) and 64.8% on AIME25 (avg@64).
    • On LiveCodeBench, it scores 57.2% (v5, avg@8) and 52.1% (v6, avg@8).
    • The SFT model itself ("Our SFT-7B") is stronger than baselines like DeepSeek-R1-Distill-Qwen-7B, achieving 62.0% on AIME24 and 48.4% on AIME25.
    • The RL recipe boosts the SFT model by +10.6% on AIME24, +16.4% on AIME25, +8.4% on LiveCodeBench v5, and +8.3% on LiveCodeBench v6.
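For reference, avg@n is pass@1 averaged over n sampled generations per problem, and pass@K (discussed in the RL findings) is commonly estimated with the unbiased combinatorial estimator; the sketch below assumes correctness judgments are already available and is not tied to the paper's evaluation harness.

```python
# Sketch: avg@n (mean pass@1 over n samples) and the commonly used unbiased pass@k
# estimator from n samples with c correct. Correctness judging is abstracted away.
from math import comb

def avg_at_n(correct_flags: list[bool]) -> float:
    """pass@1 averaged over n independent generations for one problem."""
    return sum(correct_flags) / len(correct_flags)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled solutions is correct) given c correct out of n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: one problem with 64 generations, 40 judged correct.
print(avg_at_n([True] * 40 + [False] * 24))  # 0.625
print(pass_at_k(n=64, c=40, k=8))            # close to 1.0, since many samples are correct
```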

Practical Implementation Considerations

  • SFT Data: Focus on prompt diversity first. If collecting diverse prompts is hard, increase responses per prompt (generated by a capable teacher model). Train for ~5 epochs.
  • RL Training:
    • Use on-policy GRPO with token-level rewards.
    • The initial RL stage (e.g., Math-only Stage-1 with 8K context) is important for reasoning compression, even if immediate benchmark gains aren't seen.
    • Carefully tune sampling temperature during RL policy rollout to maintain adjusted entropy around 0.3. This might be higher than the optimal inference temperature.
    • Apply overlong filtering (masking truncated samples) in early RL stages with shorter context limits. Consider removing it in later stages with longer context limits, so that overlong responses are penalized, which encourages conciseness.
    • A staged curriculum for RL, gradually increasing response length and problem difficulty, is effective.
    • Math-focused RL can generalize and improve coding abilities.

The paper provides a detailed recipe for post-training LLMs for reasoning tasks, emphasizing the systematic study of SFT data scaling and RL training dynamics. The released model and data facilitate further research and application.