AceReason-Nemotron 1.1

Updated 1 July 2025
  • AceReason-Nemotron 1.1 is a state-of-the-art 7B-parameter language model that advances mathematical and code reasoning through a systematic combination of supervised fine-tuning and multi-stage reinforcement learning.
  • Key technical innovations include empirical insights into SFT data scaling, an optimized staged RL curriculum, and an empirically derived RL temperature tuning rule based on entropy.
  • The model achieves significant performance gains on standard math and programming benchmarks compared to prior models and is openly released with datasets and evaluation code.

AceReason-Nemotron 1.1 is a state-of-the-art reasoning LLM designed to advance mathematical and code reasoning by systematically combining supervised fine-tuning (SFT) with reinforcement learning (RL). It establishes new benchmarks for 7B-parameter models on challenging mathematical and programming tasks, demonstrating both architectural and methodological enhancements. The model and associated datasets are openly available for research and enterprise use.

1. SFT–RL Synergy in Model Development

AceReason-Nemotron 1.1 is constructed by first conducting supervised fine-tuning on a large, diverse, and carefully curated dataset containing both math and code prompts, followed by multi-stage reinforcement learning. SFT leverages responses from strong frontier models and emphasizes prompt diversity and reasoning coverage. Extensive empirical analysis confirms that stronger SFT checkpoints yield correspondingly improved final performance after RL, provided RL training is effective and appropriately tuned. RL further narrows pre-existing performance gaps between different SFT initializations while consistently enhancing the model’s capabilities even when starting from a highly optimized SFT foundation.

The multi-stage RL process is designed as a curriculum. It begins with math-only RL (8k, 16k, 24k tokens), moves to code-only RL (24k, 32k), and concludes with a final math RL phase (32k). This staged approach enables the model to first compress reasoning strategies within shorter contexts before extending to longer-form solutions, resulting in gains across both efficiency and accuracy.
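
As a concrete illustration, the following minimal sketch expresses this curriculum as configuration. Only the domains and response-length caps follow the description above; the field names, the helper run_curriculum, and the interpretation of "8k" as 8 × 1024 tokens are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of the staged RL curriculum described above.
# Only the domains and response-length caps follow the text; stage numbering,
# field names, and the "1k = 1024 tokens" convention are assumptions.
RL_CURRICULUM = [
    {"stage": 1, "domain": "math", "max_response_tokens": 8 * 1024},
    {"stage": 2, "domain": "math", "max_response_tokens": 16 * 1024},
    {"stage": 3, "domain": "math", "max_response_tokens": 24 * 1024},
    {"stage": 4, "domain": "code", "max_response_tokens": 24 * 1024},
    {"stage": 5, "domain": "code", "max_response_tokens": 32 * 1024},
    {"stage": 6, "domain": "math", "max_response_tokens": 32 * 1024},
]


def run_curriculum(train_rl_stage, checkpoint):
    """Run each RL stage in order, warm-starting from the previous checkpoint.

    `train_rl_stage` is a hypothetical callable
    (checkpoint, domain, max_response_tokens) -> new checkpoint;
    it stands in for whatever RL trainer is actually used.
    """
    for cfg in RL_CURRICULUM:
        checkpoint = train_rl_stage(
            checkpoint,
            domain=cfg["domain"],
            max_response_tokens=cfg["max_response_tokens"],
        )
    return checkpoint
```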

2. Scaling Effects in SFT Data and Training

Performance improvements in AceReason-Nemotron 1.1 are closely linked to SFT data scaling along two axes: the number of unique prompts and the number of generated responses per prompt. A multiple linear regression analysis shows that increasing unique-prompt diversity contributes a larger accuracy gain than increasing the number of responses per prompt (regression coefficients a = 4.831 for unique prompts vs. b = 2.635 for responses per prompt). Furthermore, accuracy rises through the first five epochs of SFT and largely plateaus by the sixth; even these later epochs remain beneficial, because the continued training mitigates exposure bias in autoregressive generation.
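
To make the regression concrete, the sketch below fits a two-variable ordinary-least-squares model of benchmark accuracy against the two scaling axes. The log scaling of the axes and the synthetic run data are assumptions for illustration only; what mirrors the analysis above is the form "accuracy ≈ a·(prompt axis) + b·(responses axis) + c" with separate coefficients per axis.

```python
import numpy as np

# Hypothetical SFT scaling log: (unique prompts, responses per prompt, accuracy %).
# The numbers are made up for illustration; only the regression form is the point.
runs = np.array([
    [20_000, 2, 48.0],
    [40_000, 2, 52.5],
    [80_000, 2, 57.0],
    [80_000, 4, 59.0],
    [160_000, 4, 63.5],
])

# Design matrix: log2 of each scaling axis plus an intercept term.
X = np.column_stack([
    np.log2(runs[:, 0]),   # unique-prompt axis
    np.log2(runs[:, 1]),   # responses-per-prompt axis
    np.ones(len(runs)),    # intercept
])
y = runs[:, 2]

# Ordinary least squares: a weights the prompt axis, b the responses axis.
(a, b, c), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"a (unique prompts) = {a:.3f}, b (responses/prompt) = {b:.3f}")
```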

Overfitting at the SFT stage, rather than harming performance, is found to be helpful for producing more robust reasoning models, as it enhances the model’s resilience to exposure bias. These findings underscore that broad and diverse SFT data provide the essential foundation for subsequent RL-based improvement.

3. Reinforcement Learning: Methods and Temperature Tuning

AceReason-Nemotron 1.1 employs Group Relative Policy Optimization (GRPO), a policy gradient algorithm notable for not requiring a critic model and using group-normalized advantages computed from multiple rollouts per prompt. RL training utilizes a deterministic rule-based verification scheme: in math, rewards are binary based on exact answer matching; in code, passing all test cases is required.
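
A compact sketch of the group-normalized advantage at the heart of GRPO, together with a binary verification-style reward for math, is shown below. The function names and the exact normalization (mean and standard deviation over the rollout group) follow the common description of GRPO and should be read as an illustrative assumption, not the paper's exact implementation.

```python
import numpy as np


def math_reward(predicted_answer: str, reference_answer: str) -> float:
    """Binary rule-based reward: 1.0 on exact answer match, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each rollout's reward against its group
    (all rollouts sampled for the same prompt), removing the need for a critic."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: 8 rollouts for one math prompt, 3 of which match the reference answer.
rollout_answers = ["42", "41", "42", "7", "42", "13", "40", "39"]
rewards = np.array([math_reward(a, "42") for a in rollout_answers])
advantages = group_normalized_advantages(rewards)
print(advantages)  # correct rollouts get positive advantage, incorrect get negative
```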

Sampling temperature is a critical hyperparameter for effective RL, dictating the degree of exploration versus exploitation in rollout generation. The empirical guidance established in the paper is to adjust the RL training temperature so that the temperature-adjusted model entropy falls within the range 0.26–0.38, ideally near 0.3; settings outside this range either undermine exploration or introduce unproductive stochasticity. In practice, this balance was achieved at T = 0.85, while lower (T = 0.6) and higher (T = 1.0) temperatures consistently led to suboptimal learning outcomes due to entropy collapse and reward dilution, respectively. The entropy at temperature T is

H_T = -\sum_i p_i^{(T)} \log p_i^{(T)}

where p_i^{(T)} = p_i^{1/T} / \sum_j p_j^{1/T}.
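
The formula above can be checked directly. The sketch below computes the temperature-adjusted distribution and its entropy from a vector of base probabilities and scans a few candidate temperatures against the 0.26–0.38 target band; the probability vector and the candidate temperatures are arbitrary illustrative values, not model outputs.

```python
import numpy as np


def temperature_adjusted_entropy(probs: np.ndarray, T: float) -> float:
    """Entropy H_T of the temperature-adjusted distribution p_i^(T) ∝ p_i^(1/T)."""
    p_t = probs ** (1.0 / T)
    p_t /= p_t.sum()
    nonzero = p_t > 0  # guard against log(0) for zero-probability entries
    return float(-(p_t[nonzero] * np.log(p_t[nonzero])).sum())


# Illustrative next-token distribution (not taken from the model).
probs = np.array([0.70, 0.15, 0.08, 0.04, 0.02, 0.01])

# Inspect how the adjusted entropy moves relative to the 0.26-0.38 target band.
for T in (0.6, 0.85, 1.0, 1.2):
    print(f"T={T:.2f}  H_T={temperature_adjusted_entropy(probs, T):.3f}")
```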

4. Benchmark Results and Empirical Gains

AceReason-Nemotron 1.1 achieves substantial improvements over its predecessor and contemporary Qwen2.5-7B-based models on major math and programming benchmarks:

Model                       AIME24   AIME25   LiveCodeBench v5   LiveCodeBench v6
AceReason-Nemotron 1.0-7B   69.0     53.6     51.8               44.1
AceReason-Nemotron 1.1-7B   72.6     64.8     57.2               52.1
Performance Gain            +3.6     +11.2    +5.4               +8.0

The staged-RL 1.1 model consistently outperforms its SFT-only counterpart and earlier versions, by margins of up to 16.4% on AIME25 and 8.4% on LiveCodeBench v5, with similar gains on pass@K metrics even at large K. It achieves the highest accuracy on AIME25 and LiveCodeBench v6 among publicly available 7B models. Math-only RL phases also improve code reasoning performance, evidencing strong cross-domain generalization.
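
For the pass@K figures mentioned above, the standard unbiased estimator (compute pass@k from n sampled solutions of which c are verified correct) is the common choice; the sketch below implements that estimator under the assumption that this is the metric definition used, since the text does not spell it out.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a group of k: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 64 sampled solutions, 20 verified correct, evaluated at k = 8.
print(round(pass_at_k(n=64, c=20, k=8), 4))
```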

5. Technical Innovations and Experimental Insights

Key methodological contributions of AceReason-Nemotron 1.1 include:

  1. A systematic analysis quantifying how final RL performance depends on SFT checkpoint strength, showing that performance differences shrink but do not disappear after RL.
  2. Explicit quantification of data-scaling effects, demonstrating that prompt diversity is the primary driver of gains at the SFT stage and that “helpful overfitting” improves robustness to exposure bias.
  3. An empirically derived RL temperature tuning rule, maintaining temperature-adjusted entropy near 0.3 for optimal exploration-exploitation tradeoff.
  4. A stage-wise curriculum for RL, separating domains and gradually increasing allowable output lengths, which is shown to be more effective than non-curriculum or abrupt length increases.
  5. Clarification of overlong filtering's role, with guidance for its application depending on token budget and RL stage.
  6. Demonstration that math-only RL not only improves math reasoning but provides measurable transfer to code reasoning, revealing a significant generalization effect of RL-based curriculum learning.
  7. Open sourcing of weights, datasets, and evaluation code to promote reproducibility.

6. Open Resource Release and Community Impact

AceReason-Nemotron 1.1 is openly released at https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B along with accompanying datasets and evaluation instructions. This enables direct replication of results and fair comparison with other models. The release continues the project's commitment to open scientific progress and allows for broader adoption and scrutiny.

The model’s systematic approach—integrating large-scale, well-scaled SFT, staged RL, temperature and entropy control, and open evaluation—has set a reproducible and effective standard for advancing math and code reasoning capabilities in LLMs with accessible hardware footprints.