AceReason-Nemotron-1.1 7B: Math & Code Reasoning
- AceReason-Nemotron-1.1 7B is an open-source large language model with 7 billion parameters that leverages a synergy of supervised fine-tuning and reinforcement learning to excel in mathematical and code reasoning tasks.
- It employs a two-stage development strategy that first uses high-quality SFT for strong reasoning and then applies GRPO-based RL with precise entropy management to optimize long-context outputs.
- Benchmark results demonstrate state-of-the-art performance on rigorous math and coding challenges, making it a replicable and accessible model for advancing verifiable AI research.
AceReason-Nemotron-1.1 7B is a 7-billion-parameter open-source LLM designed to advance state-of-the-art performance on mathematical and code reasoning tasks. Developed through a carefully sequenced combination of supervised fine-tuning (SFT) and large-scale reinforcement learning (RL), the model is notable for its methodologically transparent training recipe, strong benchmark results, and the public release of its weights and training data for academic and professional use.
1. Model Foundations and Development Strategy
AceReason-Nemotron-1.1 7B is built upon the Qwen2.5-7B backbone and integrates model development insights from previous iterations (e.g., AceReason-Nemotron-1.0) and related models such as DeepSeek-R1-Distill-Qwen-7B. The training pipeline is characterized by a two-stage process. First, high-quality SFT establishes a strong reasoning capacity using distilled responses from larger models (e.g., DeepSeek-R1-671B). Following SFT, the model undergoes RL with verification-based rewards, utilizing both rule-based answer checking (for mathematics) and test-case validation (for code problems).
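As an illustration of the first stage, each SFT example pairs a curated prompt with a long chain-of-thought response distilled from the teacher model. The sketch below shows one plausible record layout; the field names and the response formatting are assumptions for illustration, not the released dataset schema.

```python
# Illustrative shape of one SFT training record distilled from a stronger
# teacher model. Field names and the <think> delimiters are assumptions for
# this sketch, not the released dataset schema.
sft_example = {
    "prompt": "Find the number of ordered pairs (a, b) of positive integers such that ...",
    "response": (
        "<think> ... long chain-of-thought reasoning distilled from the teacher ... </think>\n"
        "The final answer is \\boxed{42}."
    ),
    "domain": "math",                 # "math" or "code"
    "source": "teacher-distillation",
}
```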
A key methodological contribution is the characterization and exploitation of the synergy between SFT and RL. Empirical evidence shows that a stronger SFT initialization nearly always yields better RL outcomes, provided RL training is conducted with careful entropy management. At the same time, RL substantially narrows the performance gap between strong and weak SFT initializations, underscoring the synergistic effect.
SFT data curation leverages two scaling strategies:
- Scaling the Number of Prompts: Increasing unique, challenging prompts has the greatest impact on downstream benchmark improvement.
- Scaling the Number of Responses per Prompt: Exposing the model to several solution pathways per problem aids generalization but has a relatively smaller effect compared to prompt diversity.
Both axes yield measurable performance gains, but prompt-count scaling dominates: the fitted linear-regression coefficient of benchmark performance against log2(#prompts) is larger than that against log2(#responses).
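A minimal sketch of this scaling comparison, assuming benchmark accuracies from an ablation grid are available; the values below are dummy placeholders, not reported results.

```python
# Regress benchmark accuracy on log2(#prompts) and log2(#responses per prompt)
# and compare the two slopes. The tuples are placeholders, NOT real measurements.
import numpy as np

# (num_prompts, num_responses_per_prompt, benchmark_accuracy) -- placeholder values
runs = [
    (20_000, 4, 0.48),
    (40_000, 4, 0.52),
    (80_000, 4, 0.57),
    (80_000, 8, 0.58),
    (80_000, 16, 0.59),
]

X = np.array([[np.log2(p), np.log2(r), 1.0] for p, r, _ in runs])
y = np.array([acc for _, _, acc in runs])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
slope_prompts, slope_responses, intercept = coef
print(f"slope vs log2(#prompts):   {slope_prompts:.3f}")
print(f"slope vs log2(#responses): {slope_responses:.3f}")
```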
2. Reinforcement Learning Process and Technical Foundations
The RL phase uses the Group Relative Policy Optimization (GRPO) framework, which enables sample-efficient and stable RL for long-form chain-of-thought tasks without requiring a value function or separate critic network. The GRPO loss can be formalized as

$$\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}_{q,\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;\operatorname{clip}\big(r_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\,\hat{A}_{i,t}\Big)\right],$$

where $r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the token-level importance ratio, rollouts $\{o_i\}$ are generated from the current policy, and the token-level advantage $\hat{A}_{i,t}=\big(R_i-\operatorname{mean}(\{R_j\}_{j=1}^{G})\big)/\operatorname{std}(\{R_j\}_{j=1}^{G})$ is estimated by normalizing the group reward.
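A minimal PyTorch sketch of this objective, assuming per-token log-probabilities for a group of rollouts have already been gathered; function and argument names are illustrative, not the released training code.

```python
# For one prompt, G sampled rollouts receive scalar verifiable rewards; the
# token-level advantage is the group-normalized reward, and the loss is the
# clipped importance-weighted policy gradient, with no critic network.
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """
    logp_new, logp_old: [G, T] per-token log-probs under current / rollout policy
    rewards:            [G]    verifiable scalar reward per rollout (0 or 1)
    mask:               [G, T] 1 for generated tokens, 0 for padding
    """
    # Group-normalized advantage, broadcast to every token of the rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # [G]
    adv = adv[:, None]                                             # [G, 1]

    ratio = torch.exp(logp_new - logp_old)                         # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = torch.minimum(unclipped, clipped)

    # Average over generated tokens, then negate for gradient-descent minimization.
    return -(per_token * mask).sum() / mask.sum()
```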
A central technical insight is the management of exploration and exploitation during RL:
- Output entropy is regulated through the sampling temperature.
- Optimal progress is found when the temperature-adjusted entropy is kept around 0.3.
- Entropy that is too low stifles exploration, while entropy that is too high produces unproductive samples.
- Empirically, this is achieved with a temperature near 0.85 for Qwen2.5 SFT initializations.
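A minimal sketch of this entropy-targeting heuristic, assuming access to the pre-softmax logits at the sampled token positions; the helper names, step size, and tolerance band are assumptions for illustration.

```python
# Measure the average per-token entropy of the temperature-scaled sampling
# distribution and nudge the temperature toward a target entropy of ~0.3 nats.
import torch
import torch.nn.functional as F

def mean_token_entropy(logits, temperature):
    """logits: [N, V] pre-softmax scores at the sampled token positions."""
    probs = F.softmax(logits / temperature, dim=-1)
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    return -(probs * log_probs).sum(dim=-1).mean()

def adjust_temperature(logits, temperature, target=0.3, step=0.05):
    # Tolerance of +/-0.05 and step of 0.05 are illustrative choices.
    ent = mean_token_entropy(logits, temperature).item()
    if ent < target - 0.05:      # too exploitative: raise temperature
        temperature += step
    elif ent > target + 0.05:    # too exploratory: lower temperature
        temperature -= step
    return temperature, ent
```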
RL is scheduled as a curriculum: initial math stages are conducted with progressively increasing output lengths (e.g., 8K to 32K tokens), followed by extended code RL and finally a return to long-form math RL. Response length is increased in stages to avoid convergence issues and facilitate adaptation to complex tasks.
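One way to encode such a schedule is a simple stage list; the intermediate length steps and stage names below are illustrative assumptions, not the exact published configuration.

```python
# Illustrative encoding of the staged RL curriculum described above.
rl_curriculum = [
    {"stage": "math-rl-8k",    "domain": "math", "max_response_tokens": 8_192},
    {"stage": "math-rl-16k",   "domain": "math", "max_response_tokens": 16_384},
    {"stage": "math-rl-24k",   "domain": "math", "max_response_tokens": 24_576},
    {"stage": "math-rl-32k",   "domain": "math", "max_response_tokens": 32_768},
    {"stage": "code-rl",       "domain": "code", "max_response_tokens": 32_768},
    {"stage": "math-rl-final", "domain": "math", "max_response_tokens": 32_768},
]

for cfg in rl_curriculum:
    print(f"{cfg['stage']}: {cfg['domain']} RL, up to {cfg['max_response_tokens']} response tokens")
    # run_grpo_stage(model, cfg)  # hypothetical call into the RL trainer
```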
3. Data Engineering, Curation, and Verification
The data pipeline incorporates a broad and challenging collection of prompts for both mathematics and code reasoning:
- Math Data is curated from sources such as DeepScaler and NuminaMath, filtered to eliminate ambiguous, short, poorly formulated, or contaminated problems. Only problems verified as solvable by state-of-the-art models (e.g., DeepSeek-R1) are retained.
- Code Data is drawn from competitive programming platforms. Problems with non-deterministic, weak, or ambiguous test cases are excluded. Only single-solution, strictly test-case-verified problems are retained.
- Problems are further categorized by difficulty and filtered to maintain a focus on hard-to-solve examples, which drive progress in later RL stages.
Reward verification is strictly binary—either a math answer matches rule-based extraction, or code passes all test cases. The pipeline is explicitly designed to eliminate reward signal "noise", mitigating the risk of destabilizing parameter updates during RL.
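A minimal sketch of such binary verification, assuming boxed math answers and stdin/stdout code problems; the helper names are hypothetical, and the released pipeline is more elaborate.

```python
# Reward 1.0 only if the extracted math answer matches the reference, or the
# generated program passes every test case; otherwise reward 0.0.
import re
import subprocess
import sys

def math_reward(response: str, reference: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", response)       # rule-based extraction
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def code_reward(program: str, test_cases: list[tuple[str, str]], timeout: int = 5) -> float:
    for stdin, expected in test_cases:
        try:
            run = subprocess.run([sys.executable, "-c", program],
                                 input=stdin, capture_output=True,
                                 text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return 0.0
        if run.returncode != 0 or run.stdout.strip() != expected.strip():
            return 0.0
    return 1.0                                               # all test cases passed
```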
4. Performance and Empirical Results
AceReason-Nemotron-1.1 7B achieves state-of-the-art performance among Qwen2.5-7B-based models on key benchmarks:
| Model | AIME24 (%) | AIME25 (%) | LiveCodeBench V5 (%) | LiveCodeBench V6 (%) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 39.0 | 37.6 | 34.1 |
| AceReason-Nemotron-1.0-7B | 69.0 | 53.6 | 51.8 | 44.1 |
| AceReason-Nemotron-1.1-7B | 72.6 | 64.8 | 57.2 | 52.1 |
AceReason-Nemotron-1.1 7B outperforms its predecessor and competing Qwen2.5-7B-based models, especially on contamination-free and long-form reasoning tasks (e.g., AIME25, LiveCodeBench V6). The improvement is most pronounced on long-tail, complex math and code problems, where the model maintains robust pass@k accuracy.
Findings confirm that well-configured RL improves not only pass@1 accuracy but also pass@k at large k, contradicting earlier claims that RL merely improves best-of-n selection without expanding solution coverage.
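For reference, coverage results of this kind are typically reported with the standard unbiased pass@k estimator, sketched below; the example arguments are arbitrary and do not correspond to reported numbers.

```python
# Unbiased pass@k estimate: given n samples per problem of which c are correct,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 samples per problem, 12 correct -> estimated pass@8 (illustrative only)
print(pass_at_k(n=64, c=12, k=8))
```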
5. Technical Advances and Methodological Insights
Significant technical advances underpin AceReason-Nemotron-1.1 7B:
- SFT and RL Synergy: SFT initialization quality is consistently predictive of final RL performance, and RL further narrows the SFT gap.
- RL Curricula: Progressive length extension and staged domain focus (math preceding code) enable the model to adapt its reasoning strategies dynamically.
- Entropy Management: Directly monitoring and tuning generation entropy via temperature is crucial for productive RL; a target entropy of 0.3 is an effective heuristic.
- Validation Practices: Response length filtering in early RL prevents degenerate behaviors, while its omission in late RL leads to more concise, practical outputs.
- Cross-Domain Transfer: RL on math tasks generalizes to improved code performance, supporting the hypothesis that complex symbolic reasoning underpins both domains.
6. Release, Accessibility, and Ecosystem Support
AceReason-Nemotron-1.1 7B and its corresponding SFT/RL training data are openly released for academic and research use at:
https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B
The release includes model weights, SFT datasets, and evaluation scripts, supporting direct reproducibility and extension.
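A minimal usage sketch via Hugging Face Transformers, assuming the checkpoint ships a chat template; the generation settings are illustrative defaults rather than the official evaluation configuration.

```python
# Load the released checkpoint and sample one response. Settings such as
# max_new_tokens are illustrative; temperature 0.85 follows the value noted above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/AceReason-Nemotron-1.1-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Find all integer solutions of x^2 - 5x + 6 = 0."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.85, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```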
7. Broader Context and Impact
AceReason-Nemotron-1.1 7B marks a methodological advance in constructing compact, high-performance reasoning models. It demonstrates that a carefully engineered sequencing of SFT and RL—particularly with attention to data scales, sampling entropy, and curriculum design—allows relatively small models to rival larger or proprietary systems on rigorous, long-context reasoning tasks in both mathematics and code. This provides a replicable reference point for subsequent research in scalable LLM reasoning benchmarks and downstream applications requiring verifiable, high-confidence outputs.