Llama-Nemotron: Efficient Reasoning Models (2505.00949v4)
Abstract: We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.
Summary
- The paper presents a family of open, heterogeneous LLMs optimized for reasoning via a five-stage training process, incorporating NAS, synthetic data, and RL-based alignment.
- It details innovative methodologies such as block-wise distillation, FFN fusion, and a dynamic toggle for switching between standard chat and detailed multi-step reasoning modes.
- The models achieve state-of-the-art performance on various benchmarks while enabling flexible deployment and supporting open research through permissive licensing and shared codebases.
The paper "Llama-Nemotron: Efficient Reasoning Models" (2505.00949) introduces a new family of open, heterogeneous LLMs designed for exceptional reasoning capabilities and inference efficiency. The Llama-Nemotron (LN) series includes three sizes: Nano (8B), Super (49B), and Ultra (253B). These models are derived from Llama 3 variants and optimized through a multi-stage training process focusing on efficiency, knowledge transfer, and reasoning alignment. A notable feature is their support for a dynamic reasoning toggle ("detailed thinking on/off") at inference time, allowing users to switch between standard chat and detailed multi-step reasoning modes without requiring separate models.
The models are released under the commercially permissive NVIDIA Open Model License Agreement, together with the complete post-training dataset and the training codebases (NeMo, NeMo-Aligner, Megatron-LM), to support open research and development in reasoning models.
The core training procedure consists of five stages:
- Inference Optimization: The LN-Super and LN-Ultra models are optimized for hardware efficiency using the Puzzle framework (2411.19146), a neural architecture search (NAS) method. Puzzle transforms larger Llama 3 models by applying block-wise local distillation to create a library of alternative transformer blocks with varying accuracy-efficiency trade-offs. Key block variants include attention removal and variable FFN dimensions. A mixed-integer programming (MIP) solver selects blocks per layer to meet specific deployment constraints (latency, memory, throughput); a simplified block-selection sketch follows this stage's bullets. For LN-Ultra, an additional technique called FFN Fusion (2503.18908) is used to reduce sequential depth by merging consecutive FFN layers that appear after attention removal.
- LN-Super: Optimized for a single NVIDIA H100 GPU (TP1), achieving a 5x throughput speedup over its Llama 3.3 70B base at batch size 256 and TP1.
- LN-Ultra: Optimized for a full 8xH100 node, achieving a 1.71x latency improvement over its Llama 3.1 405B base and supporting up to 3M FP8 cached tokens.
- Inference efficiency is further boosted by implementing FP8 generation in vLLM, achieving up to 1.8x speedup compared to BF16 generation.
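As referenced above, the block-selection step can be pictured as a constrained assignment problem: choose one block variant per layer so that estimated quality is maximized under a deployment budget. The sketch below is not the Puzzle solver (Puzzle formulates this as a mixed-integer program over measured accuracy/latency trade-offs); it is a simplified knapsack-style dynamic program over an invented block library, intended only to illustrate the shape of the optimization.

```python
# Hypothetical per-layer block library. Each variant carries an estimated
# quality score (in the real pipeline, derived from block-wise local
# distillation) and a latency cost (from profiling on target hardware).
# All numbers here are invented for illustration.
layer_library = [
    [("full_attn_full_ffn",  1.00, 3.0),
     ("no_attn_full_ffn",    0.97, 2.1),
     ("full_attn_small_ffn", 0.95, 1.8),
     ("no_attn_small_ffn",   0.90, 1.2)]
    for _ in range(8)  # toy model with 8 layers
]

def select_blocks(library, latency_budget_ms, step=0.1):
    """Pick one block variant per layer to maximize total quality subject to a
    latency budget. A knapsack-style DP stand-in for Puzzle's MIP solver."""
    buckets = int(round(latency_budget_ms / step))
    dp = {0: (0.0, [])}  # buckets used -> (best quality, chosen variants)
    for variants in library:
        new_dp = {}
        for used, (quality_so_far, choices) in dp.items():
            for name, quality, latency_ms in variants:
                nb = used + int(round(latency_ms / step))
                if nb > buckets:
                    continue
                candidate = (quality_so_far + quality, choices + [name])
                if nb not in new_dp or candidate[0] > new_dp[nb][0]:
                    new_dp[nb] = candidate
        dp = new_dp
    if not dp:
        raise ValueError("latency budget too tight for any block configuration")
    return max(dp.values(), key=lambda entry: entry[0])

quality, plan = select_blocks(layer_library, latency_budget_ms=16.0)
print(f"total quality score: {quality:.2f}")
print("per-layer choices:", plan)
```

In the actual pipeline the per-block quality scores come from block-wise local distillation and the costs from profiling on the target hardware; FFN Fusion is a separate, later step that merges consecutive FFN blocks left behind once attention is removed.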
- Post-NAS Training (Knowledge Distillation and Continued Pretraining): After architecture optimization, the models undergo additional training to recover any quality lost during NAS and to improve inter-block compatibility; a minimal distillation-loss sketch follows this stage's bullets.
- LN-Super: Distillation on 40B tokens.
- LN-Ultra: Distillation on 65B tokens followed by 88B tokens of continued pretraining on a general dataset. This stage helps LN-Ultra match or surpass its Llama 3.1 405B base on key benchmarks.
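The paper does not spell out the exact distillation objective used in this recovery phase; a common choice, sketched below, is a KL divergence between the teacher's and student's next-token distributions computed on the same input tokens. The temperature, loss scaling, and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over next-token distributions.

    Both tensors have shape [batch, seq_len, vocab] and come from the same
    input tokens; the teacher here would be the pre-NAS Llama 3 parent model.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # Conventional T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy shapes: 2 sequences of 16 tokens over a 32k-entry vocabulary.
student_logits = torch.randn(2, 16, 32000, requires_grad=True)
with torch.no_grad():
    teacher_logits = torch.randn(2, 16, 32000)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```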
- Supervised Fine-Tuning (SFT): This stage is crucial for transferring reasoning capabilities and for training the "detailed thinking on/off" toggle. Models are fine-tuned with a token-level cross-entropy loss on a curated dataset.
- Data includes a mix of standard instruction data and reasoning traces synthesized from strong teacher models like DeepSeek-R1 (2501.12948), tagged with "detailed thinking on" or "detailed thinking off".
- Reasoning data is collected and generated for specific domains:
- Math: Problems are extracted from Art of Problem Solving forums, classified, and filtered to remove proofs, multiple-choice questions, binary-answer questions, and invalid problems; the remaining problems are solved multiple times by teacher models (DeepSeek-R1, Qwen2.5-Math) to produce reasoning traces, and the resulting solutions are filtered for correctness.
- Code: Unique competitive programming questions from sources like TACO, APPS, CodeContests, and CodeForces are collected and decontaminated. DeepSeek-R1 generates multiple solutions with reasoning steps enclosed in <think> tags. Solutions are post-processed for correctness and format. Large-scale data was found crucial for performance.
- Science: Open-ended and MCQ questions from StackOverflow and synthetic generation are collected and decontaminated against benchmarks (GPQA, MMLU). DeepSeek-R1 generates reasoning traces.
- General: Synthetic and real-world prompts are used, with responses generated by DeepSeek-R1 and filtered using a reward model.
- Non-reasoning data is created by generating paired responses for prompts from the reasoning dataset using Llama-3.1-Nemotron-70B-Instruct, tagged with "detailed thinking off".
- A Feedback-Edit Inference-Time-Scaling system is used to generate high-quality general-domain responses for the non-reasoning chat data.
- Model-specific SFT procedures vary learning rates, epochs, and data blends to optimize performance for each model size.
- Large-Scale Reinforcement Learning (RL) for Reasoning: This stage is applied to LN-Ultra to push its reasoning capabilities beyond its teacher model. The GRPO algorithm (2402.03300) is used, focusing on scientific reasoning (GPQA-Diamond); a simplified reward sketch follows the concluding paragraph below.
- Training leverages accuracy rewards (comparing policy predictions to ground truth) and format rewards (ensuring correct use of <think> tags).
- Data filtering discards prompts easily solved by an intermediate model (LN-Super) to increase training difficulty.
- Curriculum training using a progressive batching strategy based on estimated problem difficulty (pass rate) is used to stabilize training and improve accuracy.
- This stage requires significant compute (approximately 140k H100 hours) and a specialized infrastructure setup (NeMo-Aligner, vLLM, Megatron-LM) with careful memory management using tensor, sequence, context, pipeline, and data parallelism across 72 nodes of 8xH100 GPUs, including FP8 inference generation.
- RL for Preference Optimization (Alignment): A short RL phase is applied after the reasoning RL to improve instruction following and general helpfulness.
- Instruction Following: The RLOO algorithm on synthetic multi-instruction prompts boosts IFEval scores.
- RLHF: Iterative online RPO (or GRPO for LN-Ultra) is used with the Llama-3.1-Nemotron-70B-Reward model on the HelpSteer2 dataset to maximize reward. This improves general chat capabilities (Arena Hard) and, surprisingly, boosts performance on other benchmarks, potentially by improving knowledge utilization. For LN-Nano, offline RPO is used.

Evaluations: The Llama-Nemotron models are evaluated on a range of reasoning (AIME, GPQA-Diamond, LiveCodeBench, MATH500) and non-reasoning (IFEval, BFCL V2 Live, Arena-Hard) benchmarks. Evaluations are conducted at 32k context length, using temperature 0.6 / top-p 0.95 for reasoning-on and temperature 0 for reasoning-off.

- LN-Nano (8B): Achieves strong reasoning performance for its size, demonstrating effective knowledge transfer via SFT and data curation. RPO improves IFEval scores.
- LN-Super (49B): Competitive with models in its class, performing on par with its Llama 3.3 70B base in reasoning-off mode and outperforming distilled competitors in reasoning-on mode. It shows a trade-off between IFEval and Arena-Hard scores, addressed partially via model merging.
- LN-Ultra (253B): Achieves state-of-the-art performance among open models across various benchmarks, notably surpassing DeepSeek-R1 on GPQA. The RL stage is crucial for this achievement. It is optimized for efficient deployment on a single 8xH100 node.
- Judging Capability: Llama-Nemotron models, including LN-Ultra and LN-Super, demonstrate strong performance on the JudgeBench evaluation, outperforming several proprietary and open-source models when used as LLM-as-a-Judge.

In conclusion, the Llama-Nemotron series represents a significant contribution to open-source reasoning models, achieving high performance and efficiency through a combination of architecture optimization, large-scale synthetic data distillation via SFT, and advanced reinforcement learning techniques. The release of models, data, and code provides valuable resources for future research and application development. The dynamic reasoning toggle offers a practical way to tailor model behavior based on task requirements.
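As a concrete illustration of the reward design in the reasoning-RL stage (referenced above), the sketch below combines an accuracy reward with a format reward over <think> tags. The exact verifiers, matching logic, and reward weights used in the paper are not reproduced here; everything below is a simplified, assumed stand-in.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response contains exactly one <think>...</think> block followed
    by a non-empty final answer, else 0.0 (an illustrative format check)."""
    if len(THINK_BLOCK.findall(response)) != 1:
        return 0.0
    final_answer = THINK_BLOCK.sub("", response).strip()
    return 1.0 if final_answer else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text outside the reasoning block matches the reference answer
    under simple normalization; real verifiers are usually more elaborate."""
    final_answer = THINK_BLOCK.sub("", response).strip().lower()
    return 1.0 if final_answer == ground_truth.strip().lower() else 0.0

def total_reward(response: str, ground_truth: str,
                 w_accuracy: float = 1.0, w_format: float = 0.1) -> float:
    # The weights are invented for illustration, not taken from the paper.
    return (w_accuracy * accuracy_reward(response, ground_truth)
            + w_format * format_reward(response))

sample = "<think>The element with 23 protons is vanadium.</think>Vanadium"
print(total_reward(sample, "Vanadium"))  # 1.1
```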