Nemotron 3: Open Hybrid LLM Suite
- Nemotron 3 is a family of open-source large language models characterized by a hybrid Mixture-of-Experts and Mamba–Transformer architecture that supports extreme context lengths.
- The suite encompasses three variants—Nano, Super, and Ultra—each optimized for distinct tasks such as agentic reasoning, multi-step tool use, and efficient low-cost inference.
- Leveraging up to 25 trillion tokens in training, Nemotron 3 employs advanced reinforcement learning, NVFP4 quantization, and open-release policies to drive state-of-the-art performance.
Nemotron 3 is a family of open-source LLMs developed by NVIDIA featuring a hybrid Mixture-of-Experts (MoE) Mamba–Transformer architecture, advanced reinforcement-learning post-training, and aggressive low-bit quantization. The suite encompasses three models (Nano, Super, and Ultra), each architected and tuned for distinct agentic, reasoning, and multi-step tool-use tasks at varying computational scales. The models combine state-space sequence modeling, structured expert routing, and support for extreme context lengths, offering best-in-class inference efficiency, flexible post-training control, and comprehensive open access to models, recipes, and associated datasets (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025).
1. Model Family, Architecture, and Variants
Nemotron 3 consists of three core variants, Nano (≈30 B parameters, 3 B active), Super (≈70 B, 8 B active), and Ultra (≈200 B, 20 B active), all built on a tightly integrated hybrid of Mamba-2 state-space layers, sparse MoE feed-forward layers, and a small number of grouped-query attention (GQA) layers. Key distinctions within the family concern parameter count, MoE expert configuration, active-parameter ratio, quantization regime, and inference augmentations (notably LatentMoE and Multi-Token Prediction in Super/Ultra).
Each model embodies a backbone stack where:
- Mamba-2 layers provide constant-size per-step state and fixed per-token compute, essential for supporting context windows up to 1 M tokens.
- Sparse MoE layers implement parameter scaling via routed expert MLPs. For an input $x$, the routing network produces weights and the layer output is $y = \sum_{i \in \mathrm{TopK}(g(x))} g_i(x)\, E_i(x)$, where the $g_i(x)$ are routing weights, the $E_i$ are two-layer expert MLPs, and $\mathrm{TopK}$ selects the top-$k$ experts per input (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025); a minimal routing sketch appears after the LatentMoE paragraph below.
- GQA layers (2 KV heads) serve as global information-routing nodes, a crucial complement to the Mamba and MoE layering.
Super and Ultra incorporate LatentMoE, where expert routing and compute occur in a lower-dimensional latent space, reducing all-to-all bandwidth and enabling higher expert scaling at near-constant cost.
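The following is a minimal PyTorch sketch of the routed-expert computation above, with a down/up projection to imitate LatentMoE-style routing in a latent space. All dimensions, module names, and the absence of load balancing are illustrative assumptions, not the published Nemotron 3 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentTopKMoE(nn.Module):
    """Toy top-k MoE block; with d_latent < d_model it mimics LatentMoE-style
    routing and expert compute in a reduced latent space (hypothetical sizes)."""

    def __init__(self, d_model=1024, d_latent=256, d_ff=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.down = nn.Linear(d_model, d_latent)       # project into latent space
        self.up = nn.Linear(d_latent, d_model)         # project back to model width
        self.router = nn.Linear(d_latent, n_experts)   # routing network g(x)
        self.experts = nn.ModuleList([                 # two-layer expert MLPs E_i
            nn.Sequential(nn.Linear(d_latent, d_ff), nn.GELU(), nn.Linear(d_ff, d_latent))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        z = self.down(x)
        weights = F.softmax(self.router(z), dim=-1)    # routing weights g_i(x)
        topw, topi = weights.topk(self.k, dim=-1)      # TopK expert selection
        topw = topw / topw.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(z)
        for slot in range(self.k):                     # y = sum_i g_i(x) * E_i(x)
            idx = topi[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(z[mask])
        return self.up(out)

tokens = torch.randn(16, 1024)
print(LatentTopKMoE()(tokens).shape)   # torch.Size([16, 1024])
```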
Model summary table:
| Variant | Total Params | Active Params | Key Features | Use Cases |
|---|---|---|---|---|
| Nano | ~30 B | ~3 B | Standard MoE-Mamba | Edge/chat/RAG, low-cost, LRLC |
| Super | ~70 B | ~8 B | LatentMoE, MTP, NVFP4 | Collaboration, high-volume agent |
| Ultra | ~200 B | ~20 B | LatentMoE, MTP, NVFP4, SOTA | Frontier reasoning, tool chains |
LRLC: Long-range, long-context.
2. Training Corpus and Data Engineering
Pretraining leverages up to 25 trillion tokens, with >3 T new tokens since Nemotron 2. This includes Nemotron-CC-v2.1 (Common Crawl), Nemotron-CC-Code-v1, Nemotron-CC-Math (over 133B math tokens (Mahabadi et al., 20 Aug 2025)), Wikipedia, high-quality SFT-style prompts, and a multilingual blend (19 languages). Nemotron-CC-Math employs layout-aware rendering (Lynx browser) to preserve equation integrity and code blocks, followed by LLM-driven cleaning and MinHash-LSH deduplication. Statistical summaries for Nemotron-CC-Math-3+:
- 101.15M documents, 133.26B tokens, 980,922 unique domains
- Disciplines: 60.3% mathematics, 11.2% physics, 12.0% CS, with robust representation of statistics, economics, and other sciences
- Quality filtering via a FineMath classifier, fuzzy deduplication, benchmark decontamination
Pretraining on Nemotron-CC-Math-3+ yields substantial downstream improvements: +9.6 points on MATH and +14.3 points on MBPP+ over prior open math corpora (Mahabadi et al., 20 Aug 2025).
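As a rough illustration of the MinHash-LSH fuzzy-deduplication step described above, the sketch below uses the open-source datasketch library as a stand-in; the shingling scheme, permutation count, and similarity threshold are assumptions, not the pipeline's actual settings.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:  # 5-gram shingles
        m.update(shingle.encode("utf8"))
    return m

docs = {"a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog",   # near-duplicate of "a"
        "c": "completely unrelated mathematics document"}

lsh = MinHashLSH(threshold=0.5, num_perm=128)   # assumed Jaccard threshold
kept = []
for key, text in docs.items():
    m = minhash(text)
    if not lsh.query(m):      # keep only if no near-duplicate is already indexed
        lsh.insert(key, m)
        kept.append(key)
print(kept)                    # typically ['a', 'c']; 'b' filtered as near-duplicate
```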
3. Quantization and Efficient Training Techniques
Nemotron 3 Super and Ultra employ native NVFP4 (NVIDIA FP4) quantization:
- 16-element micro-block scaling with an E2M1 element format, E4M3 block scales, and an FP32 global (per-tensor) scale
- 2D block-scaling for weights, Hadamard transforms for weight gradients, stochastic rounding
- The trailing 15% of layers (notably QKV, output, and Mamba blocks) kept in BF16 for numerical stability
NVFP4 offers approximately 3× GEMM throughput improvement over FP8 with <1% loss in final accuracy (<0.6% for the 200 B model) (NVIDIA et al., 24 Dec 2025).
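A simplified NumPy sketch of the micro-block idea follows: each 16-element block is scaled and rounded to the nearest FP4 (E2M1) value. This is an illustration only; it uses round-to-nearest instead of stochastic rounding and omits E4M3 quantization of the block scales, the 2D weight scaling, and the Hadamard transforms.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format (sign handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    """Quantize one 16-element micro-block: choose a per-block scale so the largest
    magnitude maps to 6.0 (the E2M1 maximum), then round each scaled element to the
    nearest representable value."""
    scale = max(np.max(np.abs(block)) / 6.0, 1e-12)   # per-block scale (E4M3 in hardware)
    scaled = block / scale
    nearest = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1[None, :]), axis=1)
    return np.sign(scaled) * E2M1[nearest], scale

def fake_nvfp4_roundtrip(x, block_size=16):
    blocks = x.reshape(-1, block_size)
    out = np.empty_like(blocks)
    for i, b in enumerate(blocks):
        q, s = quantize_block(b)
        out[i] = q * s                                # dequantize for comparison
    return out.reshape(x.shape)

w = np.random.randn(8, 16).astype(np.float32)
w_hat = fake_nvfp4_roundtrip(w)
print("mean |error|:", float(np.mean(np.abs(w - w_hat))))
```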
4. Advanced Post-Training: Multi-Env RL, Chain-of-Thought, and Budget Control
All Nemotron 3 models undergo extensive post-training, including:
- Supervised fine-tuning (SFT): Instruction following, chat, multi-step tool use, math/code, formal reasoning, safety, and multilingual tasks, sequence-packed up to 256 K tokens.
- Multi-environment reinforcement learning (RL): Group Relative Policy Optimization (GRPO) across competitive math/coding, QA, schema output, instruction following, long-context retrieval, and agentic tool use. The RL loss uses masked importance sampling with group-relative advantage estimates $\hat{A}_i = (r_i - \operatorname{mean}\{r_j\}) / \operatorname{std}\{r_j\}$ computed over the rollouts for each prompt (a small numerical sketch follows this list). The NeMo-RL and NeMo-Gym frameworks are open-sourced (NVIDIA et al., 24 Dec 2025).
- Reinforcement learning from human feedback (RLHF): Preference models (GenRM), group-relative length normalization for explicit thinking/answering budget, and conciseness bonuses.
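A minimal numerical sketch of the group-relative advantage computation referenced above (only the normalization step, not the full GRPO objective with importance ratios and masking):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the mean/std of its own prompt group,
    as in GRPO-style updates."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled rollouts scored by a verifier (illustrative rewards).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))   # ≈ [ 1. -1. -1.  1.]
```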
A granular “reasoning budget control” mechanism allows setting a token budget for thinking; once the budget is exhausted, the model switches from chain-of-thought generation to answer generation (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025). A sketch of how such a budget might be enforced at decode time appears below.
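The following hypothetical decoding loop shows one way a thinking-token budget could be enforced; the token strings, the end-of-thinking marker, and the `next_token` callable are illustrative stand-ins, not the actual Nemotron 3 control interface.

```python
def generate_with_budget(next_token, prompt, think_budget=64, max_answer=256,
                         end_think="[/think]", eos="[eos]"):
    """Force the transition from the thinking segment to the answer once
    think_budget tokens have been spent on reasoning."""
    tokens, thinking, spent = list(prompt), True, 0
    while True:
        if thinking and spent >= think_budget:
            tokens.append(end_think)          # force the switch to answering
            thinking = False
            continue
        tok = next_token(tokens)
        tokens.append(tok)
        if thinking:
            spent += 1
            if tok == end_think:
                thinking = False              # model closed its reasoning early
        elif tok == eos or len(tokens) > len(prompt) + think_budget + max_answer:
            break
    return tokens

# Toy "model" that keeps thinking unless cut off, then immediately answers.
toy = lambda toks: "[eos]" if toks[-1] == "[/think]" else "step"
print(generate_with_budget(toy, ["Q:"], think_budget=3))
# ['Q:', 'step', 'step', 'step', '[/think]', '[eos]']
```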
AceReason-Nemotron demonstrates that in small/mid-sized models, stage-wise RL (math-only followed by code-only), hard-prompt curriculum, and strict on-policy updates appreciably outperform distillation alone—yielding +14.6–17.2 pp on AIME25 (math) and +5.8–7.4 pp on LiveCodeBench (code) compared to distillation-only baselines (Chen et al., 22 May 2025).
5. Multi-Token Prediction and Long-Context Scaling
Super and Ultra integrate Multi-Token Prediction (MTP) to predict blocks of future tokens at each input position. The MTP loss augments the next-token objective with auxiliary predictions of the following $K$ tokens,
$$\mathcal{L}_{\text{MTP}} = -\sum_{t} \sum_{k=1}^{K} \lambda_k \log p_\theta\!\left(x_{t+k} \mid x_{\le t}\right),$$
and yields only marginal incremental compute while adding +1.2% to 5-shot MMLU and +2.4% to average downstream-task accuracy. Speculative decoding achieves 97% acceptance for initial tokens on 8 B models (NVIDIA et al., 24 Dec 2025).
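A short PyTorch sketch of such a multi-token objective follows; the head layout, uniform $\lambda_k$ weights, and tensor shapes are assumptions for illustration, not the Nemotron 3 implementation.

```python
import torch
import torch.nn.functional as F

def mtp_loss(logits_per_head, targets, lambdas=None):
    """Multi-token-prediction loss sketch: head k at position t predicts x_{t+k}.

    logits_per_head: list of K tensors, each (batch, seq, vocab)
    targets:         (batch, seq) next-token ids (i.e. x_{t+1} at position t)
    """
    K = len(logits_per_head)
    lambdas = lambdas or [1.0 / K] * K
    loss = 0.0
    for k, (lam, logits) in enumerate(zip(lambdas, logits_per_head), start=1):
        # Head k is supervised by tokens shifted k-1 extra steps into the future.
        pred = logits[:, : logits.size(1) - (k - 1), :]
        tgt = targets[:, k - 1:]
        loss = loss + lam * F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), tgt.reshape(-1)
        )
    return loss

B, T, V, K = 2, 16, 100, 3
heads = [torch.randn(B, T, V) for _ in range(K)]
tokens = torch.randint(V, (B, T))
print(mtp_loss(heads, tokens))
```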
Nemotron 3’s long-context support is achieved natively through pretraining (up to 512 K tokens), fine-tuning (256 K), and RL (32 K), with no positional-encoding ramping. The constant-state property of the Mamba layers keeps memory per token constant, supporting sequences of up to 1 million tokens (see the toy recurrence below). On 1 M-token sequences, performance degrades gracefully rather than collapsing catastrophically as in baselines (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025).
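To make the constant-state point concrete, here is a toy diagonal linear state-space recurrence (not Mamba-2’s actual selective scan): the hidden state has a fixed size regardless of sequence length, so per-token memory does not grow with context.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy recurrence h_t = A*h_{t-1} + B*x_t, y_t = C·h_t with a diagonal A.
    The state h has fixed size d_state no matter how long the input is."""
    h = np.zeros(A.shape[0])
    ys = np.empty(len(x))
    for t, xt in enumerate(x):       # memory used: O(d_state), not O(t)
        h = A * h + B * xt
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(0)
d_state = 16
A = rng.uniform(0.8, 0.99, d_state)  # stable per-channel decay
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)
print(ssm_scan(rng.normal(size=100_000), A, B, C)[-3:])  # long scan, fixed-size state
```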
6. Inference Throughput, Benchmarks, and Empirical Results
Benchmarks on Blackwell Ultra and H200 GPUs demonstrate:
- Nano: 15 K tokens/s (8 K in / 16 K out), 3.3× the throughput of Qwen3-30B and 2.2× that of GPT-OSS-20B (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025).
- Super: ~10 K tokens/s; Ultra: ~7 K tokens/s; all variants retain ≥1.5× the throughput of comparable MoE baselines even at extreme context lengths.
- Active parameters per token in Nano: 3.2 B in MoE experts + 0.4 B in embeddings, ≈11.4% of the 31.6 B total parameters (worked check after this list).
- FP8 quantization for inference (post-training quantization/PTQ) yields ~99% accuracy retention with up to 2× throughput increase (NVIDIA et al., 23 Dec 2025).
- On standard benchmarks (MMLU-Pro, HumanEval, MBPP, AIME25, GPQA, SWE-Bench, RULER), Nemotron 3 Nano matches or exceeds open competitors at comparable scale, particularly excelling in long-context and reasoning workloads.
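As a quick check of the Nano active-parameter ratio quoted above:
$$\frac{3.2\,\text{B} + 0.4\,\text{B}}{31.6\,\text{B}} = \frac{3.6}{31.6} \approx 0.114 \approx 11.4\%.$$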
Empirical results on small/mid-size models also demonstrate clear state-of-the-art advances on math reasoning and code synthesis when combining Nemotron architecture with staged RL training (Chen et al., 22 May 2025).
7. Open Release Policy and Licensing
NVIDIA provides open access to models, training and post-training code, data generators, and synthetic as well as real data. Nano’s model weights, codebase, and recipes are released under a permissive NVIDIA license; Super and Ultra weights, LatentMoE, MTP modules, NeMo-RL, NeMo-Gym, and multi-env RL data will be released under Apache 2.0. More than 10 trillion pretraining tokens and full evaluation harnesses will be openly available, facilitating further academic and applied investigation (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025). The associated math corpus, Nemotron-CC-Math, is also fully open-source (Mahabadi et al., 20 Aug 2025).
References
- NVIDIA Nemotron 3: Efficient and Open Intelligence (NVIDIA et al., 24 Dec 2025)
- Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning (NVIDIA et al., 23 Dec 2025)
- Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset (Mahabadi et al., 20 Aug 2025)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning (Chen et al., 22 May 2025)