Nemotron 3 Super: Open-Source Hybrid MoE
- Nemotron 3 Super is an open-source, large-scale hybrid Mixture-of-Experts foundation model characterized by its efficient LatentMoE architecture and innovative Mamba-Transformer layer design.
- It employs a two-phase pre-training on a 25T-token corpus along with multi-stage supervised fine-tuning and reinforcement learning to enhance reasoning and tool-use capabilities.
- Optimized for high-throughput deployments, the model achieves competitive benchmark performance and supports multi-step collaborative agents and code-based applications.
Nemotron 3 Super is an open-source, large-scale, hybrid Mixture-of-Experts (MoE) foundation model developed by NVIDIA, uniquely characterized by its efficient LatentMoE architecture, hybrid Mamba-Transformer layer design, advanced quantization, and comprehensive multi-stage post-training regime. Positioned as the flagship agentic reasoning model in the Nemotron 3 family, it is optimized for collaborative agents, high-throughput deployments, and robust performance on a wide spectrum of language, code, tool-use, and reasoning benchmarks (NVIDIA et al., 14 Apr 2026, NVIDIA et al., 24 Dec 2025).
1. Architectural Overview
Nemotron 3 Super is a 120.6B-parameter hybrid Mixture-of-Experts model with only 12.7B parameters active per forward pass, owing to the LatentMoE sparsity mechanism. Its hybrid layer design alternates between Mamba-2 state-space sequence blocks, which keep a constant-size state for linear-time decoding, and sparse LatentMoE feed-forward layers, punctuated by infrequent standard self-attention “global anchor” layers (Grouped Query Attention with 2 KV heads). Key hyperparameters are detailed below.
| Configuration Component | Value |
|---|---|
| Total parameters | 120.6B |
| Active parameters (per token) | 12.7B |
| Layers | 88 |
| Model hidden dimension (d) | 4,096 |
| Latent MoE dimension (ℓ) | 1,024 |
| Experts per MoE layer (N′) | 512 |
| Top-K active experts (K′) | 22 |
| Mamba heads/layer | 128 |
| MTP layers (shared) | 2 |
The LatentMoE transformation projects token representations from the model dimension into a lower-dimensional latent space (d → ℓ), routes them to a larger set of experts, and up-projects the result (ℓ → d). This design principle yields greater expert capacity, higher per-token nonlinearity, and increased efficiency, since all-to-all and weight-read costs are scaled down by a factor of d/ℓ (4 for the configuration above). The model also supports context lengths of up to 1 million tokens without positional embedding layers (NVIDIA et al., 14 Apr 2026).
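The data flow described above can be sketched in NumPy. This is a minimal illustration with toy sizes, not NVIDIA's implementation: the function name, the ReLU expert activation, the softmax over the selected experts, and computing router scores on the full-width input are all assumptions made for the example.

```python
import numpy as np

def latent_moe_forward(x, w_router, w_down, w_up, experts, top_k):
    """Toy LatentMoE layer: route, down-project, run experts in latent space, up-project.

    x:        (tokens, d) input activations
    w_router: (d, n_experts) router weights (routing on the full-width input is an assumption)
    w_down:   (d, l) projection into the latent space
    w_up:     (l, d) projection back to the model dimension
    experts:  list of (w_in, w_out) pairs with shapes (l, h) and (h, l)
    """
    logits = x @ w_router                                   # (tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]       # ids of the chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)              # softmax over the chosen experts

    z = x @ w_down                                          # down-project: (tokens, l)
    mixed = np.zeros_like(z)
    for t in range(z.shape[0]):
        for slot in range(top_k):
            w_in, w_out = experts[top_idx[t, slot]]
            hidden = np.maximum(z[t] @ w_in, 0.0)           # ReLU as a placeholder activation
            mixed[t] += gates[t, slot] * (hidden @ w_out)
    return mixed @ w_up                                     # up-project: (tokens, d)

# Toy sizes only; the real model uses d = 4,096, l = 1,024, 512 experts, top-22.
rng = np.random.default_rng(0)
d, l, h, n_experts, top_k, tokens = 32, 8, 16, 16, 4, 5
x = rng.standard_normal((tokens, d))
w_router = rng.standard_normal((d, n_experts)) / np.sqrt(d)
w_down = rng.standard_normal((d, l)) / np.sqrt(d)
w_up = rng.standard_normal((l, d)) / np.sqrt(l)
experts = [(rng.standard_normal((l, h)) / np.sqrt(l),
            rng.standard_normal((h, l)) / np.sqrt(h)) for _ in range(n_experts)]
y = latent_moe_forward(x, w_router, w_down, w_up, experts, top_k)
```

Because each expert's weights have shape (ℓ, h) rather than (d, h), both the expert weights read per token and the activations exchanged between devices shrink by d/ℓ, which is where the efficiency claim comes from.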
2. Pre-Training and Optimization
Nemotron 3 Super is trained from scratch using a 25T-token corpus in two phases:
- Phase 1 (20T tokens, 80%): Diverse web crawl, Wikipedia, code, multilingual, academic, mathematical, and synthetic logic data.
- Phase 2 (5T tokens, 20%): High-quality filtered data (Wikipedia, curated books).
Pre-training employs mixed-precision computation: core linear GEMMs use NVIDIA’s block-quantized NVFP4 4-bit format (E2M1 elements, 16-element blocks, E4M3 block scales, an FP32 global scale, and stochastic rounding of gradients), while sensitive layers (the last 15% of layers, latent projections, QKV, embeddings, and MTP heads) are kept in BF16. Optimization uses AdamW with weight decay, a multi-stage learning rate schedule, and sliding checkpoint merges for improved convergence stability (NVIDIA et al., 14 Apr 2026, NVIDIA et al., 24 Dec 2025).
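To make the block-quantized format concrete, here is a minimal "fake quantization" sketch in NumPy. It maps each 16-element block onto the eight E2M1 magnitudes using one scale per block. Several simplifications are assumptions of this example and not part of the described recipe: the block scale is kept in float32 instead of being rounded to E4M3, the FP32 global scale is omitted, and rounding is to-nearest instead of stochastic. The function name is hypothetical.

```python
import numpy as np

# The eight non-negative magnitudes representable by a 4-bit E2M1 element.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # elements sharing one scale

def fake_quantize_e2m1(x):
    """Quantize a 1-D array block-wise to E2M1 and return (dequantized values, block scales)."""
    x = np.asarray(x, dtype=np.float32)
    assert x.size % BLOCK == 0, "length must be a multiple of the block size"
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Choose the scale so the largest magnitude in the block maps to 6.0.
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0).astype(np.float32)
    scaled = np.abs(blocks) / scale
    # Round each scaled magnitude to the nearest grid point.
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    deq = np.sign(blocks) * E2M1_GRID[idx] * scale
    return deq.reshape(x.shape), scale.ravel()

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
deq, scales = fake_quantize_e2m1(x)
max_abs_error = float(np.abs(deq - x).max())
```

The widest gap in the grid is between 4 and 6, so the absolute error of any element is bounded by one block scale; this is why a single outlier in a block degrades the precision of its 15 neighbours and why small blocks are used.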
A long-context extension phase continues training on up to 1M-token sequence lengths, leveraging highly parallelized training on NVIDIA GB200 hardware (context, tensor, and expert parallelism).
3. Post-Training Regimen
Post-training consists of staged methodologies designed to maximize reasoning and agentic capabilities:
- Supervised Fine-Tuning (SFT): Two-stage SFT over ≈7M samples, with agentic, tool-use, and multi-step reasoning tasks. Stage 1 emphasizes token-level reasoning quality under extended outputs; Stage 2 targets robust performance with long inputs and short outputs.
- Multi-Stage Reinforcement Learning:
  - Stage 1: RL from Verifiable Rewards, using asynchronous off-policy PPO variants across 21 environments (math, STEM, code, tool use, instruction following, etc.), with customized reward shaping for output quality and efficiency.
  - Stage 2: SWE-RL on end-to-end GitHub issue resolution using the OpenHands agent, with direct binary pass/fail reward from real-world test harnesses.
  - Stage 3: RLHF employing a principle-following reward model, drawing on large-scale preference data.
- Final “MTP healing” pass ensures speculative decoding performance is maintained post-RL by fine-tuning MTP heads (NVIDIA et al., 14 Apr 2026).
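The binary pass/fail signal used in the SWE-RL stage can be illustrated with a very small sketch. The function name and the shape of the test-result mapping are hypothetical; the actual harness and reward plumbing are not described here.

```python
def binary_test_reward(test_results):
    """Return 1.0 only if at least one test ran and every test passed, else 0.0.

    test_results: mapping from test name to a bool pass flag (hypothetical format).
    """
    if not test_results:
        return 0.0  # no evidence of success, so no reward
    return 1.0 if all(test_results.values()) else 0.0

reward_ok = binary_test_reward({"test_parse": True, "test_render": True})
reward_fail = binary_test_reward({"test_parse": True, "test_render": False})
```

A reward of this kind gives no partial credit, which keeps it hard to game but makes the learning signal sparse.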
Nemotron 3 Super natively supports “reasoning control” modes (off/regular/low-effort), with approximately 2% of SFT and RL traces dedicated to enforcing concise and correct outputs under low-effort instructional supervision.
4. Quantization, Inference, and Scalability
Nemotron 3 Super is distributed with multiple quantized and high-precision inference checkpoints:
- FP8/Post-training quantization: All GEMMs converted to FP8 (weights+activations), with state and KV-caches quantized appropriately to balance throughput and latency.
- NVFP4 quantization: Per-block weight scaling minimizes quantization error; selective layer promotions to higher precision ensure accuracy under a compute/memory budget averaging 4.75 bits/param.
- Performance: On 8×B200 GPU servers (vLLM/TRT-LLM backends), the model achieves 2.2× higher tokens/s throughput than GPT-OSS-120B and 7.5× that of Qwen3.5-122B at 8K input/64K output tokens, scaling near-linearly to 1M-token contexts.
- Cache quantization: Mamba SSM caches employ stochastic FP16 casting (Philox RNG), restoring both accuracy and output verbosity to baseline without retraining (NVIDIA et al., 14 Apr 2026).
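Stochastic casting, as mentioned for the Mamba SSM caches, rounds each value up or down with probability proportional to its distance from the two neighbouring FP16 values, so the rounding error averages out instead of accumulating in one direction. The following NumPy sketch illustrates the idea using NumPy's Philox bit generator. The function name is hypothetical, and the sketch assumes finite inputs well inside the FP16 range; it is not the production kernel.

```python
import numpy as np

def stochastic_round_fp16(x, rng):
    """Cast float32 values to float16, rounding up or down at random so the result is unbiased."""
    x = np.asarray(x, dtype=np.float32)
    nearest = x.astype(np.float16)                      # ordinary round-to-nearest cast
    nearest32 = nearest.astype(np.float32)
    below = np.nextafter(nearest, np.float16(-np.inf))  # next representable value down
    above = np.nextafter(nearest, np.float16(np.inf))   # next representable value up
    # Pick the pair of FP16 neighbours that brackets x.
    lo = np.where(nearest32 > x, below, nearest)
    hi = np.where(nearest32 > x, nearest, above)
    lo32, hi32 = lo.astype(np.float32), hi.astype(np.float32)
    p_up = (x - lo32) / (hi32 - lo32)                   # probability of rounding up
    take_hi = rng.random(x.shape) < p_up
    return np.where(take_hi, hi, lo).astype(np.float16)

rng = np.random.Generator(np.random.Philox(1234))
value = np.float32(1.0 + 2.0 ** -12)                    # a quarter of the way to the next FP16 value
x = np.full(100_000, value, dtype=np.float32)
stochastic = stochastic_round_fp16(x, rng)
plain = x.astype(np.float16)
mean_stochastic = float(stochastic.astype(np.float64).mean())
mean_plain = float(plain.astype(np.float64).mean())
```

In this example the plain cast always returns 1.0 and loses the fractional part entirely, while the mean of the stochastic cast stays close to the original value. For a recurrent state that is updated at every step, removing that systematic bias is the motivation for the technique.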
5. Benchmark Performance and Comparative Evaluation
Nemotron 3 Super demonstrates competitive results on a broad range of agentic and reasoning evaluations. Accuracy retention under quantization is high, with FP8/NVFP4 variants retaining >99.8% median accuracy across benchmarks.
| Benchmark | N3 Super | Qwen3.5-122B | GPT-OSS-120B |
|---|---|---|---|
| MMLU-Pro (5-shot) | 83.73 | 86.70 | 81.00 |
| GSM8K (8-shot EM) | 90.67 | 90.75 | – |
| HumanEval (pass@1) | 79.40 | – | 76.30 |
| LiveCodeBench v5 | 81.19 | 78.93 | 88.00 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.9 |
| AA-LCR (long-context) | 58.31 | 66.90 | 51.00 |
| RULER 1M (1M ctx) | 71.00 | 91.33 | 22.30 |
| MMLU-ProX (multilingual) | 79.36 | 85.06 | 76.59 |
Empirically, the model supports up to 1M-token context with moderate degradation in specific benchmarks, and demonstrates clear per-token efficiency gains from the LatentMoE technique (2–3% over standard MoE baselines of equivalent active size) (NVIDIA et al., 14 Apr 2026, NVIDIA et al., 24 Dec 2025).
6. Deployment, Open-Source Ecosystem, and Use Cases
Nemotron 3 Super's model weights (base, post-trained, quantized), training/optimization codebases, and all permissible datasets are public on Hugging Face and NVIDIA's GitHub portal, including the NeMo Data Designer, NeMo Gym, and NeMo RL frameworks.
Deployment is optimized for NVIDIA Blackwell Ultra (multi-GPU/NVLink) or A100 clusters, supporting standard MoE tensor-parallelism, context parallelism, and highly efficient NVFP4 inference. Native support is provided for user-specified reasoning budgets (chain-of-thought termination via </think> token) (NVIDIA et al., 24 Dec 2025).
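A reasoning budget of this kind can be enforced on the serving side by counting reasoning tokens and injecting the `</think>` token once the budget is spent, after which the model continues with its answer. The sketch below shows that control loop against a toy stand-in model. The function names, the `<think>` and `<eos>` markers, and the token-level interface are assumptions for illustration, not the API of any particular serving stack.

```python
def generate_with_budget(next_token, prompt, budget, end_tag="</think>", eos="<eos>", max_new=256):
    """Generate tokens, forcing `end_tag` once `budget` reasoning tokens have been produced.

    next_token: callable taking the current token list and returning the next token
                (a hypothetical stand-in for one decoding step of the model).
    """
    context = list(prompt)
    thinking, used = True, 0
    for _ in range(max_new):
        if thinking and used >= budget:
            tok = end_tag                  # budget exhausted: close the reasoning segment
        else:
            tok = next_token(context)
        context.append(tok)
        if tok == eos:
            break
        if thinking:
            if tok == end_tag:
                thinking = False           # the model (or the budget) ended the reasoning
            else:
                used += 1
    return context[len(prompt):]

def toy_model(context):
    """Toy model that would reason forever unless the reasoning segment is closed."""
    if "</think>" not in context:
        return "step"
    return "answer" if context[-1] == "</think>" else "<eos>"

output = generate_with_budget(toy_model, ["<think>"], budget=3)
# output == ["step", "step", "step", "</think>", "answer", "<eos>"]
```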
Primary use cases include multi-step collaborative agents, end-to-end IT ticket automation, multi-tool reasoning pipelines, code reasoning, and general-purpose question answering. Internal evaluations show substantial improvements in administrative and code-based workflows compared to prior family models, with marked reductions in latency and human escalation rates (NVIDIA et al., 24 Dec 2025).
7. Comparison to Related MoE and Reasoning Models
Nemotron 3 Super is distinguished from other contemporary open MoE and reasoning-focused models by its:
- Hybrid Mamba-Transformer design for linear-time sequence modeling and efficient memory use.
- LatentMoE architecture, enabling orders-of-magnitude more expert capacity under fixed bandwidth/FLOP constraints.
- Multi-Token Prediction (MTP) layers and native speculative decoding, improving throughput for long-form and agentic tasks.
- Training in native NVFP4, with quantization pipelines ensuring <1% accuracy loss at 4–8 bits/param.
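The throughput benefit of MTP-based speculative decoding comes from the verification step: cheap draft tokens are kept only as long as they agree with what the full model would have produced. A minimal greedy version of that acceptance rule is sketched below. The function names and the toy target model are hypothetical, and a real system would score all draft positions in one batched forward pass instead of calling the model once per position.

```python
def verify_draft(target_next, context, draft):
    """Accept the longest prefix of `draft` that matches the target model's greedy choices.

    Returns the accepted tokens plus one token from the target itself: either the
    correction at the first mismatch or a bonus token if the whole draft matched.
    """
    accepted = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)      # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # whole draft accepted: bonus token
    return accepted

def toy_target(tokens):
    """Toy target model: the next token is always the previous token plus one, modulo 10."""
    return (tokens[-1] + 1) % 10

partial = verify_draft(toy_target, [1, 2, 3], [4, 5, 9])   # third draft token is wrong
full = verify_draft(toy_target, [1, 2, 3], [4, 5, 6])      # every draft token is right
# partial == [4, 5, 6] and full == [4, 5, 6, 7]
```

Under this rule the output is identical to plain greedy decoding with the target model; the draft only changes how many tokens are committed per verification step.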
Related model lines such as Llama-Nemotron Super (49B) employ heterogeneous NAS-based Transformer blocks derived from Llama 3 with inference-optimized architectures, dynamic reasoning toggles, and specialist post-training, but with lower parameter scale and context capacity (e.g., 128K max tokens, limited MoE) (Bercovich et al., 2 May 2025). A plausible implication is that Nemotron 3 Super, through its hardware-aware design and comprehensive multi-environment RL, offers a favorable balance of accuracy, reasoning capability, and production deployment efficiency for large-scale agentic applications.
References: (NVIDIA et al., 14 Apr 2026, NVIDIA et al., 24 Dec 2025, Bercovich et al., 2 May 2025)