Nemotron 3: Efficient Hybrid Language Models
- The Nemotron 3 family comprises open, large-scale language models that combine a hybrid Mamba–Transformer architecture with sparse MoE layers for efficient reasoning and extended context handling.
- They are trained with a two-stage regimen over massive token corpora, incorporating long-context mixing and reinforcement learning to optimize performance on agentic and conversational tasks.
- Key innovations include hardware-aware NVFP4 quantization, LatentMoE routing, and grouped-query attention, setting new benchmarks in throughput, accuracy, and scalability.
The Nemotron 3 family constitutes a series of large-scale, open, and highly efficient LLMs engineered by NVIDIA to simultaneously optimize reasoning accuracy, throughput, and long-context capabilities in agentic and conversational artificial intelligence applications. Leveraging a hybrid Mamba–Transformer backbone augmented with sparse Mixture-of-Experts (MoE) layers—and, for larger variants, hardware-optimized quantization and novel expert routing techniques—Nemotron 3 models set new Pareto frontiers for efficiency, context window length, and multi-environment agentic reasoning, with all model artifacts and recipes released under open and commercially permissive licenses (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025, Bercovich et al., 2 May 2025).
1. Model Architecture and Variant Taxonomy
The Nemotron 3 family encompasses three principal variants: Nano (∼30B–31.6B parameters), Super (∼72–75B), and Ultra (largest scale, e.g., 235B+). All employ a Mamba–Transformer hybrid architecture with sparse Mixture-of-Experts layers as core capacity scaling elements. The building blocks alternate as follows:
- Mamba-2 Layers: Linear recurrent state-space modules with constant-time and constant-memory forward propagation, obviating quadratic scaling and enabling efficient long-context handling.
- Sparse MoE Layers: Conditional routing of tokens through a subset of experts, by default 128 experts with 6 active per token in Nano, providing ~10× parameter sparsity.
- Grouped-Query Attention (GQA): Self-attention with a reduced number of key–value heads, interleaved every few layers to preserve global information routing at minimal compute and KV-cache overhead.
Notably, in the Nano 30B-A3B variant, the architecture comprises 52 layers, a model dimension of 2688, 64 Mamba heads (head dim 64), and 128 total experts (6 active per token). The MoE routing involves a two-layer MLP and a top-k selection of softmax- or sigmoid-activated gate scores, with optional normalization and a dedicated auxiliary load-balancing loss to prevent expert collapse (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025).
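The routing described above can be illustrated with a short PyTorch-style sketch. This is an illustration of the general top-k gating and auxiliary load-balancing pattern only, not the released Nemotron 3 implementation; the gate hidden width, activation choice, and exact loss formulation below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k MoE router: two-layer gate MLP, softmax gate scores,
    and a Switch-style auxiliary load-balancing loss. Hyperparameters are
    assumptions, not the released Nemotron 3 values."""

    def __init__(self, d_model=2688, n_experts=128, top_k=6, gate_hidden=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(d_model, gate_hidden),
            nn.SiLU(),
            nn.Linear(gate_hidden, n_experts),
        )
        self.top_k, self.n_experts = top_k, n_experts

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.gate(x)                  # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)      # sigmoid gating is an alternative
        top_p, top_idx = probs.topk(self.top_k, dim=-1)
        top_p = top_p / top_p.sum(-1, keepdim=True)   # optional renormalization

        # Auxiliary load-balancing loss: fraction of routing slots assigned to
        # each expert times its mean gate probability, scaled by n_experts.
        dispatch = F.one_hot(top_idx, self.n_experts).float().mean(dim=(0, 1))
        importance = probs.mean(dim=0)
        aux_loss = self.n_experts * (dispatch * importance).sum()
        return top_idx, top_p, aux_loss
```

The auxiliary term is minimized when expert usage is uniform, which is one standard way to discourage the expert collapse mentioned above.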
A summary table of the main variants follows:
| Model | Params (B) | Active Params (B) | Key Innovations | Target Use |
|---|---|---|---|---|
| Nano 30B-A3B | 31.6 | 3.2 (3.6 incl. emb) | Hybrid Mamba–Transformer, Sparse MoE | Reasoning agents (cost-efficient) |
| Super | ∼72–75 | 8 | LatentMoE, MTP, NVFP4 | Collaborative agentic workloads |
| Ultra | 235+ | — | LatentMoE, MTP, largest expert pools | Maximum accuracy, reasoning SOTA |
Super and Ultra exploit NVFP4 quantization, LatentMoE, and Multi-Token Prediction (MTP) layers for further efficiency and quality enhancements (NVIDIA et al., 24 Dec 2025).
2. Training Regimen and Data Pipeline
Nemotron 3 models are pretrained on massive web, code, and synthetic blends using a two-stage regime (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025):
- Stage 1: General corpus mix (23.5T tokens), employing proprietary web-crawl filtering, code (open and synthetic), STEM, and diverse domains.
- Stage 2: Quality-focused subset (1.5T tokens), emphasized during the final 6% of training steps.
- Novelty: Over 3T new unique tokens appear compared to Nemotron 2, including 2.5T from Nemotron-CC-v2.1 and 0.5T from specialized synthetic datasets.
Long-context capability is developed via a targeted "LC-Phase" mixing ultra-long sequences (512K tokens) with standard context (4K) and retrieval QA. Supervised fine-tuning (SFT) is performed on highly agentic, reasoning-centric tasks in mixtures including math, code, tool use, formal proofs, and granular instruction templates with explicit control of reasoning verbosity and budget (NVIDIA et al., 23 Dec 2025).
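The LC-Phase can be pictured as a weighted mixture over context-length buckets. The following sketch is purely illustrative; the bucket names and sampling weights are assumptions, not the published Nemotron 3 data recipe.

```python
import random
from collections import Counter

# Illustrative LC-Phase mixture: weights and bucket names are assumptions.
LC_PHASE_MIX = {
    "ultra_long_512k": 0.2,   # sequences packed toward ~512K tokens
    "standard_4k": 0.6,       # ordinary pretraining-length sequences
    "retrieval_qa": 0.2,      # long-document retrieval QA examples
}

def sample_bucket(mix=LC_PHASE_MIX):
    """Draw a data bucket according to the mixture weights."""
    buckets, weights = zip(*mix.items())
    return random.choices(buckets, weights=weights, k=1)[0]

# Sanity check: the empirical draw frequencies approach the target weights.
print(Counter(sample_bucket() for _ in range(10_000)))
```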
Reinforcement learning in post-training follows a multi-environment paradigm:
- Multi-environment RL (RLVR): GRPO algorithm with masked importance sampling, synchronous (Nano) or asynchronous actor–learner setups, and on-policy updates (a minimal GRPO sketch follows this list).
- RL from Human Feedback (RLHF): Generative reward models (trained from large-scale preferences) and group-relative quality/length controls enable alignment and output conciseness.
- MTP Objective (Super/Ultra): Next-M token prediction loss provides richer rollout planning during RL and supports speculative decoding for latency minimization (NVIDIA et al., 24 Dec 2025).
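The GRPO objective with masked importance sampling referenced above can be sketched generically as follows. This is a standard group-relative formulation with PPO-style clipping, not the NeMo-RL implementation; the clipping threshold and normalization constants are assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """Generic GRPO-style loss (sketch, not the NeMo-RL implementation).

    logp_new, logp_old: (group, seq) per-token log-probs under the current
        and behavior policies for a group of rollouts of the same prompt.
    rewards: (group,) scalar verifiable rewards, one per rollout.
    mask: (group, seq) 1 for generated (non-prompt, non-pad) tokens.
    """
    # Group-relative advantage: standardize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (group,)
    adv = adv.unsqueeze(-1)                                     # broadcast over tokens

    # Masked importance-sampling ratio with PPO-style clipping.
    ratio = torch.exp((logp_new - logp_old) * mask)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped) * mask

    # Average over valid (generated) tokens only.
    return per_token.sum() / mask.sum().clamp(min=1)
```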
3. Efficiency Innovations: MoE, LatentMoE, NVFP4, and MTP
- MoE Sparsity: Only a small subset (e.g., 6/128) of experts is active per token, reducing forward-pass compute by ~10× over dense equivalents at the same parameter scale. Load-balancing losses ensure robust expert utilization and prevent routing collapse.
- LatentMoE (Super/Ultra): Expert routing, computation, and projection are conducted in a lower-dimensional latent space. Let $x \in \mathbb{R}^{d}$ be a token's hidden state; a shared down-projection maps it to $z \in \mathbb{R}^{d_\ell}$ with $d_\ell < d$, experts are routed and applied in $\mathbb{R}^{d_\ell}$, and a shared up-projection returns the result to $\mathbb{R}^{d}$. This reduces expert memory and communication traffic by roughly a factor of $d/d_\ell$ and enables a proportionally greater number of experts for a fixed cost. Empirically, LatentMoE increases accuracy by 2–4.6% across MMLU, Math, and Code benchmarks at constant active parameter count (NVIDIA et al., 24 Dec 2025); a minimal sketch of this down-project/route/up-project pattern follows this list.
- NVFP4 Quantization: Weights, activations, and gradients are quantized to NVFP4 using 2D block scaling and stochastic rounding, with sensitive computations in BF16 or MXFP8. This delivers up to 3× FP8-equivalent throughput at <1% accuracy degradation (NVIDIA et al., 24 Dec 2025).
- Multi-Token Prediction (MTP): Training the model to predict future tokens at each step accelerates speculative decoding and decreases generation latency. MTP layers provide up to +2.8% on MMLU-Pro and +2% on base MMLU compared to single-token baselines (NVIDIA et al., 24 Dec 2025).
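The LatentMoE sketch referenced above is shown below. The dimensions, expert MLP shape, shared projections, and naive dispatch loop are illustrative assumptions rather than the production kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoE(nn.Module):
    """Sketch of LatentMoE: route and run experts in a smaller latent space,
    then project back to model width. Shapes and sizes are assumptions."""

    def __init__(self, d_model=4096, d_latent=1024, n_experts=64, top_k=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # shared down-projection
        self.up = nn.Linear(d_latent, d_model, bias=False)     # shared up-projection
        self.router = nn.Linear(d_latent, n_experts, bias=False)
        # Each expert is a small MLP operating entirely in the latent space.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent),
                          nn.SiLU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                         # x: (tokens, d_model)
        z = self.down(x)                          # (tokens, d_latent)
        gate = F.softmax(self.router(z), dim=-1)
        top_p, top_idx = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):               # naive dispatch, for clarity only
            for e in range(len(self.experts)):
                sel = top_idx[:, k] == e
                if sel.any():
                    out[sel] += top_p[sel, k:k + 1] * self.experts[e](z[sel])
        return self.up(out)                       # back to model dimension
```

Because the router, expert MLPs, and token dispatch all operate on $d_\ell$-dimensional vectors, both expert parameters and all-to-all communication shrink with the latent width, which is the cost saving the LatentMoE item describes.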
4. Performance, Benchmarking, and Long-Context Scaling
Nemotron 3 establishes new efficiency benchmarks within and beyond its parameter class:
- Throughput: Nano 30B-A3B achieves 3.3× the token throughput of Qwen3-30B-A3B-Thinking and 2.2× that of GPT-OSS-20B at 8K+16K token sequences on H200/H100, with only 3.2B (3.6B with embeddings) active parameters per forward pass (NVIDIA et al., 23 Dec 2025).
- Benchmark results: On standard evaluations:
| Benchmark | Nano 30B-A3B | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| MMLU-Pro | 65.05 | 61.71 | — |
| HumanEval | 78.05 | 70.73 | — |
| GSM8K | 92.34 | 89.01 | — |
| RULER @64K | 87.50 | 63.55 | — |
| AIME25 (no tools) | 89.06 | 85.00 | 91.70 |
| SWE-Bench | 38.76 | 22.00 | 34.00 |
| IFBench | 71.51 | 51.00 | 65.00 |
| RULER-100 @1M | 86.34 | 77.50 | — |
Nano maintains SOTA or competitive accuracy while providing best-in-class efficiency (NVIDIA et al., 23 Dec 2025).
- Long-context scaling: All models avoid RoPE in favor of Mamba-based implicit position encoding, enabling seamless handling of, and accuracy retention at, contexts up to 1M tokens (Nano: RULER-100 score at 1M tokens = 86.34; prior dense hybrid at 1M = 23.43) (NVIDIA et al., 24 Dec 2025).
- Agentic and Conversational Reasoning: On multi-turn dialog, tool-use, and agentic reasoning environments, the models achieve 5–10% absolute improvements over transformer and hybrid predecessors (NVIDIA et al., 24 Dec 2025).
5. Comparative Analysis and Related Models
The Nemotron 3 family can be positioned relative to peer architectures and within the broader progression of efficient reasoning models:
- Versus Nemotron 2: Deployment cost is halved for the same parameter class, with gains of 3–10 points in math, code, and long-context accuracy (NVIDIA et al., 23 Dec 2025).
- Versus Qwen3, GPT-OSS, Llama-Nemotron: Nemotron 3 outperforms on reasoning and code tasks, delivers 2–3× greater throughput using hardware-oriented quantization (FP8, NVFP4), and uniquely supports 1M-token sequences. Llama-Nemotron leverages pruning and NAS for efficient Transformer variants but does not employ the MoE/Mamba hybridization central to Nemotron 3 (Bercovich et al., 2 May 2025).
The family generalizes the efficient hybrid design pattern, extending prior results from Nemotron-H (NVIDIA et al., 4 Apr 2025), while adding MoE-based scalability and latency-minimizing features.
6. Implementation, Open Release, and Practical Utilization
All Nemotron 3 models, along with training recipes, quantization scripts, MoE/Mamba layer code, data selection pipelines, and RL frameworks (NeMo-RL, NeMo-Gym), are released under the NVIDIA Open Model License or Apache 2.0 (for code) (NVIDIA et al., 24 Dec 2025). Model artifacts are available on Hugging Face and NVIDIA repositories, supporting integration with standard inference stacks and large-context deployment. Super and Ultra variants, with full LatentMoE and NVFP4 tooling, follow a staged release.
Runtime features such as a dynamic reasoning-budget controller, “detailed thinking on/off” system prompts for toggling reasoning verbosity, and explicit length controls during inference are supported at the checkpoint/system prompt level in both Nemotron 3 and related Llama-Nemotron models (Bercovich et al., 2 May 2025).
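As an illustration of how such a system-prompt toggle is typically exercised at inference time, the sketch below uses the Hugging Face transformers chat-template API. The model identifier is a placeholder, and the exact prompt strings and controller knobs should be taken from the published model cards.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/<nemotron-3-checkpoint>"   # placeholder, not a verified repo name

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# Reasoning verbosity is toggled via the system prompt
# ("detailed thinking on" vs. "detailed thinking off"); consult the model
# card for the exact strings and any length/budget controls.
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Prove that the sum of two even integers is even."},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```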
The family is specifically tuned for deployment across heterogeneous workloads, from agentic reasoning in autonomous tools and IT automation, to open-ended multilingual dialog, code synthesis, and long-form retrieval-augmented queries over million-token context windows.
7. Significance and Research Context
Nemotron 3 consolidates a set of efficiency-centric architectural choices (hybridized state-space and attention layers, aggressive sparsity via MoE, hardware-aware quantization, and reinforcement learning at scale) that collectively redefine the throughput–accuracy–context-length Pareto frontier for open, commercially usable models. The architectural choices (e.g., minimal attention, LatentMoE, MTP) are motivated by both hardware bottlenecks (memory, communication) and AI workload trends (agentic reasoning, tool use, multi-turn long-dialog, and context persistence). The open release of not only weights and code, but also recipes and data curation pipelines, positions Nemotron 3 as a foundational reference for subsequent research into efficient, trustworthy, and extensible language agents (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025, Bercovich et al., 2 May 2025).