Nemotron 3 Nano: Hybrid MoE LLM
- Nemotron 3 Nano is a 30-billion parameter large language model that integrates Mamba-2 state-space layers, transformer self-attention, and sparse MoE FFNs for enhanced throughput and long-context capabilities.
- It employs a robust training regimen with massive pretraining, supervised fine-tuning on diverse datasets, and multi-environment reinforcement learning to boost reasoning, coding, and multimodal performance.
- Advanced quantization strategies, including NVFP4 and QAD, recover over 95% benchmark accuracy while significantly lowering memory costs and achieving up to 1.8M tokens/sec throughput per GPU.
Nemotron 3 Nano is a 30-billion-parameter LLM employing a Mixture-of-Experts (MoE) hybrid Mamba-Transformer architecture, developed by NVIDIA for high-throughput, cost-efficient inference with advanced reasoning, agentic behavior, and long-context support. As the foundational model for recent agentic, code, and multimodal systems, it sets a Pareto frontier for active parameter efficiency and hardware utilization in the open-weight LLM ecosystem (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025, Xin et al., 27 Jan 2026, NVIDIA et al., 27 Apr 2026, Reda et al., 25 Jun 2026).
1. Model Architecture and MoE Hybridization
Nemotron 3 Nano integrates Mamba-2 state-space layers with standard Transformer self-attention layers and sparse MoE feed-forward networks in an interleaved stack. Its typical backbone consists of 52 layers: 23 Mamba-2 layers, 6 attention layers (often grouped), and 23 MoE layers for a total of approximately 30–31.6 billion parameters, of which ~3.6 billion are "active" per token due to MoE sparsity (NVIDIA et al., 23 Dec 2025, Reda et al., 25 Jun 2026). The key architectural principles include:
- Mamba-2 layers: State-space models providing constant-memory $O(1)$ recurrences for long-range dependencies without resorting to rotary encodings or classic position embeddings.
- Self-attention blocks: Grouped-query attention (GQA) with, e.g., 32 query heads sharing 2 key/value heads, and a head dimension of ~128.
- MoE FFN blocks: Gating routers select the top-$k$ out of $E$ experts per token (e.g., $k=6$, $E=128$, typical for Nano 30B-A3B), using softmax gating and squared-ReLU or other custom nonlinearities.
All layer types share a hidden dimensionality in the 2–4k range (up to $4096$), with MoE expert FFNs having local dimensions up to ~16k. The MoE blocks use a GShard-style load-balancing loss with DeepSeek-inspired aux-loss-free regularization to prevent expert collapse. The layer configuration is summarized in the table:
| Layer Type | Count | Hidden Dim | Experts/Block | Active/tok |
|---|---|---|---|---|
| Transformer/self-attention | 6–8 | 4096 | N/A | Full |
| Mamba-2 SSM | 20–30 | 4096 | N/A | Full |
| MoE FFN | ~20–30 | 4096/1856 | 128 | 6 |
This hybridization enables efficient context extension and expert capacity (NVIDIA et al., 23 Dec 2025, Reda et al., 25 Jun 2026).
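The top-$k$ softmax routing described in this section can be illustrated with a minimal sketch. It is a toy, pure-Python version for a single token and not the released implementation: the function names `route_top_k` and `moe_forward` are hypothetical, and renormalising the selected gate weights so they sum to 1 is an assumption the source does not confirm.

```python
import math

def route_top_k(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalise the kept gate weights so they sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k):
    """Gate-weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # only the k selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

With 128 experts and `k=6`, only six expert FFNs run per token, which is why the active parameter count is a small fraction of the total.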
2. Training Regimen: Pretraining, SFT, and RL
The Nemotron 3 Nano training pipeline involves three distinct phases:
- Massive pretraining: Up to 25 trillion tokens, with 4T+ CommonCrawl, 2T code, over 3T new or specialized tokens (including math and synthetic agentic dialogues). The cross-entropy objective is augmented with a load-balance loss for MoE utilization (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025, Reda et al., 25 Jun 2026).
- Supervised fine-tuning (SFT): Over 18 million examples covering math, chain-of-thought (CoT), code, multilinguality, tool use, proofs, and synthetic agentic tasks. SFT data incorporate both CoT-augmented and "truncated" reasoning traces to support reasoning toggles and budget-aware control. The SFT stage involves context expansion up to 256k tokens and batch sizes of hundreds to several thousand (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025).
- Multi-environment RL: A Group Relative Policy Optimization (GRPO) or RLVR (reinforcement learning with verifiable rewards) approach using masked importance sampling, across code execution, tool use, hard math/coding, verifiable QA, and conversation. Policy rollout and optimization keep the MoE router weights frozen for stability and add regularization terms or bonuses (e.g., for chain-of-thought brevity). RLHF is layered atop RLVR via generative reward models and circular pairwise comparisons (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025).
For long-context capability, Nemotron 3 Nano employs a curriculum culminating in a "CPT" (context parallel training) phase with up to 1M-token simulated contexts and mixed 4k/512k sequences, leveraging 8-way context parallelism and 8-way expert/tensor parallel scaling.
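As a rough illustration of the GRPO step in the RL phase, the sketch below computes group-relative advantages in their commonly described form: several responses are sampled for one prompt, and each reward is standardized against the group's mean and standard deviation. This is an assumption about the general technique, not the exact recipe used for Nemotron 3 Nano; the function name and the `eps` stabilizer are hypothetical, and masked importance sampling is not modelled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat their group's average get positive advantages and the rest get negative ones, so no separate value network is needed to form a baseline.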
3. Quantization Methodologies: NVFP4 and QAD Recovery
Reducing inference cost is critical; Nemotron 3 Nano employs multi-tier quantization:
- NVFP4 quantization: A 4-bit floating-point format (1 sign bit, 2 exponent bits, 1 mantissa bit) that combines local per-block FP8 scales with a global FP32 scale, reducing memory relative to BF16 and raising throughput on NVIDIA Blackwell NVFP4 cores. Quantization is first applied post-training (PTQ) to all weights except a few stabilizing attention/Mamba layers, which are kept in BF16. Activations are quantized layerwise at runtime (Xin et al., 27 Jan 2026).
- Quantization-Aware Distillation (QAD): To recover accuracy lost in PTQ and avoid instability in quantization-aware training (QAT), QAD distills a full precision teacher into the NVFP4-quantized student using KL divergence between softmax outputs:
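$$\mathcal{L}_{\mathrm{QAD}} = D_{\mathrm{KL}}\big(p_{\mathrm{teacher}}(\cdot \mid x)\,\|\,p_{\mathrm{student}}(\cdot \mid x)\big)$$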
Only soft logits from the teacher are required. QAD runs as a single post-training stage on 3B SFT/RL tokens and is robust to data quality and source. It recovers over 95% of BF16 benchmark accuracy and outperforms QAT, especially on RL-heavy skills (Xin et al., 27 Jan 2026).
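The distillation objective can be sketched for a single token position as follows. This is a minimal illustration assuming the forward direction KL(teacher ‖ student) over softmax outputs; the function names are hypothetical, and a real training loop would average this over positions and backpropagate through the quantized student only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_teacher_student(teacher_logits, student_logits):
    """KL(p_teacher || p_student) for one token position."""
    log_p = log_softmax(teacher_logits)  # full-precision teacher
    log_q = log_softmax(student_logits)  # quantized student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge, so no ground-truth labels are needed.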
| Variant | AA-LCR | AIME25 | GPQA-D | LiveCode-v5 | SciCode |
|---|---|---|---|---|---|
| BF16 | 35.9 | 89.1 | 73.0 | 72.1 | 33.0 |
| NVFP4 PTQ | 31.3 | 85.0 | 71.6 | 68.9 | 30.5 |
| QAD (SFT+RL mix) | 34.3 | 87.9 | 72.7 | 68.9 | 32.3 |
| QAT | 24.8 | 83.3 | 66.0 | 62.0 | 25.8 |
QAD is thus a best practice for quantized post-training in large hybrid models (Xin et al., 27 Jan 2026).
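To make the NVFP4 format from this section concrete, the sketch below simulates block-scaled 4-bit quantization in plain Python. A format with 1 sign bit, 2 exponent bits, and 1 mantissa bit can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6. Several things here are assumptions rather than statements from the source: the 16-element block size, choosing the block scale as `amax / 6`, and keeping that scale as an ordinary Python float instead of rounding it to FP8. The global FP32 scale is omitted, and the function names are hypothetical.

```python
import math

# Magnitudes representable with 2 exponent bits and 1 mantissa bit (E2M1).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed block size; not stated in the text above

def quantize_block(values):
    """Return (scale, codes): one scale per block plus signed E2M1 values."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest code, 6.0.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        codes.append(math.copysign(mag, v))
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block scale and its codes."""
    return [scale * c for c in codes]
```

Because the representable values are spaced unevenly, the rounding error is largest for magnitudes between 4 and 6 times the block scale; this is the kind of error that PTQ leaves in place and QAD is trained to compensate for.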
4. Performance Benchmarks and Efficiency Analysis
Nemotron 3 Nano achieves best-in-class open-model performance on reasoning, coding, math, and long-context benchmarks:
- MMLU-Pro (5-shot): 78.3%
- AIME25: 89.06%
- LiveCodeBench: 68.25%
- MiniF2F pass@1: 50.03%
- RULER-100 @1M tokens: 86.34%
MoE sparsity and hybridization yield a 3.3×–3.8× real throughput gain versus comparable open models (e.g., GPT-OSS-20B, Qwen3-30B-A3B), with context length scaling up to 1M tokens (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025). Under NVFP4, memory and compute costs are a fraction of those of FP16/FP8 models. Inference on Blackwell GB300 achieves up to 1.8M tokens/sec per GPU at NVFP4, with total weight storage 7GB for 1M-token contexts (NVIDIA et al., 24 Dec 2025).
5. Advanced Features: Long-Context, Multimodality, and Parallel Generation
Nemotron 3 Nano supports features that distinguish it from previous large models:
- Long-context handling: Rotary position encodings are eliminated, and Mamba state-space recurrences enable native extrapolation to 1M-token contexts. The KV cache grows only with the handful of attention layers, not with total model depth.
- Multimodal expansion: Nemotron 3 Nano Omni, based on the Nano 30B-A3B backbone, integrates vision (ViT), audio (FastConformer), and video into a unified context using patch/fusion, Conv3D temporal compression, and Efficient Video Sampling (EVS) to reduce memory and attention costs. Quantized NVFP4 deployment reduces memory relative to the original checkpoint with minimal accuracy loss, and real-world throughput exceeds that of comparably sized models on key multimodal tasks (NVIDIA et al., 27 Apr 2026).
- Diffusion language modeling: As the frozen context tower in Nemotron-TwoTower, Nano 30B-A3B supports blockwise masked denoising, delivering higher generation throughput in exchange for a modest accuracy drop (Reda et al., 25 Jun 2026).
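The long-context point above, that KV-cache memory scales with the attention layers only, can be checked with simple arithmetic. The sketch below uses the figures quoted earlier in this article (6 attention layers, 2 key/value heads, head dimension 128) and assumes a BF16 cache at 2 bytes per element for a single sequence. The function name is hypothetical, and the constant-size Mamba-2 state is not counted.

```python
def kv_cache_bytes(seq_len, n_attn_layers=6, n_kv_heads=2,
                   head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache for one sequence; only attention layers contribute."""
    # Factor 2 covers keys and values, which are cached separately.
    per_token = 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len
```

Under these assumptions the cache costs 6,144 bytes per token, or about 6.1 GB at 1,000,000 tokens. A hypothetical stack in which all 52 layers were attention layers would need 53,248 bytes per token, roughly 8.7 times more.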
6. Use Cases, Model Releases, and Ecosystem
Nemotron 3 Nano is released under a commercially permissive license, enabling its use in research, industry, and collaborative environments (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025). It is aimed at cost-sensitive large-scale inference, edge/low-memory deployment, agentic research, and bulk long-context applications. Its weights, training code (NeMo, Megatron-LM), and RL+SFT data are available via NVIDIA and Hugging Face releases.
Notable ecosystem developments include:
- Improved general reasoning after SFT and RL, with dynamic control over reasoning length, up to disabling reasoning entirely, via token-budget control.
- Open multimodal checkpoints (Nano Omni) supporting text, vision, audio, and video, with public recipes for SFT and RL.
- Benchmark results and sample code for quantized inference and distributed sharded serving on Blackwell, Hopper, and Ampere GPUs.
7. Comparative Context and Derivative Models
Nemotron 3 Nano sits alongside a range of related and derivative models:
- Llama-Nemotron family: Efficient reasoning transformers in the 8B, 49B, and 253B parameter classes, integrating post-training and reasoning toggles (Bercovich et al., 2 May 2025).
- Nemotron 3 Super/Ultra: Larger counterparts employing LatentMoE, more extensive RL pipelines, and MTP layers for ultra-fast text generation (NVIDIA et al., 24 Dec 2025).
- TwoTower diffusion modeling: Establishing the utility of frozen backbone + lightweight denoiser towers for high-throughput, high-fidelity text generation (Reda et al., 25 Jun 2026).
Nemotron 3 Nano thus delineates the current design space for high-efficiency, long-context, RL-robust, quantized LLMs in the open research landscape.