NVIDIA Nemotron 3: Open-Efficient LLMs
- NVIDIA Nemotron 3 is a family of open-weight large language models in Nano, Super, and Ultra variants, targeting cost efficiency, collaborative agent workloads, and state-of-the-art reasoning respectively, with context windows of up to 1M tokens.
- The models employ a hybrid Mamba–Transformer Mixture-of-Experts architecture with innovations like LatentMoE and NVFP4 hardware-aware quantization, ensuring high throughput and scalability.
- They are trained via multi-environment reinforcement learning and multi-token prediction strategies, yielding enhanced reasoning control, accurate tool integration, and competitive performance benchmarks.
NVIDIA Nemotron 3 denotes a family of open-weight LLMs designed for efficient agentic reasoning, conversational ability, and multi-step tool integration across extreme context lengths. Comprising Nano (released), Super, and Ultra variants, these models employ a hybrid Mixture-of-Experts (MoE) Mamba–Transformer backbone for sparse scaling, best-in-class throughput, and a context window up to one million tokens. They introduce hardware-aware quantization (NVFP4), novel expert-routing methods (LatentMoE), and multi-environment reinforcement learning to enable flexible and granular reasoning budget control (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025).
1. Model Family and Configurations
The Nemotron 3 lineup consists of three models optimized for different throughput and inference demands, each realized as a hybrid Mamba–Transformer MoE:
| Variant | Total Params | Active Params/token | Target Use Case |
|---|---|---|---|
| Nano | ~30B | ~3B | Cost-efficient inference, chatbots, embedded agents |
| Super | ~73B | ~8B | Collaborative agents, high-volume IT/automation |
| Ultra | ≫73B | ≫8B | State-of-the-art reasoning, unconstrained multi-step use |
All models share core architectural advances and support a 1M-token context. Each model is post-trained with multi-environment RL to support granular control over reasoning token budgets and enhanced tool integration (NVIDIA et al., 24 Dec 2025).
2. Hybrid Mamba–Transformer MoE Architecture
The core architectural innovation of Nemotron 3 is a hybrid block design that interleaves linear-time Mamba-2 state-space layers, sparse Mixture-of-Experts (MoE) multilayer perceptrons, and select sparse self-attention layers with grouped-query attention (GQA):
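- Layer pattern (Nano example):
  - Mamba-2 block → MoE block (sparse FFN over $E$ experts) → Mamba-2 block → [repeat]
  - Several sparse self-attention layers inserted intermittently, using 2 KV heads for memory efficiency
- MoE gating/selection: For input $x$, a gating network produces scores $g(x) \in \mathbb{R}^{E}$ over the $E$ experts. The top-$k$ experts $\mathcal{T}_k(x)$ are selected and their outputs aggregated as
  $$y = \sum_{i \in \mathcal{T}_k(x)} g_i(x)\, f_i(x),$$
  where $f_i$ denotes the $i$-th expert FFN. Unselected experts are bypassed.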
- Nano model specifics: 52 total layers, model dimension $d$, 32 query and 2 key/value attention heads, head size 128, with every other FFN replaced by a 128-expert MoE (top-6 experts activated per token) (NVIDIA et al., 23 Dec 2025).
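The top-$k$ gating rule in the list above can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the Nemotron 3 implementation: the softmax gate, the renormalization of the selected weights, and all names (`top_k_route`, `moe_forward`, the scaling "experts") are hypothetical choices made for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_logits(x, gate_weights):
    """One logit per expert: a plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]

def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: softmax gate, with the selected weights renormalized to sum to 1.
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, gate_weights, experts, k):
    """Sparse MoE layer: only the k selected experts are evaluated."""
    y = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits(x, gate_weights), k):
        out = experts[idx](x)
        y = [yi + weight * oi for yi, oi in zip(y, out)]
    return y

# Toy setup (hypothetical sizes): 8 experts, 2 active per token.
random.seed(0)
d, num_experts, k = 4, 8, 2
gate_weights = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_experts)]
# Each toy expert simply scales its input by a different constant.
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(num_experts)]

x = [0.5, -1.0, 0.25, 2.0]
routes = top_k_route(gate_logits(x, gate_weights), k)
y = moe_forward(x, gate_weights, experts, k)
print("selected experts and weights:", routes)
print("output:", y)
```

Only `k` of the `num_experts` expert functions are ever called, which is the source of the gap between total and active parameters in the table above.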
This design replaces most quadratic-cost Transformer attention, whose KV cache grows linearly with sequence length, with constant-memory Mamba state-space recurrences, while retaining routing and sequence-modeling fidelity via MoE and strategically placed sparse attention.
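To make the memory argument concrete, the following sketch estimates the KV-cache size of a single attention layer at a 1M-token context, comparing the 2 KV heads quoted above with a hypothetical full multi-head layout that keeps one KV head per query head (32). The BF16 cache (2 bytes per element) and the per-layer framing are assumptions; the number of attention layers is left as a parameter because it is not stated here.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_elem=2 assumes a BF16 cache (an assumption, not a source figure).
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return seq_len * n_attn_layers * per_token_per_layer

SEQ_LEN = 1_000_000   # 1M-token context
HEAD_DIM = 128        # head size quoted for Nano
N_ATTN_LAYERS = 1     # per-layer figure; multiply by the real attention-layer count

gqa = kv_cache_bytes(SEQ_LEN, N_ATTN_LAYERS, n_kv_heads=2, head_dim=HEAD_DIM)
mha = kv_cache_bytes(SEQ_LEN, N_ATTN_LAYERS, n_kv_heads=32, head_dim=HEAD_DIM)

print(f"GQA, 2 KV heads:  {gqa / 2**30:.2f} GiB per attention layer")
print(f"MHA, 32 KV heads: {mha / 2**30:.2f} GiB per attention layer")
print(f"reduction factor: {mha // gqa}x")
```

Under these assumptions each attention layer costs about 0.95 GiB of cache at 1M tokens with 2 KV heads, versus about 15.3 GiB with 32, which is why keeping attention layers few and their KV heads narrow matters at this context length.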
3. LatentMoE and Quantization
LatentMoE
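Super and Ultra integrate LatentMoE, which projects the per-token expert-routed activation from the model dimension $d$ to a lower latent dimension $\ell$ ($\ell < d$), computes MoE operations in $\mathbb{R}^{\ell}$, and projects back to $\mathbb{R}^{d}$:
- Top-$k$ routing and mixing over experts that operate in the latent space
- Memory and communication costs scale with $\ell$ rather than $d$; the savings are used to scale expert count and diversity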
Empirically, LatentMoE yields higher accuracy at the 8B-active/72B-total scale, e.g., MMLU-Pro improves from 48.3% to 52.87%, Math from 78.32% to 80.19% (see paper, Table) (NVIDIA et al., 24 Dec 2025).
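The project-down, route, project-up flow described above can be sketched as follows. This is a toy illustration only: the single-matrix "experts", the choice to compute gate scores from the full-dimension input, the softmax over the selected scores, and all names and sizes are assumptions rather than details confirmed by the source.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries scaled by 1/sqrt(cols)."""
    return [[rng.gauss(0, 1) / math.sqrt(cols) for _ in range(cols)] for _ in range(rows)]

def latent_moe(x, w_down, w_up, gate, experts, k):
    """Down-project, mix the top-k experts in latent space, then up-project."""
    z = matvec(w_down, x)                 # d -> latent
    scores = matvec(gate, x)              # assumption: gate reads the full-dim input
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i] - scores[top[0]]) for i in top]
    weights = [e / sum(exps) for e in exps]
    mixed = [0.0] * len(z)
    for i, w in zip(top, weights):
        out = matvec(experts[i], z)       # each toy expert is one latent x latent matrix
        mixed = [m + w * o for m, o in zip(mixed, out)]
    return matvec(w_up, mixed)            # latent -> d

# Hypothetical toy sizes.
rng = random.Random(0)
d, latent, num_experts, k = 8, 2, 6, 2
w_down = rand_matrix(latent, d, rng)
w_up = rand_matrix(d, latent, rng)
gate = rand_matrix(num_experts, d, rng)
experts = [rand_matrix(latent, latent, rng) for _ in range(num_experts)]

x = [rng.gauss(0, 1) for _ in range(d)]
y = latent_moe(x, w_down, w_up, gate, experts, k)
print("latent size:", len(matvec(w_down, x)), "output size:", len(y))
```

Each toy expert here holds `latent * latent` weights instead of `d * d`, which illustrates why expert memory and the activations exchanged between devices shrink with the latent dimension.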
NVFP4 Quantization
Super and Ultra are pretrained using NVFP4, a hardware-native FP4 quantization format with:
- 16-element micro-block scaling (E4M3)
- Global FP32 scale factor for range
- Random Hadamard transforms (RHTs) on weight gradients; stochastic rounding
- Select sensitive layers (QKV, attention, Mamba out) in BF16/MXFP8
This maintains ≤1% relative loss gap to BF16 and enables 3× peak throughput versus FP8 on Blackwell Ultra (NVIDIA et al., 24 Dec 2025).
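The block-scaling idea can be illustrated with a simplified quantizer. This sketch keeps only the 16-element blocks and the FP4 (E2M1) value grid. As labeled simplifications, it stores each block scale as an ordinary Python float rather than E4M3, omits the global FP32 scale and the Hadamard transforms, and uses round-to-nearest instead of stochastic rounding. Function names are hypothetical.

```python
import math

# Magnitudes representable by an E2M1 4-bit float (sign handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # micro-block size quoted above

def quantize_block(block):
    """Return (scale, codes) so that scale * code approximates each value."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))  # round to nearest
        codes.append(-mag if v < 0 else mag)
    return scale, codes

def quantize(values):
    """Quantize a flat list whose length is a multiple of BLOCK."""
    assert len(values) % BLOCK == 0
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]

def dequantize(blocks):
    """Reconstruct approximate values from (scale, codes) pairs."""
    return [scale * c for scale, codes in blocks for c in codes]

# Two blocks with different dynamic ranges.
values = [math.sin(i) * (1 + 10 * (i // BLOCK)) for i in range(2 * BLOCK)]
blocks = quantize(values)
restored = dequantize(blocks)
max_err = max(abs(a - b) for a, b in zip(values, restored))
print(f"block scales: {[round(s, 4) for s, _ in blocks]}")
print(f"max abs error: {max_err:.4f}")
```

Because each block carries its own scale, the second block's larger values do not force a coarser grid onto the first block, which is the motivation for micro-block scaling.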
4. Training Methodology and Reinforcement Learning
Nemotron 3 models undergo extensive multi-stage training:
- Pretraining: For Nano, 25T tokens, including >3T new data over Nemotron 2 (diverse web/code/math/synthetic and high-quality data). Continued pretraining extends context capability (Nano: up to 512k tokens).
- Supervised Fine-Tuning (SFT): Multi-domain datasets (competition math, code, long-context QA, proofs, multilingual content, tool use, science QA, formal proofs) with alignment for reasoning control and budget specification.
- RL from Verifiable Rewards (RLVR): Joint training across diverse environments (math, code, tool use, conversational search, JSON, structured tasks) with synchronous GRPO, masked importance sampling, and selection curriculum. RLVR alone achieves domain generalization superior to SFT (NVIDIA et al., 23 Dec 2025).
- RLHF: Generative reward models for pairwise preference learning; length-penalized circular comparison graphs to reduce verbosity without degrading accuracy (30% verbosity reduction without loss) (NVIDIA et al., 23 Dec 2025).
The RL stage optimizes a GRPO policy objective, using asynchronous rollout/learn architectures and MTP acceleration.
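As a rough illustration of the GRPO family of objectives mentioned above, the sketch below computes group-relative advantages for several sampled responses to one prompt and applies a PPO-style clipped surrogate to a single importance ratio. This is the commonly published form of GRPO, not a formula confirmed for Nemotron 3. The clip range, the epsilon, and the function names are assumptions, and the masked importance sampling mentioned above is not modeled.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO-style clipped objective for one token (to be maximized).

    ratio is pi_new(token) / pi_old(token); clip=0.2 is an illustrative value.
    """
    clipped_ratio = max(1 - clip, min(1 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Four sampled answers to one prompt, scored by a verifiable reward (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print("advantages:", [round(a, 3) for a in advantages])
print("surrogate at ratio 1.5:", clipped_surrogate(1.5, advantages[0]))
```

No learned value function is needed: the group itself serves as the baseline, which suits verifiable-reward environments where many samples per prompt are cheap to score.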
5. Context Length, Inference, and Reasoning Budget Control
Models support up to 1M-token sequence lengths:
- No rotary position embeddings (RoPE), which avoids out-of-distribution degradation at positions beyond the training length
- Contextual pretraining to 512k tokens; SFT to 256k; RL environments to 32k
- For 1M-token code sequences, per-token NLL declines steadily (see Figure), confirming effective long-context utilization (NVIDIA et al., 24 Dec 2025)
Inference offers user-specified reasoning ("thinking") token budgets ($B$ tokens), with a special token marking budget completion. Varying $B$ yields controllable tradeoffs between accuracy and latency at inference time.
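One way to picture budget enforcement is a decoding loop that force-inserts the end-of-thinking marker once the budget is exhausted. The sketch below is a client-side toy under stated assumptions: the marker string `</think>`, the `<eos>` token, the `next_token` callback interface, and the toy model are all hypothetical and not taken from the Nemotron 3 release.

```python
END_THINK = "</think>"  # hypothetical end-of-thinking marker
EOS = "<eos>"           # hypothetical end-of-sequence token

def generate_with_budget(next_token, prompt, budget, max_answer_tokens=16):
    """Decode with at most `budget` thinking tokens, then decode the answer."""
    thinking = []
    while len(thinking) < budget:
        tok = next_token(list(prompt) + thinking)
        if tok == END_THINK:          # model finished reasoning on its own
            break
        thinking.append(tok)
    # Budget reached or reasoning finished: close the thinking span explicitly.
    context = list(prompt) + thinking + [END_THINK]
    answer = []
    for _ in range(max_answer_tokens):
        tok = next_token(context + answer)
        if tok == EOS:
            break
        answer.append(tok)
    return thinking, answer

def toy_model(context):
    """Stand-in for a model: 'thinks' for 10 steps unless cut off, then answers."""
    if END_THINK in context:
        return EOS if context[-1] == "42" else "42"
    steps = sum(1 for t in context if t == "step")
    return "step" if steps < 10 else END_THINK

short_think, short_answer = generate_with_budget(toy_model, ["question"], budget=4)
long_think, long_answer = generate_with_budget(toy_model, ["question"], budget=100)
print(len(short_think), short_answer)
print(len(long_think), long_answer)
```

With a budget of 4 the toy model is cut off after four reasoning tokens, while a budget of 100 lets it stop on its own after ten, so latency scales with the budget only when the budget is the binding constraint.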
6. Multi-Token Prediction and Throughput
Multi-Token Prediction (MTP) layers (Super, Ultra) predict several future tokens at each position rather than only the next token.
This yields approximately 2.4% absolute accuracy improvement across benchmarks and up to 3× end-to-end speedups for long-form text generation. Speculative decoding accepts 97% of the first two MTP tokens at batch size 1 (NVIDIA et al., 24 Dec 2025).
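The speculative-decoding speedup comes from verifying several drafted tokens in one pass of the main model. The sketch below shows the standard greedy acceptance rule, assuming the draft tokens come from MTP heads and that `verifier_tokens` holds the main model's own greedy choice at each drafted position plus one extra position. The function name and token values are hypothetical, and sampling-based acceptance is not covered.

```python
def verify_draft(draft_tokens, verifier_tokens):
    """Greedy speculative-decoding acceptance.

    draft_tokens:    tokens proposed cheaply (e.g. by MTP heads).
    verifier_tokens: the main model's greedy token at each draft position,
                     plus one more token after the last draft position.
    Returns the tokens committed in this step.
    """
    assert len(verifier_tokens) == len(draft_tokens) + 1
    committed = []
    for draft, target in zip(draft_tokens, verifier_tokens):
        if draft != target:
            committed.append(target)   # first mismatch: take the main model's token
            return committed
        committed.append(draft)        # match: the drafted token is accepted
    committed.append(verifier_tokens[-1])  # all accepted: one bonus token for free
    return committed

print(verify_draft(["a", "b"], ["a", "b", "c"]))  # both drafts accepted
print(verify_draft(["a", "x"], ["a", "b", "c"]))  # second draft rejected
print(verify_draft(["z", "b"], ["a", "b", "c"]))  # first draft rejected
```

Every step commits at least one token that matches what the main model would have produced greedily, so output quality is unchanged and a high acceptance rate translates directly into fewer main-model passes per generated token.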
Nano achieves 3.3× higher throughput than Qwen3-30B MoE on 8k–16k sequence lengths, with further increases at longer contexts. Throughput is measured using vLLM and TensorRT-LLM, with quantized inference (FP8 for Nemotron and Qwen3, mxfp4+bfloat16 for GPT-OSS) (NVIDIA et al., 23 Dec 2025).
7. Capabilities, Performance, and Open Release
Nemotron 3 models exhibit competitive or state-of-the-art results across MMLU, code, math, commonsense reasoning, retrieval (RULER), and reasoning-on/off control. Post-trained Nano, compared to Qwen3-30B-Thinking and GPT-OSS-20B, shows:
- Greater reasoning and tool-use accuracy (e.g., IFBench 71.51 vs 51.00/65.00, RULER-100 at 86%+ for 1M tokens)
- Superior cost efficiency (normalized compute cost to reach targets lower than prior models)
- Lower hallucination rates with DPO (from 8.3% to 0.7% on GPQA)
- Open-weight and recipe release for Nano; forthcoming for Super/Ultra, including ~10T tokens of redistributed data and open NeMo training pipelines under Apache 2.0 (NVIDIA et al., 24 Dec 2025, NVIDIA et al., 23 Dec 2025)
The primary limitations are minor token-level sensitivity, residual out-of-distribution biases and hallucinations, and multilingual reasoning that is slightly weaker than English-only reasoning.
References:
- (NVIDIA et al., 24 Dec 2025) "NVIDIA Nemotron 3: Efficient and Open Intelligence"
- (NVIDIA et al., 23 Dec 2025) "Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning"