
Nemotron 3 Nano: Hybrid MoE LLM

Updated 30 June 2026
  • Nemotron 3 Nano is a 30-billion parameter large language model that integrates Mamba-2 state-space layers, transformer self-attention, and sparse MoE FFNs for enhanced throughput and long-context capabilities.
  • It employs a robust training regimen with massive pretraining, supervised fine-tuning on diverse datasets, and multi-environment reinforcement learning to boost reasoning, coding, and multimodal performance.
  • Advanced quantization strategies, including NVFP4 and QAD, recover over 95% benchmark accuracy while significantly lowering memory costs and achieving up to 1.8M tokens/sec throughput per GPU.

Nemotron 3 Nano is a 30-billion-parameter LLM employing a Mixture-of-Experts (MoE) hybrid Mamba-Transformer architecture, developed by NVIDIA for high-throughput, cost-efficient inference with advanced reasoning, agentic behavior, and long-context support. As the foundational model for recent agentic, code, and multimodal systems, it sets a Pareto frontier for active parameter efficiency and hardware utilization in the open-weight LLM ecosystem (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025, Xin et al., 27 Jan 2026, NVIDIA et al., 27 Apr 2026, Reda et al., 25 Jun 2026).

1. Model Architecture and MoE Hybridization

Nemotron 3 Nano integrates Mamba-2 state-space layers with standard Transformer self-attention layers and sparse MoE feed-forward networks in an interleaved stack. Its typical backbone consists of 52 layers: 23 Mamba-2 layers, 6 attention layers (often grouped), and 23 MoE layers for a total of approximately 30–31.6 billion parameters, of which ~3.6 billion are "active" per token due to MoE sparsity (NVIDIA et al., 23 Dec 2025, Reda et al., 25 Jun 2026). The key architectural principles include:

  • Mamba-2 layers: State-space models providing constant-memory $O(dL)$ recurrences for long-range dependencies without resorting to rotary encodings or classic position embeddings.
  • Self-attention blocks: Grouped-query attention (GQA) with, e.g., 32 query heads sharing 2 key/value heads and a head dimension of ~128.
  • MoE FFN blocks: Gating routers select the top-$k$ of $E$ experts per token (e.g., $k=6$, $E=128$ is typical for Nano 30B-A3B), using softmax gating and squared-ReLU or other custom nonlinearities.
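The routing step in the MoE bullet can be illustrated with a small sketch. It assumes softmax gating followed by renormalisation over the selected experts and a two-matrix squared-ReLU expert; the toy dimensions, the renormalisation step, and every function name here are illustrative assumptions, not details taken from the Nemotron 3 Nano implementation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k most probable experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)  # renormalise over the selected experts (assumption)
    return [(e, probs[e] / norm) for e in top]

def squared_relu(v):
    return max(v, 0.0) ** 2

def expert_ffn(x, w_in, w_out):
    """Toy expert: d -> h with squared ReLU, then h -> d."""
    hidden = [squared_relu(sum(xi * w for xi, w in zip(x, col))) for col in w_in]
    return [sum(hj * w for hj, w in zip(hidden, row)) for row in w_out]

def moe_layer(x, router, experts, k):
    """Mix the outputs of the k routed experts for a single token vector x."""
    logits = [sum(xi * w for xi, w in zip(x, r)) for r in router]
    selected = route_top_k(logits, k)
    out = [0.0] * len(x)
    for e, weight in selected:
        y = expert_ffn(x, *experts[e])
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, selected

random.seed(0)
d, h, E, k = 8, 16, 128, 6  # k and E follow the text; d and h are toy sizes

def rand_vec(n):
    return [random.gauss(0.0, 0.3) for _ in range(n)]

router = [rand_vec(d) for _ in range(E)]
experts = [([rand_vec(d) for _ in range(h)], [rand_vec(h) for _ in range(d)])
           for _ in range(E)]
x = rand_vec(d)
y, selected = moe_layer(x, router, experts, k)
print(sorted(e for e, _ in selected))
```

Only 6 of the 128 experts run for each token, which is why the active parameter count per token is far smaller than the total.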

All layer types share a hidden dimensionality $d$ in the 2–4k range (e.g., $d=2688$–$4096$), with MoE expert FFNs having local dimensions up to ~16k. The MoE blocks use GShard-style load balancing loss with DeepSeek-inspired aux-loss-free regularization to prevent expert collapse. The layer configuration is summarized in the table:

| Layer Type | Count | Hidden Dim | Experts/Block | Active per Token |
|---|---|---|---|---|
| Transformer/self-attention | 6–8 | 4096 | N/A | Full |
| Mamba-2 SSM | 20–30 | 4096 | N/A | Full |
| MoE FFN | ~20–30 | 4096/1856 | 128 | 6 |

This hybridization enables efficient context extension and expert capacity (NVIDIA et al., 23 Dec 2025, Reda et al., 25 Jun 2026).

2. Training Regimen: Pretraining, SFT, and RL

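The Nemotron 3 Nano training pipeline involves three distinct phases: large-scale pretraining, supervised fine-tuning (SFT) on diverse datasets, and multi-environment reinforcement learning (RL).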

For long-context capability, Nemotron 3 Nano employs a curriculum culminating in a "CPT" (context parallel training) phase with up to 1M-token simulated contexts and mixed 4k/512k sequences, leveraging 8-way context parallelism and 8-way expert/tensor parallel scaling.

3. Quantization Methodologies: NVFP4 and QAD Recovery

Reducing inference cost is critical; Nemotron 3 Nano employs multi-tier quantization:

  • NVFP4 Quantization: A 4-bit floating-point format (1 sign bit, 2 exponent bits, 1 mantissa bit) that uses local per-block FP8 scales ($s_{\rm E4M3}$) and a global FP32 scale ($s_{\rm FP32}$), reducing memory footprint relative to BF16 and raising throughput on NVIDIA Blackwell NVFP4 cores. Quantization is first applied post-training (PTQ) to all weights except a few stabilizing attention/Mamba layers, which are kept in BF16. Activations are quantized layerwise at runtime (Xin et al., 27 Jan 2026).
  • Quantization-Aware Distillation (QAD): To recover accuracy lost in PTQ and avoid instability in quantization-aware training (QAT), QAD distills a full precision teacher into the NVFP4-quantized student using KL divergence between softmax outputs:

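$$\mathcal{L}_{\rm QAD} = D_{\rm KL}\big(p_{\rm teacher}(\cdot \mid x) \,\|\, p_{\rm student}(\cdot \mid x)\big)$$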

Only soft logits from the teacher are required, and QAD runs as a single post-training stage on SFT/RL tokens, robust to data quality and source. QAD recovers over 95% of BF16 benchmark accuracy and yields superior performance to QAT, especially for RL-heavy skills (Xin et al., 27 Jan 2026).
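The distillation objective described above, a KL divergence between teacher and student softmax outputs, can be sketched in a few lines. This is a minimal illustration that assumes the forward direction KL(teacher ‖ student) averaged over token positions; the function names and toy logits are hypothetical and are not taken from the cited work.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def qad_kl_loss(teacher_logits, student_logits):
    """Mean over positions of KL(p_teacher || p_student) (direction is an assumption)."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_pt = log_softmax(t_row)
        log_ps = log_softmax(s_row)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_pt, log_ps))
    return total / len(teacher_logits)

# Toy logits: two token positions, vocabulary of three.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # full-precision teacher
student = [[1.8, 0.7, -0.9], [0.0, 0.4, 2.5]]  # quantized student
loss_same = qad_kl_loss(teacher, teacher)
loss_diff = qad_kl_loss(teacher, student)
print(loss_same, loss_diff)
```

The loss is zero when the student reproduces the teacher distribution exactly and positive otherwise, so no ground-truth labels are needed.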

| Variant | AA-LCR | AIME25 | GPQA-D | LiveCode-v5 | SciCode |
|---|---|---|---|---|---|
| BF16 | 35.9 | 89.1 | 73.0 | 72.1 | 33.0 |
| NVFP4 PTQ | 31.3 | 85.0 | 71.6 | 68.9 | 30.5 |
| QAD (SFT+RL mix) | 34.3 | 87.9 | 72.7 | 68.9 | 32.3 |
| QAT | 24.8 | 83.3 | 66.0 | 62.0 | 25.8 |
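As a quick check on the table, the fraction of BF16 accuracy retained by each variant can be computed directly from the listed scores. The snippet below uses only the numbers in the table; the variable and function names are illustrative.

```python
benchmarks = ["AA-LCR", "AIME25", "GPQA-D", "LiveCode-v5", "SciCode"]
bf16 = [35.9, 89.1, 73.0, 72.1, 33.0]
variants = {
    "NVFP4 PTQ": [31.3, 85.0, 71.6, 68.9, 30.5],
    "QAD (SFT+RL mix)": [34.3, 87.9, 72.7, 68.9, 32.3],
    "QAT": [24.8, 83.3, 66.0, 62.0, 25.8],
}

def recovery_percent(scores):
    """Per-benchmark score as a percentage of the BF16 baseline."""
    return [100.0 * s / b for s, b in zip(scores, bf16)]

recovery = {name: recovery_percent(scores) for name, scores in variants.items()}
for name, values in recovery.items():
    row = ", ".join(f"{b}: {v:.1f}%" for b, v in zip(benchmarks, values))
    print(f"{name}: {row}")
```

On these figures, QAD retains more than 95% of the BF16 score on every listed benchmark, while PTQ and QAT both fall below 90% on AA-LCR.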

QAD thus emerges as a best practice for quantized post-training of large hybrid models (Xin et al., 27 Jan 2026).
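The NVFP4 format discussed in this section can be illustrated with a simulated ("fake") quantizer. This is a sketch under stated assumptions: the block size of 16 is not given in the text, the per-block scale is kept as an ordinary float rather than being rounded to FP8 E4M3, and the global FP32 scale is omitted. The representable magnitudes follow from the 2-bit exponent and 1-bit mantissa layout.

```python
import math
import random

# Magnitudes representable with 2 exponent bits and 1 mantissa bit (E2M1).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block(block):
    """Return (scale, codes): codes are signed grid values, scale maps amax to 6."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # a real kernel would round this to FP8 E4M3
    codes = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes

def fake_quantize(values, block_size=16):
    """Quantize then dequantize, block by block (block_size is an assumption)."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(c * scale for c in codes)
    return out

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(32)]
w_q = fake_quantize(w)
max_err = max(abs(a - b) for a, b in zip(w, w_q))
print(max_err)
```

Because the widest gap in the grid is between 4 and 6, the rounding error in each block is at most one sixth of that block's largest magnitude, which is why per-block scaling matters for accuracy.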

4. Performance Benchmarks and Efficiency Analysis

Nemotron 3 Nano achieves best-in-class open-model performance on reasoning, coding, math, and long-context benchmarks:

  • MMLU-Pro (5-shot): 78.3%
  • AIME25: 89.06%
  • LiveCodeBench: 68.25%
  • MiniF2F pass@1: 50.03%
  • RULER-100 @1M tokens: 86.34%

MoE and hybridization yield a 3.3×–3.8× real throughput gain versus comparably sized open models (e.g., GPT-OSS-20B, Qwen3-30B-A3B), with context length scaling up to 1M tokens (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025). Under NVFP4, memory and compute costs are a fraction of those of FP16/FP8 models. Inference on Blackwell GB300 achieves up to 1.8M tokens/sec per GPU at NVFP4, with a correspondingly reduced weight-storage footprint for 1M-token contexts (NVIDIA et al., 24 Dec 2025).

5. Advanced Features: Long-Context, Multimodality, and Parallel Generation

Nemotron 3 Nano supports features that distinguish it from previous large models:

  • Long-context handling: Rotary position encodings are eliminated, and Mamba state-space recurrences enable native extrapolation to 1M-token contexts. KV cache and memory scale only with the handful of attention layers, not model depth.
  • Multimodal expansion: Nemotron 3 Nano Omni, based on the Nano 30B-A3B backbone, integrates vision (ViT), audio (FastConformer), and video into a unified context using patch/fusion, Conv3D temporal compression, and Efficient Video Sampling (EVS) to reduce memory and attention costs. Quantized NVFP4 deployment reduces memory relative to the original checkpoint with limited accuracy loss, and real-world throughput is reported to exceed that of comparably sized models on key multimodal tasks (NVIDIA et al., 27 Apr 2026).
  • Diffusion language modeling: As the frozen context tower in Nemotron-TwoTower, Nano 30B-A3B supports blockwise masked denoising, delivering higher generation throughput in exchange for a small accuracy drop (Reda et al., 25 Jun 2026).
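The claim that KV cache scales only with the attention layers can be made concrete with some arithmetic. The sketch below uses the figures quoted earlier in this article (6 attention layers, 2 key/value heads, head dimension ~128) and assumes BF16 cache entries at 2 bytes each. The all-attention 52-layer stack is a hypothetical comparison point, and the fixed-size Mamba state is ignored.

```python
def kv_cache_bytes(attention_layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes for keys plus values across all attention layers for one sequence."""
    return 2 * attention_layers * kv_heads * head_dim * tokens * bytes_per_value

TOKENS = 1_000_000

# Hybrid stack: only the 6 attention layers hold a KV cache.
hybrid = kv_cache_bytes(attention_layers=6, kv_heads=2, head_dim=128, tokens=TOKENS)

# Hypothetical baseline: all 52 layers are attention layers with the same GQA shape.
all_attention = kv_cache_bytes(attention_layers=52, kv_heads=2, head_dim=128, tokens=TOKENS)

print(f"hybrid:        {hybrid / 1e9:.2f} GB")
print(f"all-attention: {all_attention / 1e9:.2f} GB")
```

Under these assumptions, the cache for a 1M-token sequence is about 6 GB for the hybrid layout against roughly 53 GB for the hypothetical all-attention stack.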

6. Use Cases, Model Releases, and Ecosystem

Nemotron 3 Nano is released under a commercially permissive license, enabling its use in research, industry, and collaborative environments (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 24 Dec 2025). It is aimed at cost-sensitive large-scale inference, edge/low-memory deployment, agentic research, and bulk long-context applications. Its weights, training code (NeMo, Megatron-LM), and RL+SFT data are available via NVIDIA and Hugging Face releases.

Notable ecosystem developments include:

  • Improved general reasoning after SFT/RL, with dynamic control of reasoning depth, including fully collapsed reasoning, via token budget control.
  • Open multimodal checkpoints (Nano Omni) supporting text, vision, audio, and video, with public recipes for SFT and RL.
  • Benchmark results and sample code for quantized inference and distributed sharded serving on Blackwell, Hopper, and Ampere GPUs.

7. Comparative Context and Derivative Models

Nemotron 3 Nano sits alongside a range of derivative and adjacent models:

  • Llama-Nemotron family: Earlier efficient reasoning transformers spanning several billion-parameter size classes, integrating post-training and reasoning toggles (Bercovich et al., 2 May 2025).
  • Nemotron 3 Super/Ultra: Larger counterparts employing LatentMoE, more extensive RL pipelines, and MTP layers for ultra-fast text generation (NVIDIA et al., 24 Dec 2025).
  • TwoTower diffusion modeling: Establishing the utility of frozen backbone + lightweight denoiser towers for high-throughput, high-fidelity text generation (Reda et al., 25 Jun 2026).

Nemotron 3 Nano thus delineates the current design space for high-efficiency, long-context, RL-robust, quantized LLMs in the open research landscape.
