
Nemotron Nano 2: Hybrid Mamba-Transformer

Updated 22 November 2025
  • Nemotron Nano 2 is a series of compact, high-throughput language and vision-language models that integrate Transformer architecture with Mamba-2 state-space layers.
  • It leverages aggressive FP8 quantization, knowledge distillation, and parameter pruning to reduce memory usage while maintaining SoTA reasoning performance.
  • The design supports dynamic reasoning control, zero-shot submodel extraction, and efficient deployment across diverse hardware budgets.

Nemotron Nano 2 denotes a series of compact, high-throughput, reasoning-oriented language and vision-language models based on hybrid Mamba-Transformer backbones, developed by NVIDIA and collaborators. The models are engineered for computational efficiency and state-of-the-art (SoTA) reasoning accuracy within strict memory and hardware footprints, targeting both text and multimodal domains. Key distinguishing features include extensive adoption of Mamba-2 state-space layers, aggressive FP8 quantization, advanced knowledge distillation, highly structured model compression, and novel support for reasoning control and deployment flexibility across a unified weight space.

1. Hybrid Mamba–Transformer Model Architecture

The core innovation of Nemotron Nano 2 lies in its hybridization of the Transformer architecture with Mamba-2 state-space model (SSM) layers. In the canonical Nano 2 backbone, approximately 92% of standard self-attention layers are replaced by Mamba-2 blocks, resulting in a structure exemplified by 62 layers: 6 multi-head self-attention, 28 Mamba-2, and 28 feed-forward layers (NVIDIA et al., 20 Aug 2025). The Mamba-2 layer architecture partitions the hidden state (e.g., $d_m = 5120$) into grouped subspaces, each processed by parallel, causal 1D convolutions and SSM modules. For a group $g$:

$$U_g = X W_{x,g}^{\top} \in \mathbb{R}^{B \times S \times d_h}, \quad Z_g = X W_{z,g}^{\top}, \quad C_g = \mathrm{Conv1D}(U_g; w_{dt}) + \mathrm{SSM}_g(Z_g; W_{B,g}, W_{C,g}, W_{dt,g})$$

A gated RMSNorm (GateNorm) modulates the SSM output, maintaining long-range information propagation while greatly reducing the $O(S^2)$ complexity of conventional attention.
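
A minimal PyTorch sketch of one grouped Mamba-2-style block, following the equations above, is shown below. The module name `MambaGroup`, the SiLU gating, the diagonal sequential scan, and the default `d_state`/kernel sizes are illustrative assumptions, not NVIDIA's optimized implementation (which relies on fused kernels).

```python
# Schematic sketch of one grouped Mamba-2-style block (assumes PyTorch >= 2.4 for nn.RMSNorm).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaGroup(nn.Module):
    def __init__(self, d_model: int, d_head: int, d_state: int = 16, conv_k: int = 4):
        super().__init__()
        self.W_x = nn.Linear(d_model, d_head, bias=False)     # U_g = X W_{x,g}^T
        self.W_z = nn.Linear(d_model, d_head, bias=False)     # Z_g = X W_{z,g}^T
        self.conv = nn.Conv1d(d_head, d_head, conv_k,
                              groups=d_head, padding=conv_k - 1)  # causal depthwise Conv1D
        self.W_B = nn.Linear(d_head, d_state, bias=False)     # input -> state projection
        self.W_C = nn.Linear(d_head, d_state, bias=False)     # state -> output projection
        self.W_dt = nn.Linear(d_head, 1, bias=True)           # per-token step size
        self.A_log = nn.Parameter(torch.zeros(d_head))        # per-channel decay rate
        self.gate_norm = nn.RMSNorm(d_head)                   # "GateNorm" on the SSM output

    def ssm(self, z: torch.Tensor) -> torch.Tensor:
        # Minimal diagonal selective scan: h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * z_t.
        B_, S, d_h = z.shape
        dt = F.softplus(self.W_dt(z))                         # (B, S, 1)
        A = -torch.exp(self.A_log)                            # (d_h,) negative decay
        Bt, Ct = self.W_B(z), self.W_C(z)                     # (B, S, d_state)
        h = z.new_zeros(B_, d_h, Bt.shape[-1])                # (B, d_h, d_state)
        out = []
        for t in range(S):
            decay = torch.exp(dt[:, t] * A).unsqueeze(-1)     # (B, d_h, 1)
            h = decay * h + (dt[:, t] * z[:, t]).unsqueeze(-1) * Bt[:, t].unsqueeze(1)
            out.append(torch.einsum("bds,bs->bd", h, Ct[:, t]))
        return torch.stack(out, dim=1)                        # (B, S, d_h)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u, z = self.W_x(x), self.W_z(x)                       # grouped subspace projections
        conv_out = self.conv(u.transpose(1, 2))[..., : u.shape[1]].transpose(1, 2)
        c = conv_out + self.ssm(z)                            # C_g = Conv1D(U_g) + SSM_g(Z_g)
        return self.gate_norm(c) * F.silu(z)                  # gate-modulated output

x = torch.randn(2, 32, 5120)                                  # (batch, seq, d_m)
print(MambaGroup(d_model=5120, d_head=128)(x).shape)          # torch.Size([2, 32, 128])
```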

The backbone enables scaling to very long context lengths (tested up to 128k tokens on A10G cards with 22 GiB of memory) with minimal key-value cache overhead. Architectural pruning and parameter reduction are guided by empirical importance metrics at the layer, neuron (for feed-forward networks), and head (for Mamba groups) levels. FP8 quantization (E4M3) is utilized throughout the majority of training, halving memory consumption compared to bfloat16, with no observed instability (NVIDIA et al., 20 Aug 2025).
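
For intuition on the memory claim, a back-of-the-envelope estimate of weight storage at the two precisions is sketched below; it covers parameters only (optimizer state, activations, and Mamba/KV caches are excluded) and uses nominal parameter counts.

```python
# Rough weight-memory estimate: FP8 (E4M3) stores 1 byte per parameter,
# bfloat16 stores 2, so weight storage roughly halves under FP8.
GIB = 1024**3

def weight_gib(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / GIB

for n in (9e9, 12e9):
    print(f"{n/1e9:.0f}B params: bf16 ≈ {weight_gib(n, 2):.1f} GiB, "
          f"FP8 ≈ {weight_gib(n, 1):.1f} GiB")
# 9B params:  bf16 ≈ 16.8 GiB, FP8 ≈ 8.4 GiB
# 12B params: bf16 ≈ 22.4 GiB, FP8 ≈ 11.2 GiB
```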

2. Training and Compression Pipeline

Nemotron Nano 2 relies on a multi-stage training regime combining aggressive pretraining, alignment, and compression. The canonical sequence comprises:

  • Pretraining a 12B-parameter base model (“Nano-12B-v2-Base”) on 20T tokens in FP8 (NVIDIA et al., 20 Aug 2025).
  • Multi-stage alignment: supervised fine-tuning (SFT), direct preference optimization (DPO), group relative policy optimization (GRPO), and RLHF, including data with "think-trace" truncation for reasoning-trace calibration.
  • Minitron-style compression through importance-based pruning and distillation, reducing the backbone (e.g., from 12B to 9B parameters) while retaining ≳98% SoTA reasoning performance at vastly lower compute/memory.

Pruning is informed by measuring logit sensitivity to candidate layer removals, $\Delta_i = \mathbb{E}_X \lVert \mathrm{Logits}_{\mathrm{orig}}(X) - \mathrm{Logits}_{\mathrm{pruned}\to i}(X)\rVert^2 / |V|$. FFN neurons and Mamba heads are similarly ranked and retained to meet target memory budgets, subject to hardware constraints (e.g., 19.66 GiB usable on an A10G at 128k context). Distillation replay and checkpoint merging (e.g., 50/50 interpolation for reasoning/chat balance) finalize model tuning (NVIDIA et al., 20 Aug 2025).
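
A hedged sketch of this layer-importance scan is given below: each candidate layer is bypassed in turn and the mean-squared logit deviation per vocabulary entry is accumulated. The attribute names `model.embed`, `model.blocks`, and `model.lm_head` are assumptions for illustration, not the actual Minitron/Megatron API.

```python
# Sketch of the layer-importance score Δ_i described above (assumed module names).
import torch

@torch.no_grad()
def layer_importance(model, batches):
    scores = []
    for i in range(len(model.blocks)):
        delta, count = 0.0, 0
        for x in batches:                                # x: (B, S) token ids
            h_full = h_skip = model.embed(x)
            for j, block in enumerate(model.blocks):
                h_full = block(h_full)
                if j != i:                               # bypass candidate layer i
                    h_skip = block(h_skip)
            logits_full = model.lm_head(h_full)          # (B, S, |V|)
            logits_skip = model.lm_head(h_skip)
            # squared logit distance per token, averaged over the batch
            delta += (logits_full - logits_skip).pow(2).sum(-1).mean().item()
            count += 1
        scores.append(delta / (count * logits_full.shape[-1]))   # E_X ||·||^2 / |V|
    return scores                                        # lower Δ_i => safer to prune layer i
```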

3. Inference, Reasoning Modes, and Deployment

Nemotron Nano 2 is designed for efficient inference and supports dynamic reasoning control. The models expose a "reasoning-on/off" toggle implemented via prompt prefixes or (in larger Nemotron family variants) through dedicated router MLPs that select active subspaces at runtime (Bercovich et al., 2 May 2025).

Empirically, the 9B pruned model achieves 156.4 tokens/sec versus 24.8 for Qwen3-8B at 8k-token input / 16k-token output with batch size 8, a 6.3× throughput advantage for long-range reasoning (NVIDIA et al., 20 Aug 2025). Reasoning accuracy is state-of-the-art: e.g., 91.4% on GSM8K-CoT, 63.6% on MATH Level 5, and 82.2% on RULER@128K. Memory usage for weights and cache is tightly scoped, staying <20 GiB for a 9B model in bf16 (excluding context-dependent activation overhead).

The Llama-Nemotron Nano ("LN-Nano," 8B) variant demonstrates further reasoning efficiency by leveraging neural architecture search (NAS, via the Puzzle system), FFN fusion, and vertical compression. Dynamic mode switching is performed by inserting "detailed thinking on/off" flags in the input; no additional gating network is required (Bercovich et al., 2 May 2025).
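
The snippet below illustrates how this prompt-level toggle can be exercised with Hugging Face `transformers`; the checkpoint name and the exact flag strings are assumptions and should be checked against the official model card.

```python
# Toggling the "detailed thinking on/off" reasoning mode via the system prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"        # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             device_map="auto")

def generate(question: str, reasoning: bool) -> str:
    messages = [
        {"role": "system", "content": f"detailed thinking {'on' if reasoning else 'off'}"},
        {"role": "user", "content": question},
    ]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=1024)
    return tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

print(generate("What is 17 * 23?", reasoning=True))       # emits a reasoning trace first
print(generate("What is 17 * 23?", reasoning=False))      # answers directly
```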

4. Vision-Language and Multimodal Extensions

Nemotron Nano 2 underpins vision-language models such as Nemotron Nano V2 VL (12B), which combines a c-RADIOv2 vision encoder, a two-layer MLP connector, and the hybrid Mamba-Transformer backbone (NVIDIA et al., 6 Nov 2025). These models use both patch merging and Efficient Video Sampling (EVS) to prune redundant spatial tokens in images and videos, facilitating long-context multimodal inference at high throughput (e.g., up to 85 tokens/sec with 60% EVS in FP8).
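
A simplified sketch in the spirit of EVS is shown below: patch tokens whose embeddings change little relative to the same patch in the previous frame are dropped before reaching the language backbone. The threshold, the cosine-similarity criterion, and the tensor layout are assumptions, not the published EVS algorithm.

```python
# Temporal-redundancy pruning sketch for video patch tokens (illustrative only).
import torch

def prune_static_video_tokens(frames: torch.Tensor, threshold: float = 0.9):
    """frames: (T, N, D) patch embeddings for T frames of N patches each.
    Returns a boolean keep-mask of shape (T, N)."""
    keep = torch.ones(frames.shape[:2], dtype=torch.bool)
    sim = torch.nn.functional.cosine_similarity(frames[1:], frames[:-1], dim=-1)  # (T-1, N)
    keep[1:] = sim < threshold             # keep only patches that changed vs. the previous frame
    return keep

frames = torch.randn(8, 256, 1024)          # e.g., 8 frames, 256 patches, hidden size 1024
mask = prune_static_video_tokens(frames)
kept_tokens = frames[mask]                  # flattened (num_kept, D) tokens passed onward
print(f"kept {mask.float().mean():.0%} of video tokens")
```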

Vision–language SFT is applied in multiple stages, involving both frozen and joint training of the connector, encoder, and Mamba-Transformer decoder. The backbone retains >98% of its text-only accuracy after heavy multimodal training. Quantization-aware distillation (QAD) to FP4 further boosts inference speed with negligible accuracy loss (NVIDIA et al., 6 Nov 2025).
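
A minimal sketch of such staged freezing is given below; the attribute names `vision_encoder`, `connector`, and `decoder`, and the two-stage schedule, are assumptions for illustration (the actual staging is described in the Nemotron Nano V2 VL report).

```python
# Staged freeze/unfreeze schedule for vision-language SFT (assumed attribute names).
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(vlm, stage: int):
    if stage == 1:                          # warm up: train only the MLP connector
        set_trainable(vlm.vision_encoder, False)
        set_trainable(vlm.decoder, False)
        set_trainable(vlm.connector, True)
    else:                                   # later stages: joint training of all components
        for m in (vlm.vision_encoder, vlm.connector, vlm.decoder):
            set_trainable(m, True)
```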

5. Nested Models and Efficient Elasticity

Extending the architecture, "Nemotron Elastic" enables nesting of multiple extractable submodels (“many-in-one”) within a single 12B parent, each optimized for a different deployment budget (e.g., 6B, 9B, 12B), with zero-shot slicing at inference (Taghibakhshi et al., 20 Nov 2025). This process is managed by a trained router—conditioned via differentiable Gumbel-Softmax sampling over budget vectors for each elastic axis (embedding, Mamba, attention, FFN, depth)—producing binary masks for activation. Zero-shot submodel extraction imposes no additional fine-tuning or retraining overhead.
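
A minimal sketch of the Gumbel-Softmax budget-routing idea for a single width axis is shown below; the class name, the nested first-k masks, and the straight-through (`hard=True`) sampling are simplifying assumptions relative to the multi-axis router described in the paper.

```python
# Gumbel-Softmax budget router producing a binary width mask for one elastic axis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    def __init__(self, n_budgets: int, max_width: int, widths):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_budgets, len(widths)))  # per-budget choice logits
        # Nested masks: width w keeps the first w (importance-sorted) units.
        self.register_buffer("masks", torch.stack(
            [torch.arange(max_width) < w for w in widths]).float())      # (n_widths, max_width)

    def forward(self, budget_id: int, tau: float = 1.0) -> torch.Tensor:
        probs = F.gumbel_softmax(self.logits[budget_id], tau=tau, hard=True)  # differentiable one-hot
        return probs @ self.masks                                             # binary mask (max_width,)

router = BudgetRouter(n_budgets=3, max_width=1024, widths=[512, 768, 1024])
mask = router(budget_id=0)              # sample a width mask for budget 0 (stochastic at training time)
print(int(mask.sum()))                  # number of FFN units kept for that sample
```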

Elastic training is performed over only 110B tokens, representing a 360× reduction in training cost versus training each model independently from scratch and an approximately 7× reduction relative to SoTA compression approaches. All models reside in a shared 24 GB bf16 checkpoint (plus <1 GB of metadata), saving ≈43% of deployment memory. Submodels remain close to or surpass the baseline: e.g., on aggregate reasoning benchmarks, Elastic-6B scores 70.61 and Elastic-12B scores 77.41, compared with 75.99 for NanoV2-9B. Extended-context training yields up to +19.8% absolute accuracy improvement for smaller submodels (Taghibakhshi et al., 20 Nov 2025).

6. Model Access, Licensing, and Practical Usage

Nemotron Nano 2 and associated family members are released under the NVIDIA Open Model License, supporting broad academic and commercial use. Model checkpoints (e.g., bf16, FP8, FP4) and datasets are hosted on HuggingFace (see references (NVIDIA et al., 20 Aug 2025, NVIDIA et al., 6 Nov 2025)). The codebase supports common deep learning frameworks including NeMo, Megatron-LM, vLLM for fast inference, and context-parallel optimization for extremely long sequences.

Python code snippets for model loading, inference (including batch multimodal pipelines and EVS toggling), and quantization settings are included in the official documentation (NVIDIA et al., 6 Nov 2025). The architecture is compatible with state-of-the-art scaling strategies such as pipeline and model parallelism through DeepSpeed or HuggingFace Accelerate.
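
As an illustration, a minimal vLLM loading-and-generation example is sketched below; the model identifier, context length, and sampling settings are assumptions to be verified against the official model card and documentation.

```python
# Loading a Nemotron Nano 2 checkpoint with vLLM for batched inference (assumed model ID).
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",    # assumed HuggingFace model ID
          max_model_len=131072,                          # long-context budget (~128k tokens)
          trust_remote_code=True)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Explain the Mamba-2 state-space layer in two sentences."], params)
print(outputs[0].outputs[0].text)
```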

