Nemotron-Nano-9B-v2: Efficient Long-Context LLM
- Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer model that replaces most self-attention layers with efficient Mamba-2 state space layers for scalable long-context reasoning.
- The model employs advanced pruning and distillation techniques to reduce the parameter count from 12B to 9B while maintaining state-of-the-art accuracy and throughput.
- It demonstrates superior performance on reasoning benchmarks like GSM8K and MATH, and can handle up to 128,000-token sequences on a single NVIDIA A10G GPU.
Nemotron-Nano-9B-v2 is an open-source hybrid Mamba-Transformer LLM that combines recent advances in efficient state space models, large-scale training, and targeted model compression to serve long-context reasoning tasks with high throughput and state-of-the-art accuracy for its parameter class. Developed as part of the Nemotron-Nano family under the framework described in "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model" (NVIDIA et al., 20 Aug 2025), the model retains a small number of Transformer self-attention layers alongside Mamba-2 state space layers, and is released together with comprehensive checkpoints and datasets for open research and deployment.
1. Hybrid Mamba-Transformer Architecture
Nemotron-Nano-9B-v2 is the result of pruning and distilling a 12-billion-parameter parent model (Nemotron-Nano-12B-v2-Base) pre-trained with a hybrid architecture. The core innovation lies in the Nemotron-H hybridization scheme, where most standard Transformer self-attention layers are replaced with Mamba-2 layers, which utilize a selective state space mechanism for efficient sequence processing. This arrangement is motivated by the observation that full self-attention is computation- and memory-intensive, particularly in applications requiring “long thinking traces” for deep reasoning, whereas Mamba-2 layers enable more scalable sequence modeling.
In the 12B base model, the 62 layers are distributed as follows:
- 6 self-attention layers,
- 28 feed-forward network (FFN) layers,
- 28 Mamba-2 state space layers.
FFN layers follow the transformation $y = W_2\,\sigma(W_1 x)$, where $\sigma(z) = \mathrm{ReLU}(z)^2$ is a squared ReLU activation and $W_1, W_2$ are the learned weight matrices. Mamba-2 layers process grouped representations (eight groups per layer) with separate input projections, apply causal 1D convolutions, and incorporate selective state updates with a state dimension of 128 and head dimension of 64. This mixed structure allows the model to inherit the inductive biases and empirical performance of Transformers on reasoning tasks, while drastically reducing operational cost for long-context generation.
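As a concrete reference for these two block types, the following is a minimal PyTorch sketch; the layer widths and the commented Mamba-2 instantiation via the `mamba_ssm` package are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class SquaredReLUFFN(nn.Module):
    """FFN block of the hybrid stack: y = W2 * relu(W1 x)^2 (squared ReLU)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)) ** 2)

# The 62-layer base model interleaves 6 attention, 28 FFN, and 28 Mamba-2 blocks;
# the exact schedule is defined by the released config and is not reproduced here.
# A Mamba-2 mixer with the stated dimensions could be built with the mamba_ssm
# package (hidden size and exact kwargs are assumptions and may differ by version):
#   from mamba_ssm import Mamba2
#   mixer = Mamba2(d_model=4096, d_state=128, headdim=64, ngroups=8)
```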
2. Training Pipeline and Compression
The training regime begins with Nemotron-Nano-12B-v2-Base, pre-trained on a corpus of 20 trillion tokens using an FP8 training recipe. E4M3 quantization is applied to most tensors, block-wise for weights and tile-wise for activations; this enables efficient low-precision computation, while bfloat16 is retained for the initial and terminal matrix multiplications to stabilize training.
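To make the quantization scheme concrete, here is a simplified simulation of block-wise E4M3 casting in PyTorch; the 128×128 block size, the use of `torch.float8_e4m3fn`, and the per-block scaling are illustrative assumptions and do not reproduce NVIDIA's actual FP8 training kernels.

```python
import torch

def quantize_e4m3_blockwise(w: torch.Tensor, block: int = 128):
    """Simulate block-wise E4M3 quantization of a 2-D weight tensor.

    Each (block x block) tile gets its own scale so that its largest magnitude
    maps onto the E4M3 representable range (about +/-448).
    """
    assert w.dim() == 2 and w.shape[0] % block == 0 and w.shape[1] % block == 0
    rows, cols = w.shape
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    scales = torch.empty(rows // block, cols // block, dtype=torch.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scale = tile.abs().max().clamp(min=1e-12) / 448.0
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = (tile / scale).to(torch.float8_e4m3fn)
    return q, scales
```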
Alignment of the base model involves multiple supervised and reinforcement learning stages:
- Supervised Fine-Tuning (SFT) on approximately 90 billion tokens, heavily featuring prompt–response and truncated thinking trace formats,
- Group Relative Policy Optimization (GRPO),
- Direct Preference Optimization (DPO; a generic form of its objective is sketched after this list),
- Reinforcement Learning from Human Feedback (RLHF).
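Of these stages, DPO has a particularly compact objective; the sketch below shows its generic textbook form (the β value and the per-response summed log-probabilities are illustrative, and the paper's exact alignment recipe may differ).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to rank the chosen response
    above the rejected one by more than the frozen reference model does.
    Inputs are summed log-probabilities of each response under each model."""
    margin = beta * ((policy_chosen_logps - policy_rejected_logps)
                     - (ref_chosen_logps - ref_rejected_logps))
    return -F.logsigmoid(margin).mean()
```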
The distilled Nemotron-Nano-9B-v2 model is produced using a Minitron-based pruning and distillation strategy that targets memory and throughput constraints:
- Layer importance is scored as the MSE between the model's output logits with and without each layer (layer ablation);
- FFN neurons and embedding channels are pruned using forward activations aggregated over batch and sequence dimensions;
- Mamba-2 heads are ranked for relevance, with group structure preserved.
These steps collectively reduce the parameter count from 12B to 9B while retaining high task accuracy and enabling long-context inference.
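The layer-importance criterion can be illustrated schematically as follows; the toy blocks, stand-in LM head, and random calibration tensor are placeholders (real Nemotron blocks take additional inputs), and only the scoring logic mirrors the described MSE-on-logits ablation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def layer_importance_by_logit_mse(blocks: nn.ModuleList, head: nn.Module,
                                  hidden: torch.Tensor) -> list[float]:
    """Score each block by the MSE between the full model's logits and the
    logits obtained with that block ablated (skipped). Higher = more important."""
    def run(skip_idx=None):
        h = hidden
        for i, blk in enumerate(blocks):
            if i != skip_idx:
                h = blk(h)
        return head(h)

    base = run()
    return [torch.mean((run(skip_idx=i) - base) ** 2).item()
            for i in range(len(blocks))]

# Toy usage with stand-in blocks and a stand-in LM head over a 100-token vocab:
d = 32
blocks = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(4)])
head = nn.Linear(d, 100)
scores = layer_importance_by_logit_mse(blocks, head, torch.randn(2, 8, d))
```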
3. Reasoning Benchmark Performance and Inference Efficiency
Nemotron-Nano-9B-v2 demonstrates state-of-the-art accuracy on adversarial and multi-step reasoning benchmarks relative to its parameter scale, including GSM8K, MATH, and general language understanding tasks such as MMLU and MMLU-Pro. The model is directly compared to other sub-10B models, most notably Qwen3-8B, and shown to yield equal or superior results across these standardized benchmarks.
A key metric is inference throughput under reasoning-heavy workloads. In tasks with 8,000-token input and 16,000-token output (e.g., document-level reasoning, long-form completion), Nemotron-Nano-9B-v2 delivers up to 6× higher throughput than Qwen3-8B, attributable to the replacement of quadratic-complexity attention layers with efficient Mamba-2 blocks, without compromising long-context task performance.
4. Long-Context Deployment and Hardware Utilization
A central characteristic of Nemotron-Nano-9B-v2 is its long-context handling: the pruned 9B model can process sequences of up to 128,000 tokens of context on a single NVIDIA A10G GPU (22 GiB memory, bfloat16 precision). This is made feasible by the architectural and compression choices: the Mamba-2 state space layers reduce both computation and KV-cache memory, and the Minitron pruning strategy reduces the overall footprint.
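A back-of-envelope calculation shows why having only six attention layers keeps the 128k-token KV cache small; the KV-head count and head dimension below are hypothetical placeholders used only to illustrate the scaling, not the model's published attention configuration.

```python
def kv_cache_gib(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size estimate: 2 (K and V) * attention layers * KV heads *
    head dimension * sequence length * bytes per element (bfloat16 = 2)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Hypothetical 8 KV heads and head_dim 128 at a 128k-token context:
print(kv_cache_gib(6, 8, 128, 128_000))    # ~2.9 GiB with 6 attention layers
print(kv_cache_gib(34, 8, 128, 128_000))   # ~16.6 GiB if all 34 mixer layers used attention
```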
This capacity enables efficient deployment for applications including multi-document summarization, stepwise reasoning over extensive contexts, and question answering requiring deep contextual memory, all on a single commodity GPU, eliminating the need for distributed inference infrastructure.
5. Released Checkpoints and Dataset Resources
The Nemotron-Nano-9B-v2 project is distinguished by open-source release of both model checkpoints and pre-/post-training datasets. The following resources are provided via Hugging Face:
- NVIDIA-Nemotron-Nano-9B-v2: final aligned and pruned 9B model,
- NVIDIA-Nemotron-Nano-9B-v2-Base: pruned base model (pre-alignment),
- NVIDIA-Nemotron-Nano-12B-v2-Base: full parameter base model.
Datasets released include:
- Nemotron-CC: updated Common Crawl-derived web corpus,
- Nemotron-CC-Math: focused math datasets,
- Nemotron-Pretraining-Code-v1: code datasets,
- Synthetic SFT-style datasets for STEM, multilingual, and reasoning tasks.
Researchers can replicate experiments, fine-tune on downstream tasks, or incorporate Nemotron-Nano-9B-v2 in production inference pipelines using these fully open resources.
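For example, a minimal inference sketch with Hugging Face Transformers might look like the following; the repository id, the `trust_remote_code` requirement, and the generation settings are assumptions that should be checked against the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id; verify the exact name on the model card.
model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

messages = [{"role": "user",
             "content": "Summarize the key steps of the Minitron pruning recipe."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```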
6. Context in the Landscape and Implications
Nemotron-Nano-9B-v2 exemplifies current trends in LLM design: targeted use of state space models (Mamba-2) to address the quadratic complexity bottleneck of Transformers in long-context regimes, complemented by post-hoc pruning/distillation to meet real-world hardware and throughput constraints. This design enables both high-accuracy reasoning and efficient inference for tasks that exceed context capabilities of conventional Transformer models in the same parameter or hardware footprint category.
A plausible implication is that Nemotron-Nano-9B-v2, through its open release, will serve as a baseline for further research on hybrid state space/Transformer architectures, distillation/pruning methods tailored to reasoning performance, and efficient long-context deployment strategies across general NLP and applied AI domains.