NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model (2508.14444v3)

Published 20 Aug 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer LLM designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

Summary

The paper introduces Nemotron Nano 2, a hybrid Mamba-Transformer model that matches or exceeds similarly sized models on reasoning benchmarks while delivering up to 6x higher inference throughput.
It employs a sparse self-attention pattern in which only ~8% of layers are self-attention, combined with aggressive pruning and distillation.
The model performs strongly on math, code, and multilingual benchmarks while enabling 128k-context inference on a single NVIDIA A10G GPU (22 GiB).

Model Architecture and Pretraining

Nemotron Nano 2 is a hybrid Mamba-Transformer LLM, leveraging the Nemotron-H architecture with a sparse self-attention pattern: only ~8% of layers are self-attention, evenly distributed, while the majority are Mamba-2 layers and FFNs. The base model, Nemotron-Nano-12B-v2-Base, comprises 62 layers (6 self-attention, 28 FFN, 28 Mamba-2), with a hidden size of 5120 and FFN dimension of 20480. Grouped-Query Attention is used for efficient KV cache scaling, and squared ReLU activation is adopted for FFNs. The model omits position embeddings, uses RMSNorm, and is trained without dropout or bias weights.
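To make the KV cache point concrete, here is a minimal sketch of the usual cache-size formula for attention layers. Only the count of 6 self-attention layers comes from the text above; the number of KV heads (8), the head dimension (128), and the helper name `kv_cache_bytes` are hypothetical values chosen for illustration, not figures confirmed by this summary.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    The factor 2 accounts for one key vector and one value vector stored
    per token, per KV head, per attention layer.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


ATTN_LAYERS = 6        # from the text: 6 self-attention layers in the 12B base model
N_KV_HEADS = 8         # hypothetical: GQA shares K/V across groups of query heads
HEAD_DIM = 128         # hypothetical
SEQ_LEN = 128 * 1024   # 128k tokens
BF16_BYTES = 2

cache = kv_cache_bytes(ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, BF16_BYTES)
print(f"KV cache at 128k tokens: {cache / 2**30:.2f} GiB")
```

The cache grows linearly in the number of attention layers and KV heads, so keeping attention layers few and sharing K/V across query heads both reduce it. Mamba-2 layers keep a fixed-size state that does not grow with sequence length, so they add nothing to this term.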

Figure 1: Layer pattern of Nemotron-Nano-12B-v2-Base, showing sparse self-attention distribution.

Pretraining runs on 20T tokens using FP8 numerics (E4M3 for tensors, BF16 for first/last matmuls, FP32 optimizer state) with a warmup-stable-decay (WSD) learning rate schedule. The data blend combines curated and synthetic sources, spanning web crawl (Nemotron-CC), math (Nemotron-CC-Math), code (GitHub, license-filtered), academic, multilingual, and SFT-style data. Notably, the math pipeline preserves LaTeX and code structure, and the resulting data is reported to outperform prior math datasets. Synthetic data generation is extensive, including STEM, math dialogues (MIND), multilingual QA, code QA, and academic QA, with rigorous deduplication and quality filtering.
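A WSD (warmup-stable-decay) schedule can be sketched as a three-phase function of the training step. The linear warmup, the linear decay shape, and every numeric value below are assumptions for illustration; the summary does not state the actual hyperparameters or decay curve used.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float, min_lr: float,
           warmup_steps: int, decay_steps: int) -> float:
    """Warmup-stable-decay learning rate (linear warmup and linear decay assumed)."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable: hold the peak learning rate constant.
        return peak_lr
    # Decay: move linearly from peak_lr to min_lr over the final decay_steps.
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * frac


# Hypothetical settings, chosen only to show the three phases.
for s in (0, 50, 500, 900, 1000):
    print(s, wsd_lr(s, total_steps=1000, peak_lr=1.0, min_lr=0.1,
                    warmup_steps=100, decay_steps=200))
```

Because the stable phase is flat, a run of this shape can be extended or branched from a stable-phase checkpoint without re-deriving the schedule, which is the usual motivation for WSD over cosine decay.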

Curriculum learning is applied: Phase 1 promotes diversity, while Phases 2 and 3 emphasize high-quality sources (e.g., Wikipedia). Long-context capability is instilled via continued pretraining at 512k sequence length with synthetic long-document QA data, enabling robust 128k context inference.

Figure 2: Data mixture composition for Phase 1 of pretraining.

Alignment and Post-Training

Alignment is a multi-stage pipeline: three SFT stages (general, tool-calling, long-context/truncated reasoning), followed by DPO (tool-calling), GRPO (instruction-following), RLHF (chat/helpfulness), and checkpoint interpolation for trade-off balancing. SFT data covers math, code, science, tool use, conversational, safety, and multilingual domains, with targeted augmentation for tool-calling and long-context robustness. DPO and GRPO are critical for function-calling and instruction-following, while RLHF improves chat alignment but may degrade multi-task performance, which is recovered via model merging.
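The checkpoint interpolation step can be pictured as a weighted average of two sets of weights. This is a minimal sketch assuming plain linear interpolation over matching parameter names; the function name, the toy parameter dictionaries, and the value of `alpha` are hypothetical, and the summary does not specify the exact merging recipe.

```python
def interpolate_checkpoints(ckpt_a: dict[str, list[float]],
                            ckpt_b: dict[str, list[float]],
                            alpha: float) -> dict[str, list[float]]:
    """Return (1 - alpha) * ckpt_a + alpha * ckpt_b, parameter by parameter."""
    if ckpt_a.keys() != ckpt_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in ckpt_a:
        a, b = ckpt_a[name], ckpt_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged


# Toy stand-ins for a pre-RLHF checkpoint and an RLHF checkpoint.
before_rlhf = {"w": [0.0, 2.0]}
after_rlhf = {"w": [1.0, 4.0]}
print(interpolate_checkpoints(before_rlhf, after_rlhf, alpha=0.5))
```

Sweeping `alpha` between 0 and 1 traces out a family of models between the two endpoints, which is how a trade-off between chat quality and multi-task accuracy could be tuned.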

Budget control is implemented by allowing users to specify the number of "thinking" tokens the model may generate before the final answer. Truncation training ensures that outputs remain well-formed when the reasoning trace is cut short, and keeps the model from compensating for the restricted trace by continuing to reason inside the final answer.
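One way to picture budget control on the serving side is a function that caps the reasoning trace and forces the end-of-thinking marker. This is a sketch under assumptions: the `</think>` tag, the token-list representation, and the function name are hypothetical, and the summary does not describe how the budget is enforced in practice.

```python
def apply_thinking_budget(thinking_tokens: list[str], budget: int,
                          end_tag: str = "</think>") -> list[str]:
    """Cap a reasoning trace at `budget` tokens and close it with `end_tag`.

    If the model closed its reasoning within the budget, keep the trace up to
    and including the tag. Otherwise truncate and append the tag, so that
    generation can resume directly with the final answer.
    """
    window = thinking_tokens[:budget]
    if end_tag in window:
        return window[: window.index(end_tag) + 1]
    return window + [end_tag]


# Trace longer than the budget: truncated, then closed.
print(apply_thinking_budget(["step1", "step2", "step3", "step4"], budget=2))
# Trace that finished early: returned unchanged up to the tag.
print(apply_thinking_budget(["step1", "</think>", "answer"], budget=5))
```

The returned prefix would be fed back to the model to produce the answer; truncation training, as described above, is what makes the model behave sensibly when its reasoning is cut off this way.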

Pruning, Distillation, and Compression

To enable 128k context inference on a single A10G GPU (22 GiB), Nemotron Nano 2 employs aggressive pruning and distillation, extending the Minitron framework. Importance estimation is performed via forward-pass MSE for layers, activation-based scoring for FFN neurons and embedding channels, and group-aware activation scoring for Mamba heads. Layer pruning is iterative, removing those with minimal impact on logits. FFN and embedding pruning follow Minitron's aggregation strategies. Mamba head pruning is group-aware, but for modest compression ratios (<15%), FFN/embedding pruning suffices.
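The two kinds of importance score mentioned above can be sketched in a few lines. The aggregation used here (mean absolute activation per neuron) and both function names are assumptions for illustration; Minitron considers several aggregation choices, and this summary does not say which one was used.

```python
def neuron_scores(activations: list[list[float]]) -> list[float]:
    """Activation-based score per FFN neuron.

    `activations` has one row per calibration sample and one column per neuron.
    The score is the mean absolute activation (an assumed aggregation).
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    return [sum(abs(row[j]) for row in activations) / n_samples
            for j in range(n_neurons)]


def top_k_indices(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring neurons, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mse(reference: list[float], candidate: list[float]) -> float:
    """Mean squared error, e.g. between logits with and without a given layer."""
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


# Toy calibration batch: 2 samples, 3 neurons.
acts = [[1.0, -0.1, 3.0],
        [-1.0, 0.2, -2.0]]
print(top_k_indices(neuron_scores(acts), k=2))  # neurons to keep
print(mse([1.0, 2.0], [1.0, 4.0]))              # impact of dropping a layer
```

For layer pruning, the layer whose removal yields the smallest MSE against the full model's outputs would be dropped first, and the procedure repeated on the reduced model.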

A lightweight NAS framework enumerates candidate architectures under a strict memory budget (19.66 GiB for weights + KV cache). Depth is fixed at 56 layers (4 attention), with width pruning along embedding, FFN, and Mamba axes. Candidates are distilled for 19B tokens, and throughput/accuracy are benchmarked. The selected architecture (56 layers, 4480 hidden, 15680 FFN, 128 Mamba heads, ~8.9B params) achieves the best accuracy-throughput trade-off.
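The memory figures above can be checked with simple arithmetic. The sketch below uses only numbers quoted in this summary (about 8.9B parameters, bfloat16 weights, a 19.66 GiB budget, a 22 GiB device); since the parameter count is approximate, the results are estimates rather than exact values.

```python
GIB = 2**30


def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights in GiB (2 bytes per parameter for bfloat16)."""
    return n_params * bytes_per_param / GIB


params = 8.9e9       # "~8.9B params" from the text; approximate
budget_gib = 19.66   # budget for weights + KV cache, from the text
device_gib = 22.0    # A10G memory, from the text

w = weights_gib(params)
cache_room_gib = budget_gib - w          # what is left for the cache at 128k context
headroom_gib = device_gib - budget_gib   # left outside the budget for everything else

print(f"weights:        {w:.2f} GiB")
print(f"room for cache: {cache_room_gib:.2f} GiB")
print(f"headroom:       {headroom_gib:.2f} GiB")
```

Under these figures the weights take roughly 16.6 GiB, leaving about 3 GiB of the budget for the cache. This is why the candidate search has to trade width against cache size rather than treating them independently.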

Distillation is staged: depth pruning + KD (60B tokens), width pruning + KD (50B tokens @ 8k, 25B @ 49k, 1B @ 262k), DPO, GRPO, RLHF, and model merging. Reasoning model distillation uses a 70/30 blend of post-training and pretraining data for optimal math accuracy. The base model is distilled with 100% pretraining data.
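The 70/30 blend can be illustrated as a weighted choice of data source for each training sample. The source labels, the per-sample sampling scheme, and the seed are hypothetical; the summary gives only the ratio, not how the blend is implemented.

```python
import random


def sample_sources(n: int, post_training_fraction: float = 0.7,
                   seed: int = 0) -> list[str]:
    """Draw a data-source label for each of n samples according to the blend ratio."""
    rng = random.Random(seed)  # seeded for reproducibility
    return ["post_training" if rng.random() < post_training_fraction
            else "pretraining"
            for _ in range(n)]


draws = sample_sources(10_000)
share = draws.count("post_training") / len(draws)
print(f"post-training share: {share:.3f}")
```

Setting `post_training_fraction` to 0.0 reproduces the base-model case described above, where distillation uses pretraining data only.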

Figure 3: Task accuracy progression across distillation pipeline stages for Nemotron Nano 2.

Evaluation and Results

Nemotron Nano 2 achieves strong results across reasoning, code, math, multilingual, and long-context benchmarks. On MMLU, MMLU-Pro, AGIEval, GSM8K, MATH, AIME, HumanEval+, MBPP+, ARC, HellaSwag, OpenBookQA, PIQA, WinoGrande, and RULER-128k, the model matches or surpasses Qwen3-8B and Gemma3-12B, with especially pronounced gains in math (MATH: 80.5 vs. Qwen3-8B's 55.4) and long-context (RULER-128k: 82.22 vs. Gemma3-12B's 80.7).

Figure 4: Nemotron Nano 2 vs. Qwen3-8B: comparable or superior accuracy, up to 6.3× higher throughput for complex reasoning workloads.

Throughput is a key differentiator: Nemotron Nano 2 achieves 3–6× higher output token generation rates for generation-heavy scenarios (e.g., 8k input/16k output), with efficient memory scaling enabling 128k context inference on commodity hardware.

Multilingual performance is robust, with Global-MMLU-Lite and MGSM scores exceeding Qwen3-8B and Gemma3-12B in most languages. Ablation studies confirm the efficacy of synthetic DiverseQA-crawl data and fundamental reasoning SFT-style data for boosting multilingual and reasoning accuracy.

Implementation and Practical Considerations

Nemotron Nano 2 is open-sourced, with checkpoints for the 9B-v2 aligned/pruned model, 9B-v2 base, and 12B-v2 base, plus the majority of pre- and post-training datasets. The model is optimized for vLLM and TensorRT-LLM inference, with FP8/BF16 support and memory-aware architecture. The pruning/distillation pipeline is modular, enabling adaptation to other hardware or context length constraints.

For deployment, the model supports explicit budget control for reasoning traces, robust long-context handling, and efficient tool-calling. The architecture is well-suited for reasoning-intensive applications (e.g., math, code, multi-step tool use) and multilingual scenarios. The open datasets facilitate further research in data-centric LLM training.

Implications and Future Directions

Nemotron Nano 2 demonstrates that hybrid Mamba-Transformer architectures, combined with aggressive pruning/distillation and data-centric training, can deliver state-of-the-art reasoning accuracy and throughput on modest hardware. The sparse attention pattern and FP8 numerics are effective for scaling context length and reducing inference cost. The alignment pipeline, with multi-stage SFT, DPO, GRPO, RLHF, and model merging, provides a template for robust instruction-following and tool use.

Future work may explore further compression (e.g., quantization, more aggressive head pruning), extension to multimodal reasoning, and integration with agentic tool-calling frameworks. The open release of models and data is intended to support research on efficient, high-accuracy LLMs for real-world reasoning tasks.

Conclusion

Nemotron Nano 2 sets a new standard for efficient, accurate, long-context reasoning among open LLMs of its size. Its hybrid architecture, data-centric training, and compression strategies enable high-throughput inference and strong benchmark performance, with practical deployment on a single commodity GPU. The model and datasets provide a valuable resource for the community, supporting further advances in scalable, robust language modeling.

Authors (217)
