NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model (2508.14444v1)

Published 20 Aug 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer LLM designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

Summary

  • The paper presents a hybrid Mamba-Transformer model that replaces traditional self-attention layers with Mamba-2 layers to significantly enhance inference speed.
  • It is pre-trained on 20 trillion tokens with an FP8 recipe and aligned through SFT, GRPO, DPO, and RLHF to improve adaptability across domains.
  • The model achieves up to 6.3 times higher throughput than competitors while efficiently managing long-context reasoning on a single A10G GPU.

"NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model" (2508.14444)

Introduction to Nemotron Nano 2

The paper presents Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer model designed to maximize throughput on reasoning workloads while maintaining high accuracy. Building on Nemotron-H, the architecture replaces most self-attention layers with Mamba-2 layers, which substantially speeds up inference when generating long reasoning traces (Figure 1).

Figure 1: A comparison of the Nemotron Nano 2 and Qwen3-8B in terms of accuracy and throughput.

Model Architecture and Pre-Training

The base model, Nemotron-Nano-12B-v2-Base, was pre-trained on 20 trillion tokens using an FP8 training recipe. Its 62-layer network retains only a handful of self-attention layers, strategically placed among the Mamba-2 layers to preserve quality on extended sequences (Figure 2).

Figure 2: Nemotron-Nano-12B-v2-Base layer pattern highlighting the distribution of self-attention layers.
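
To make the hybrid layout concrete, the sketch below builds a hypothetical layer-type pattern in which most positions are Mamba-2 or MLP blocks and only a few are self-attention. The symbols, attention positions, and the `build_pattern` helper are illustrative assumptions for this summary, not the exact pattern used in Nemotron-Nano-12B-v2-Base.

```python
# Minimal sketch of a hybrid Mamba-2/Transformer layer layout (assumed symbols):
#   "M" = Mamba-2 sequence-mixing block, "*" = self-attention block, "-" = MLP block

def build_pattern(num_layers: int = 62,
                  attention_positions: tuple = (8, 20, 32, 44, 56)) -> str:
    """Return a layer-type string with self-attention only at chosen positions."""
    layers = []
    for i in range(num_layers):
        if i in attention_positions:
            layers.append("*")   # keep a few self-attention layers for quality
        elif i % 2 == 0:
            layers.append("M")   # Mamba-2 layer: linear-time in sequence length
        else:
            layers.append("-")   # MLP layer: channel mixing only
    return "".join(layers)

if __name__ == "__main__":
    pattern = build_pattern()
    print(pattern)
    print("attention layers:", pattern.count("*"), "of", len(pattern))
```

Because only a few layers attend over the full sequence, per-token generation cost grows far more slowly with context length than in a pure Transformer, which is where the throughput gain on long reasoning traces comes from.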

The pre-training corpus was a large, diverse mixture of curated web data, multilingual data, and specialized subsets such as mathematics and code. This diversity contributed to the base model's strong performance on reasoning benchmarks relative to existing models of similar size.
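
As a rough illustration of how such a mixture can be consumed during training, the snippet below samples document sources according to fixed blend weights. The source names and weights are made-up placeholders, not the actual Nemotron pre-training blend.

```python
import random

# Hypothetical blend weights; the real mixture and its proportions differ.
BLEND = {
    "curated_web": 0.55,
    "multilingual": 0.15,
    "math": 0.15,
    "code": 0.15,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the blend."""
    sources, weights = zip(*BLEND.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in BLEND}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # counts land roughly in proportion to the blend weights
```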

Post-Training and Alignment

Post-training included Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These stages improved the model's behavior across domains, including long-context interactions and efficient tool use.
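
Of these stages, GRPO is the least familiar: it scores a group of sampled completions for the same prompt with a reward signal and normalizes each reward against the group's own statistics. The sketch below shows only that group-relative advantage step, with made-up reward values; it is not the paper's training code.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: each completion's reward is normalized by the
    mean and standard deviation of its own group (completions of one prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions sampled for one prompt, scored by a reward model (made-up values).
rewards = np.array([0.2, 0.9, 0.4, 0.7])
print(group_relative_advantages(rewards))
# Completions above the group mean get positive advantages and are reinforced;
# those below the mean are pushed down during the policy update.
```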

Compression and Distillation

To enable inference over 128k-token contexts on a single A10G GPU (22 GiB of memory, bfloat16 precision), the aligned 12B model was compressed with the Minitron pruning-and-distillation strategy, carefully reducing the architecture and parameter count without sacrificing accuracy.
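
The distillation half of that compress-and-distill step is conventionally a logit-matching objective between the larger teacher and the pruned student. The sketch below is a minimal PyTorch version of such a loss, assuming logits of shape (batch, sequence, vocab) and a temperature hyperparameter; shapes and values are placeholders rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions,
    averaged over all token positions (a standard logit-distillation loss)."""
    t = temperature
    vocab = student_logits.size(-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    teacher_p = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # "batchmean" divides the summed KL by the number of rows, i.e. per token here.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

# Placeholder shapes: 2 sequences, 16 tokens each, a 32k-entry vocabulary.
student = torch.randn(2, 16, 32_000)
teacher = torch.randn(2, 16, 32_000)
print(distillation_loss(student, teacher).item())
```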

Evaluation and Performance

On complex reasoning tasks, Nemotron Nano 2 matches or exceeds Qwen3-8B in accuracy while delivering up to 6.3x higher throughput in reasoning settings such as 8k input and 16k output tokens. This combination matters for applications that demand extensive reasoning, such as complex mathematical problem solving or multilingual content understanding (Figure 3).

Figure 3: Task accuracy at different stages of the distillation pipeline for Nemotron Nano 2.
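
For reference, the throughput comparison reduces to simple arithmetic over measured generation times. The helper below computes output tokens per second and the resulting speedup; the timing numbers are invented purely to show how a 6.3x figure would be derived, not measurements from the paper.

```python
def throughput(tokens_generated: int, seconds: float) -> float:
    """Output tokens per second for one generation run."""
    return tokens_generated / seconds

# Made-up timings for a 16k-token generation following an 8k-token prompt.
baseline_tps = throughput(16_000, 160.0)   # hypothetical similarly-sized baseline
nano2_tps = throughput(16_000, 25.4)       # hypothetical Nemotron Nano 2 run
print(f"baseline: {baseline_tps:.1f} tok/s, "
      f"nano2: {nano2_tps:.1f} tok/s, "
      f"speedup: {nano2_tps / baseline_tps:.1f}x")
```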

Conclusion

In summary, Nemotron-Nano-9B-v2 stands out as a highly efficient reasoning model due to its innovative hybrid architecture and optimization strategies. Its ability to handle extensive reasoning tasks with high throughput opens new possibilities for AI applications requiring real-time reasoning capabilities in resource-constrained environments. Future work may explore further compression techniques and extensions to other domain-specific reasoning tasks.
