
Nemotron-H: Hybrid Mamba–Transformer Model

Updated 12 May 2026
  • Nemotron-H is a family of hybrid Mamba–Transformer models that interleave multi-head self-attention with Mamba-2 layers, whose per-token cost is constant, for efficient long-context reasoning.
  • Model variants range from 8B to 56B parameters; dedicated pruning–distillation pipelines remove up to 50% of self-attention layers with <1% accuracy loss.
  • Downstream Nano variants support inference at up to 128K tokens, and FP8 training lowers training cost, making the family suitable for hardware-constrained environments while maintaining competitive performance.

Nemotron-H is a family of hybrid Mamba–Transformer models designed for accurate and efficient inference, particularly in long-context and reasoning workloads. By interleaving conventional multi-head self-attention with state-space (Mamba-2) layers whose per-token cost is constant, Nemotron-H achieves state-of-the-art accuracy at substantially reduced memory and compute cost relative to pure Transformer baselines. Base variants range from 8B to 56B parameters, and dedicated pruning–distillation pipelines compress them further to 47B and 9B models. Nemotron-H establishes a practical path beyond quadratic attention scaling: it supports long context lengths (up to 128K tokens in downstream Nano variants) and uses a novel FP8 training recipe, making large-scale reasoning accessible in hardware-constrained environments (NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025).

1. Hybrid Mamba–Transformer Architecture

Nemotron-H replaces most self-attention layers in the Transformer stack with structured state-space Mamba-2 blocks, while retaining a small fraction (typically ≈8%) of self-attention layers. Each Nemotron-H model interleaves three components:

  • Self-attention (multi-head, ≈8% of layers)
  • Mamba-2 state-space layers (linear-time per token)
  • Standard 2-layer feed-forward (FFN) blocks

For example, Nemotron-H-56B-Base contains 118 layers: 10 self-attention, 54 Mamba-2, and 54 FFN. Layers are distributed such that each attention block is followed by an FFN, with Mamba-2 layers filling the remainder. Mamba-2 layers maintain a constant-sized hidden state per group and update it in constant time per token, whereas the per-token cost of self-attention grows with context length (quadratic over a full sequence).

The Mamba-2 block operates as follows. Given input $X \in \mathbb{R}^{B \times S \times D}$:

  • Project to several heads:

$U = X W_x^T$, $Z = X W_z^T$, $B = X W_B^T$, $C = X W_C^T$

  • Apply group-wise causal convolution $K * U$ (the SSM kernel)
  • Apply gating and output projection:

$\text{Mamba}(X) = W_O \left[ \sigma(B) \odot (K * U) + C \right]$

$X_\text{out} = X + \operatorname{RMSNorm}(\text{Mamba}(X))$

with $G$ Mamba groups, a fixed per-group state dimension (the Groups/State column in Section 2), and all projections learnable.
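The equations above can be transcribed almost literally into NumPy. The following is a minimal sketch of the simplified block as written in this section, not of the actual Mamba-2 kernel: the inner width `E`, the kernel length `k`, the depthwise form of the convolution, and the choice of a sigmoid for $\sigma$ are all assumptions made for illustration. The projection $Z$ is omitted because it does not appear in the output formula given above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rms_norm(x, eps=1e-6):
    # Normalize each token vector by its root-mean-square (no learned gain here).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def causal_depthwise_conv(u, kernel):
    # u: (B, S, E), kernel: (E, k). Left-padding with k-1 zeros ensures that
    # position t only sees u[:, t-k+1 : t+1], i.e. no future tokens.
    k = kernel.shape[1]
    padded = np.pad(u, ((0, 0), (k - 1, 0), (0, 0)))
    out = np.zeros_like(u)
    for j in range(k):
        out += padded[:, j:j + u.shape[1], :] * kernel[:, j]
    return out

def mamba_block_sketch(x, w_x, w_b, w_c, w_o, kernel):
    u = x @ w_x.T                      # U = X W_x^T
    b = x @ w_b.T                      # B = X W_B^T
    c = x @ w_c.T                      # C = X W_C^T
    mixed = sigmoid(b) * causal_depthwise_conv(u, kernel) + c
    return x + rms_norm(mixed @ w_o.T)  # X_out = X + RMSNorm(Mamba(X))

# Toy shapes (hypothetical): batch 2, sequence 6, model width 8, inner width 16.
rng = np.random.default_rng(0)
B, S, D, E, k = 2, 6, 8, 16, 4
x = rng.standard_normal((B, S, D))
w_x, w_b, w_c = (rng.standard_normal((E, D)) * 0.1 for _ in range(3))
w_o = rng.standard_normal((D, E)) * 0.1
kernel = rng.standard_normal((E, k)) * 0.1
y = mamba_block_sketch(x, w_x, w_b, w_c, w_o, kernel)
print(y.shape)  # (2, 6, 8)
```

Because every step is either per-token or causal, changing the last token of `x` leaves all earlier output positions unchanged, which is the property that allows token-by-token decoding.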

Computational cost per generated token:

  • Self-attention (per layer): compute and KV-cache memory both grow linearly with the number of tokens already in context
  • Mamba-2 (per layer): compute and recurrent-state memory are fixed by the model width, group count, and state dimension

Neither the compute nor the memory term for Mamba-2 grows with sequence length $S$, allowing efficient very-long-context inference.
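To make the contrast concrete, the sketch below estimates how the key/value cache of the attention layers grows with context length. It uses the standard KV-cache accounting (keys plus values, per layer, per token). The layer count of 10 matches the 56B example above; the number of KV heads (8), the head dimension (128), and 2 bytes per element are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(num_attn_layers, seq_len, num_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every past token in every attention layer.
    return 2 * num_attn_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

GIB = 2**30
for seq_len in (8_192, 131_072):
    size = kv_cache_bytes(num_attn_layers=10, seq_len=seq_len, num_kv_heads=8, head_dim=128)
    print(f"{seq_len:>7} tokens: {size / GIB:.2f} GiB of KV cache")
```

The cache scales with both the context length and the number of attention layers, so keeping only a few attention layers shrinks it proportionally, while the Mamba-2 state stays the same size at any context length.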

2. Model Variants, Compression, and Parameterization

Nemotron-H is instantiated in several parameter regimes:

| Model | Layers (attn, Mamba-2, FFN) | Hidden size | FFN size | Groups / State |
| --- | --- | --- | --- | --- |
| N-H-8B-Base | 4, 24, 24 | 4096 | 21,504 | 32 / 128 |
| N-H-56B-Base | 10, 54, 54 | 8192 | 32,768 | 64 / 256 |
| N-H-47B-Base | 5, 44, 49 | 8192 | 30,720 | 64 / 256 |
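As a quick check on the "≈8% self-attention" figure, the layer counts in the table can be summed per variant. This is plain arithmetic on the numbers listed above; the helper name `layer_summary` is made up for this example.

```python
# (attention, Mamba-2, FFN) layer counts taken from the table above.
variants = {
    "N-H-8B-Base": (4, 24, 24),
    "N-H-56B-Base": (10, 54, 54),
    "N-H-47B-Base": (5, 44, 49),
}

def layer_summary(attn, mamba, ffn):
    total = attn + mamba + ffn
    return total, attn / total

for name, counts in variants.items():
    total, attn_share = layer_summary(*counts)
    print(f"{name}: {total} layers, {attn_share:.1%} self-attention")
```

The two base models land close to 8%, while the compressed 47B model falls to roughly 5% because pruning removed half of its attention layers.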

MiniPuzzle compression (applied to shrink the 56B-Base to 47B-Base) performs:

  • Layer importance scoring (dropout-induced MSE for pruning)
  • Conditional search (NAS) to enumerate ≈400 candidate prunings
  • Rapid screening by next-token accuracy and teacher agreement
  • Small-benchmark evaluation for the top-130, selection of top-3
  • Two-stage distillation: short (7B tokens) KL, long (63B tokens), using the 56B teacher

Final pruning achieves:

  • Self-attention: 10 → 5 layers (50%)
  • Mamba-2: 54 → 44 (19%)
  • FFN: 54 → 49 (9%), width 32,768 → 30,720 (6% narrower)
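The percentages above follow directly from the before/after counts. A short sketch that recomputes them (the `reduction` helper is illustrative only):

```python
def reduction(before, after):
    # Fraction of the original quantity that was removed.
    return (before - after) / before

pruning = {
    "self-attention layers": (10, 5),
    "Mamba-2 layers": (54, 44),
    "FFN layers": (54, 49),
    "FFN width": (32_768, 30_720),
}

for name, (before, after) in pruning.items():
    print(f"{name}: {before} -> {after} ({reduction(before, after):.1%} removed)")
```

Attention layers are pruned far more aggressively than Mamba-2 or FFN layers, which is consistent with the search targeting a memory budget, since attention layers are the ones that carry a context-dependent cache.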

Accuracy loss is $<1\%$ on standard LLM benchmarks. The 47B model fits in 32 GiB at FP4 precision, runs 1.2× faster than the 56B, and supports large sequence lengths.

Downstream, Nemotron-Nano-12B-v2-Base and Nemotron-Nano-9B-v2 apply similar principles, with the latter fitting 128K tokens on a single 22GiB A10G and achieving 3–6× speedups against Qwen3-8B (NVIDIA et al., 20 Aug 2025).
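A back-of-the-envelope estimate shows why these memory figures are plausible. The sketch below counts weight storage only, using nominal parameter counts (47B and 9B) and ignoring activations, caches, and framework overhead. Treating the 9B model as stored in BF16 (16 bits per parameter) is an assumption, not something stated above.

```python
def weight_gib(num_params, bits_per_param):
    # Weight storage only: parameters * bits, converted to GiB.
    return num_params * bits_per_param / 8 / 2**30

print(f"47B at FP4:  {weight_gib(47e9, 4):.1f} GiB of weights (budget: 32 GiB)")
print(f"9B at BF16:  {weight_gib(9e9, 16):.1f} GiB of weights (budget: 22 GiB)")
```

Whatever remains of each budget after the weights is what is available for the attention KV cache, the Mamba-2 state, and runtime buffers.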

3. Training Regimes: FP8 Precision

Nemotron-H introduces an FP8 pipeline for efficient large-scale training:

  • All linear layers (two FFN GEMMs, Q/K/V/out-proj in attention) are in FP8 (E4M3 for weights/activations, E5M2 for gradients) except first/last four layers (kept in BF16)
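  • Per-tensor current scaling: the scale factor is computed from the current tensor's maximum absolute value, the tensor is scaled elementwise and then rounded to FP8, and underflows are flushed to zero; round-toward-zero is reported as the best-performing rounding mode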
  • FP8 models are trained for the same number of tokens as BF16; e.g., 20T tokens for 56B (NVIDIA et al., 4 Apr 2025) and 12B (NVIDIA et al., 20 Aug 2025).
  • The relative gap between FP8 and BF16 in training and validation log-likelihood loss is ≈0.1% early in training, and downstream accuracy (GSM8K, MATH, HumanEval) ends up matching.

FP8 nearly halves storage and memory-bandwidth demands relative to BF16 for the bulk of compute, enabling either larger batch sizes or reduced GPU requirements. Stability is ensured by retaining high precision at the edges (first/last four layers) and by tuning the rounding mode.
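Per-tensor current scaling can be simulated in NumPy without FP8 hardware. The sketch below is an illustration under stated assumptions, not the training recipe itself: it uses the standard E4M3 parameters (3 mantissa bits, smallest normal exponent -6, largest finite value 448), derives the scale from the tensor's current absolute maximum, and rounds toward zero. The function name `quantize_e4m3` and the sample values are made up for the example.

```python
import numpy as np

E4M3_MAX = 448.0      # largest finite magnitude in the E4M3 format
E4M3_MIN_EXP = -6     # exponent of the smallest normal E4M3 value
E4M3_MANT_BITS = 3    # explicit mantissa bits

def quantize_e4m3(x):
    """Simulate per-tensor current scaling to E4M3 with round-toward-zero."""
    amax = float(np.max(np.abs(x)))
    if amax == 0.0:
        return np.zeros_like(x), 1.0
    scale = E4M3_MAX / amax                     # from the *current* tensor's abs-max
    y = x * scale
    # frexp gives |y| = m * 2**e with m in [0.5, 1), so floor(log2|y|) = e - 1.
    _, e = np.frexp(np.abs(y))
    exp = np.maximum(e - 1, E4M3_MIN_EXP)       # below this, values are subnormal
    step = np.ldexp(1.0, exp - E4M3_MANT_BITS)  # spacing of representable values
    q = np.trunc(y / step) * step               # round toward zero; tiny values become 0
    return np.clip(q, -E4M3_MAX, E4M3_MAX), scale

x = np.array([2.0, -1.0, 0.3, 1e-7])
q, scale = quantize_e4m3(x)
x_hat = q / scale                               # dequantized approximation of x
print(q, x_hat)
```

In this example the largest element maps exactly onto 448, the value 0.3 loses precision because only 3 mantissa bits survive, and 1e-7 underflows to zero, which illustrates why the scale is recomputed from each tensor rather than fixed.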

4. Downstream Compression: MiniPuzzle and Minitron

MiniPuzzle (for 47B) and Minitron (for 9B) provide systematic model compression pipelines:

  • Layer-specific MSE and activation importance determine pruning order
  • Conditional NAS with rapid candidate enumeration/filtering explores hundreds of sub-architectures per hardware/memory/latency constraint
  • Short knowledge-distillation (KL loss, FP8) aligns student with teacher on 1M held-out tokens
  • Final accuracy recovery via extended distillation (63B tokens for 47B, 60B/25B/1B tokens at increasing context for 9B (NVIDIA et al., 20 Aug 2025))
  • Neuron and channel pruning adapt embedding and FFN widths per budget
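The KL-based distillation step in the list above can be illustrated with a small NumPy function. This is a generic sketch of logit distillation, assuming a forward KL divergence from teacher to student averaged over tokens; the direction of the KL, the absence of a temperature, and the function names are assumptions rather than details given in the source.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kd_kl_loss(teacher_logits, student_logits):
    # KL(teacher || student), summed over the vocabulary and averaged over tokens.
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(t) * (t - s), axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))   # 4 tokens, toy vocabulary of 10
student = teacher + 0.5 * rng.standard_normal((4, 10))
print(kd_kl_loss(teacher, teacher))      # zero when the student matches the teacher
print(kd_kl_loss(teacher, student))      # positive otherwise
```

Minimizing this quantity pulls the pruned student's next-token distribution toward the teacher's, which is how accuracy lost to pruning is recovered.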

Resulting student models match >99% of teacher accuracy while dramatically reducing inference cost. For instance, the 9B model (56 layers) fits 128K-token contexts in 19.66 GiB (A10G) with 5% memory headroom for KV caches and vision modules (NVIDIA et al., 20 Aug 2025).

5. Empirical Performance and Evaluation

Empirical results establish Nemotron-H’s superiority in inference efficiency and competitive accuracy:

| Model | MMLU-Pro | GSM8K | HumanEval | Throughput (tok/s/GPU) | Speedup vs. Qwen / Llama |
| --- | --- | --- | --- | --- | --- |
| N-H-56B-Base | 60.5 | 93.7 | 60.4 | 14.0k | 2.4× / 2.8× |
| N-H-47B-Base | 61.8 | 93.3 | 61.0 | 17.2k | 2.9× / 3.4× |
| Qwen-2.5-72B | 58.8 | 90.9 | 56.7 | 5.8k | - |
| Llama-3.1-70B | 51.3 | 83.9 | 57.3 | 5.0k | - |

For 8B models, N-H-8B achieves 1.8× throughput over Qwen-7B and 3.0× over Llama-8B on long contexts (65,536). On 17 tasks, the 56B is top-1 on 9 vs. Qwen-2.5-72B and 7 vs. Llama-3.1-70B; the 47B compressed model is often equal or superior in end-task accuracy.

Nemotron-Nano-9B has demonstrated 3–6× higher throughput over Qwen3-8B for 8K/16K token reasoning, with comparable or higher accuracy on GSM8K, MATH, HumanEval+, MMLU-Pro, and RULER @128K context (NVIDIA et al., 20 Aug 2025).

6. Long-context Reasoning and Hardware Integration

Nemotron-H’s architecture, Mamba-2 design, and pruning–distillation pipelines enable:

  • Support for 128K sequence lengths (Nano-9B) at single-GPU memory budgets (<20 GiB with KV caches and optional vision heads)
  • Linear scaling in sequence, both in compute and memory, for the majority of the stack
  • Reduced hardware requirements, allowing state-of-the-art long-context reasoning on a single GPU, from an A10G up to an H100 or similar

Integration is supported by Hugging Face (nvidia/nemotron-h-56b-base etc.) and NeMo APIs, with automatic precision selection and existing Megatron-LM kernels.

Nemotron-H variants can thus be readily deployed, with support for FP8 where compatible libraries/hardware are available.

7. Significance and Prospects

Nemotron-H demonstrates that hybridization, that is, the judicious insertion of constant-state SSM blocks into Transformer stacks, can relax the quadratic bottleneck of canonical attention without sacrificing downstream capability. These architectures consistently match or surpass larger pure-attention models on both general and reasoning benchmarks, and compression tools (MiniPuzzle, Minitron) allow them to be scaled down to severe hardware and memory constraints. Adopting the FP8 pipeline further improves resource efficiency. A plausible implication is that SSM–attention hybrid architectures and low-precision training regimes will continue to shape the foundation model landscape, especially for tasks with extreme context requirements or budgeted environments (NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025).
