Nemotron-H: Hybrid Mamba-Transformer LLM
- Nemotron-H is a family of large language models that hybridizes Mamba-2 state-space modules with Transformer self-attention for efficient long-sequence inference.
- The design employs fine-grained neural architecture search, blockwise local distillation, and FP8 quantization to optimize throughput, memory usage, and performance.
- Deployment strategies include hardware-aware optimizations and flexible variant pipelines that enable scalable, low-memory, high-performance inference under tight compute constraints.
Nemotron-H denotes a family of LLMs that systematically integrate state-space model (SSM) modules—specifically Mamba-2 layers—into conventional Transformer stacks, or employ fine-grained architectural search and distillation to optimize for hardware-aware, high-throughput inference. The design goal is efficient, accurate long-sequence autoregressive modeling under tight compute and memory constraints, with empirical performance that matches or exceeds pure-Transformer baselines at similar parameter count. Nemotron-H models encompass both: (1) hybrid Mamba-Transformer architectures in which Mamba-2 SSM layers substitute for a majority of self-attention blocks, and (2) Puzzle-derived variants (such as Nemotron-51B) produced by large-scale neural architecture search over Transformer modules, optimized for FP8 deployment on NVIDIA H100 GPUs (NVIDIA et al., 4 Apr 2025, Bercovich et al., 28 Nov 2024, NVIDIA et al., 20 Aug 2025).
1. Hybrid Mamba–Transformer Architecture
Nemotron-H implements a hybrid decoder-only Transformer stack, in which most self-attention (SA) blocks are replaced by Mamba-2 layers. The canonical configuration for an 8B or 56B base model features the following:
- For Nemotron-H-8B (≈8B parameters): 52 total layers, with 4 self-attention layers (≈8%) evenly interleaved, each followed by an FFN; 24 Mamba-2 layers; 24 FFN layers.
- For Nemotron-H-56B (≈56B): 118 total layers, 10 attention layers, 54 Mamba-2, 54 FFN.
- Placement rules: first layer always Mamba-2; last always FFN; every SA block precedes an FFN, matching Transformer conventions.
Mamba-2 layers supplant SA primarily to eliminate quadratic complexity in sequence length: self-attention incurs $O(L)$ cost per token and a key–value cache that grows with $L$, whereas Mamba-2 runs in linear time overall, i.e., constant cost per token. Each Mamba-2 layer maintains a fixed-size recurrent state per head group and applies a local convolution (window size 4), so its state footprint is set at configuration time and independent of context length (NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025). This transition yields O(1) per-token memory and enables batch sizes and context lengths unfeasible for SA-centric LLMs.
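As a concrete illustration of the layer counts and placement rules above, the following sketch builds a hypothetical 52-block layout for the 8B-scale configuration. The interleaving strategy (one segment per attention block) is an assumption chosen for illustration; the published Nemotron-H orderings may differ, but the counts and the stated rules are respected.

```python
def build_hybrid_pattern(n_mamba=24, n_attn=4, n_ffn=24):
    """Return a block layout consistent with the stated placement rules:
    first block Mamba-2, last block FFN, every attention block immediately
    followed by an FFN, attention spread evenly through the stack."""
    assert n_ffn >= n_attn, "each attention block needs a trailing FFN"
    free_ffn = n_ffn - n_attn                        # FFNs not paired with attention
    pattern = []
    for s in range(n_attn):                          # one segment per attention block
        m = n_mamba // n_attn + (1 if s < n_mamba % n_attn else 0)
        f = free_ffn // n_attn + (1 if s < free_ffn % n_attn else 0)
        pattern += ["mamba2"] * m + ["ffn"] * f + ["attention", "ffn"]
    assert pattern[0] == "mamba2" and pattern[-1] == "ffn"
    return pattern

layout = build_hybrid_pattern()                      # 8B-scale: 52 blocks total
print(len(layout), layout.count("mamba2"), layout.count("attention"), layout.count("ffn"))
# -> 52 24 4 24
```

Calling `build_hybrid_pattern(54, 10, 54)` reproduces the 118-block counts of the 56B-scale configuration under the same rules.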
Puzzle-optimized Nemotron-H configurations (e.g., Nemotron-51B) employ fine-grained, non-uniform networks in which each layer independently selects from a menu of attention and FFN variants, ranging from full blocks down to grouped-query attention, low-dimensional linear replacements, or a no-op (skip) (Bercovich et al., 28 Nov 2024). This heterogeneity enables strict hardware-fitting on GPU accelerators.
2. Mamba-2 Layer: Mechanism and Computational Properties
The Mamba-2 layer replaces standard QKV attention with group-wise, gated state-space modeling. At each timestep $t$ and group $g$, the layer applies a selective state-space recurrence of the (SSD) form

$$H_t^{(g)} = a_t^{(g)}\, H_{t-1}^{(g)} + B_t^{(g)} \big(x_t^{(g)}\big)^{\top}, \qquad y_t^{(g)} = \big(C_t^{(g)}\big)^{\top} H_t^{(g)},$$

where $H_t^{(g)}$ is a fixed-size recurrent state, $a_t^{(g)}$ is a scalar, input-dependent decay, and $B_t^{(g)}, C_t^{(g)}$ are input-dependent projections. The group outputs are concatenated and combined with the block's local convolution (window size 4) and gating nonlinearity. Computational cost per token is $O(d_{\text{state}} \cdot d_{\text{head}})$ per head, constant with respect to sequence length $L$, as opposed to the $O(L)$ per-token cost of self-attention; SA memory grows as $O(L)$ floats per layer (the key–value cache), while Mamba-2 stores only a constant number of state floats per layer, independent of $L$ (NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025).
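A minimal single-head reference implementation of this recurrence makes the constant-size state explicit. This NumPy loop is illustrative only; production Mamba-2 kernels use fused, parallel scans over all heads and groups at once, and the dimensions in the usage line are assumptions.

```python
import numpy as np

def ssd_recurrence(x, a, B, C):
    """Schematic single-head Mamba-2 (SSD) recurrence.

    x: (L, d_head)   inputs for this head
    a: (L,)          scalar, input-dependent state decay per step
    B: (L, d_state)  input projections per step
    C: (L, d_state)  output projections per step
    Returns y: (L, d_head).
    """
    L, d_head = x.shape
    d_state = B.shape[1]
    H = np.zeros((d_state, d_head))          # fixed-size recurrent state
    y = np.zeros((L, d_head))
    for t in range(L):
        H = a[t] * H + np.outer(B[t], x[t])  # O(d_state * d_head) work per token
        y[t] = C[t] @ H                      # readout, independent of L
    return y

# Usage sketch with random data and assumed (illustrative) dimensions.
L, d_head, d_state = 16, 64, 128
y = ssd_recurrence(np.random.randn(L, d_head), np.random.rand(L),
                   np.random.randn(L, d_state), np.random.randn(L, d_state))
```

The state `H` never grows with `t`, which is the source of the O(1) per-token memory discussed below.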
This complexity reduction is especially potent for long-context inference (large $L$), where the key–value cache and softmax matmuls saturate memory and compute for SA-heavy models.
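To make the contrast concrete, a back-of-envelope comparison with assumed (not published) head, state, and layer dimensions shows how attention-layer KV cache scales with context while the Mamba-2 state stays constant:

```python
# Hypothetical per-sequence cache sizes in FP16 (2 bytes); all dimensions are assumptions.
bytes_per = 2
L = 65_536                          # context length
n_attn_layers = 4                   # the 8B config keeps only 4 attention layers
n_kv_heads, d_head = 8, 128         # assumed GQA geometry
kv_cache = n_attn_layers * 2 * L * n_kv_heads * d_head * bytes_per      # grows linearly in L
n_mamba_layers, d_state, d_inner = 24, 128, 8192                        # assumed Mamba-2 geometry
ssm_state = n_mamba_layers * d_state * d_inner * bytes_per              # constant in L
print(f"KV cache: {kv_cache / 2**30:.2f} GiB, Mamba-2 state: {ssm_state / 2**30:.3f} GiB")
# -> KV cache: 1.00 GiB, Mamba-2 state: 0.047 GiB (per sequence)
```

Under these assumed dimensions, even four attention layers dominate per-sequence cache memory at 64K context; a fully SA-based stack would multiply the KV-cache term by the full layer count.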
3. Architecture Search, Pruning, and Distillation Pipelines
Nemotron-H models utilize multiple strategies for maximizing throughput under accuracy and memory constraints:
- Blockwise Local Distillation (BLD): Each candidate block variant (attention or FFN) is trained in isolation to mimic the output of the corresponding parent block while the rest of the network is frozen. This enables massive parallelization and efficient search (Bercovich et al., 28 Nov 2024).
- MiniPuzzle/Minitron Compression: For hybrid Mamba-Transformer stacks, layers are scored for importance by the per-layer MSE incurred when they are removed and by neuron/channel activation norms. Layerwise and width-pruning configurations are enumerated to fit memory budgets (e.g., fitting ≤31.7 GiB in FP4 for 47B-Base). The best candidates are selected after short- and long-horizon distillation under a forward KL loss (NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025).
- Mixed-Integer Programming (MIP): For pure Transformer Puzzle derivatives, the architecture search problem is framed as an MIP: maximize total quality (equivalently, minimize the KL-divergence degradation on next-token prediction) over all layer/block selections, subject to constraints on summed parameter plus KV-cache memory, throughput, and latency (Bercovich et al., 28 Nov 2024).
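The selection step can be pictured as a multiple-choice knapsack: each layer picks exactly one variant so that total quality penalty is minimized under a memory cap. The toy dynamic program below is an illustrative stand-in (variant names, costs, and penalties are invented); the actual Puzzle pipeline hands the problem to a full MIP solver with additional throughput and latency constraints (Bercovich et al., 28 Nov 2024).

```python
def select_blocks(layer_variants, memory_budget):
    """layer_variants: list over layers; each entry is a list of
    (variant_name, memory_cost, quality_penalty) tuples with integer memory_cost."""
    # dp maps used-memory -> (best total penalty at that memory, chosen variants)
    dp = {0: (0.0, [])}
    for variants in layer_variants:
        new_dp = {}
        for mem, (pen, picks) in dp.items():
            for name, cost, penalty in variants:
                m2 = mem + cost
                if m2 > memory_budget:
                    continue
                cand = (pen + penalty, picks + [name])
                if m2 not in new_dp or cand[0] < new_dp[m2][0]:
                    new_dp[m2] = cand
        dp = new_dp
    return min(dp.values(), key=lambda v: v[0]) if dp else (float("inf"), [])

# Two layers, each choosing among a full block, a cheaper variant, or a no-op.
layers = [
    [("full_attn", 4, 0.00), ("gqa", 2, 0.05), ("noop", 0, 0.40)],
    [("full_ffn", 3, 0.00), ("narrow_ffn", 1, 0.10), ("noop", 0, 0.50)],
]
print(select_blocks(layers, memory_budget=5))   # -> (0.05, ['gqa', 'full_ffn'])
```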
Distillation trains candidate/pruned models to match either the parent logits or intermediate activations, recovering ≈99% of teacher accuracy after ≈60B tokens (NVIDIA et al., 20 Aug 2025).
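A minimal version of the forward-KL objective used in this distillation stage might look as follows (PyTorch; the temperature and reduction are illustrative choices, not taken from the papers):

```python
import torch.nn.functional as F

def forward_kl_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) on next-token distributions.
    Temperature scaling and 'batchmean' reduction are illustrative defaults."""
    t = teacher_logits / temperature
    s = student_logits / temperature
    teacher_probs = F.softmax(t, dim=-1)
    # kl_div(input=log q_student, target=p_teacher) = sum p_t * (log p_t - log q_s)
    return F.kl_div(F.log_softmax(s, dim=-1), teacher_probs,
                    reduction="batchmean") * temperature ** 2
```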
4. Hardware-Aware Optimization and FP8 Training
Nemotron-H models are explicitly architected and trained for modern accelerators—especially NVIDIA H100 SXM and A10G—with the following practicalities:
- FP8 Quantization: All dense GEMMs are executed in E4M3 FP8 precision (except the first/last 4 layers, kept in BF16), enabling ≈2× throughput versus BF16; this quantization scheme is stable when performed with "round towards zero" scaling and block-wise (e.g., 128×128) grouping.
- Memory Fitting: Pruned variants (e.g., 47B, 9B) fit into 32 GB and 22 GB devices, respectively, while accommodating long-context sequences (up to 128K tokens for 9B).
- Serving: Runtime optimized for TensorRT-LLM or vLLM, supporting paged heterogeneous KV-head caching, batched streaming, and "linear" or "no-op" attention/FFN blocks in FP8 (Bercovich et al., 28 Nov 2024, NVIDIA et al., 20 Aug 2025).
- Scalability: Context-parallel training, model-parallel inference, and hardware-aware pruning facilitate scaling to extreme context windows and batch sizes (e.g., sequence length 512K) (NVIDIA et al., 20 Aug 2025).
FP8 training loss is within 0.1% of BF16, with no observable drop in task accuracy; models trained in FP8 sometimes slightly exceed BF16 performance on downstream benchmarks (NVIDIA et al., 4 Apr 2025).
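The block-wise scaling idea above can be sketched as a simple fake-quantization emulation. This is a schematic only, assuming PyTorch ≥ 2.1 for the `float8_e4m3fn` dtype; the real pipeline uses fused FP8 GEMM kernels and the papers' specific scaling/rounding recipe rather than post-hoc emulation.

```python
import torch

def quantize_fp8_blockwise(w, block=128):
    """Fake-quantize a 2-D weight to E4M3 FP8 with per-block (block x block) scales."""
    E4M3_MAX = 448.0                                   # largest finite E4M3 value
    out = torch.empty_like(w)
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            blk = w[i:i + block, j:j + block]
            scale = blk.abs().max().clamp(min=1e-12) / E4M3_MAX
            q = (blk / scale).to(torch.float8_e4m3fn)  # requires PyTorch >= 2.1
            out[i:i + block, j:j + block] = q.to(w.dtype) * scale
    return out

w_q = quantize_fp8_blockwise(torch.randn(512, 512))    # usage sketch
```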
5. Empirical Performance and Comparative Analysis
Nemotron-H models consistently achieve state-of-the-art or near state-of-the-art benchmark scores with materially higher inference throughput at large-scale sequence modeling:
| Model | Params | Context | Reasoning Benchmarks | Inference Throughput (tok/s) | Relative Speedup |
|---|---|---|---|---|---|
| Nemotron-H-56B-Base (FP8) | 56B | 65K | GSM8K: 93.7% | | ≈2.4× (vs Qwen-2.5-72B / Llama-70B) |
| Nemotron-H-47B-Base (FP8) | 47B | 65K | ≈equal to 56B | | ≈2.9× |
| Nemotron-H-8B-Base | 8B | 65K | Matches/outperforms Gemma-3-12B | | ≈1.8×–3× over peers |
| Nemotron-Nano-9B-v2 (BF16) | 9B | 128K | GSM8K CoT: 91.4%, MATH: 80.5% | 156 (8K/16K context, batch 8) | 6.3× (vs Qwen3-8B) |
Throughput increases are driven primarily by O(1) memory and computation per token in Mamba-2 blocks and architecture-level reductions in parameter and cache requirements per block. Puzzle-derived Nemotron-H-51B achieves a 2.17× speedup versus Llama-70B-Instruct with 98.4% accuracy retention, fitting both parameters and KV cache within an 80GB H100 (Bercovich et al., 28 Nov 2024, NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025).
6. Implementation and Deployment Considerations
Deploying Nemotron-H models involves several important practices:
- Memory Budgeting: Network and KV-cache sizes are aligned with deployment GPU; parameter and cache pressure is minimized in early and late layers, with full expressivity reserved for mid-depth blocks (Bercovich et al., 28 Nov 2024).
- Mixed Precision: Mixed FP8/BF16 is mandatory for stable training and fast inference.
- Frameworks: vLLM and TensorRT-LLM for serving; Megatron-LM and NeMo for training (a minimal serving sketch follows this list).
- Reproducibility Tips: Keep first/last four matmuls in BF16; adopt context-parallel inference for long windows; use quantization groups of 128×128 for FP8; follow all known datacenter scaling best practices (NVIDIA et al., 20 Aug 2025).
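As one possible serving path, a minimal vLLM example is sketched below. The checkpoint id, context length, and sampling settings are assumptions to be verified against the published model card, and a vLLM build with hybrid Mamba-Transformer support is assumed.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face id and settings; check the model card before use.
llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    trust_remote_code=True,
    max_model_len=131072,           # long-context serving (128K tokens)
    dtype="bfloat16",
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(
    ["Explain why hybrid Mamba-Transformer models reduce KV-cache memory."], params
)
print(outputs[0].outputs[0].text)
```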
Long-context stability (e.g., for 512K tokens) is enhanced by context-parallelism and truncation-augmented SFT/GRPO strategies.
7. Variants, Compression, and Alignment Pipelines
Nemotron-H's flexibility extends to multiple downstream- or hardware-specific variants:
- MiniPuzzle/Minitron: Compression pipelines for downsampling model width, layer count, or both under precise empirical loss metrics—layer dropping/width pruning followed by knowledge distillation quickly selects configurations that optimally trade off throughput versus accuracy (NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025).
- Alignment: Post-distillation alignment entails SFT, DPO, GRPO, and RLHF, followed by parameter mixing ("model soup"; see the sketch after this list) to balance specialized (reasoning, chat, vision) capabilities as needed. For example, Nemotron-Nano alignment follows a multi-stage sequence to achieve competitive reasoning and conversational ability (NVIDIA et al., 20 Aug 2025).
- Non-uniform Routing: Puzzle-derived variants use a strictly non-homogeneous routing for each layer—some blocks as large full-attention or FFN, others minimal or no-op—enabling highly adaptive and hardware-optimal compute allocation.
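A schematic of the parameter-mixing ("model soup") step referenced above: averaging matched checkpoints after alignment. The weighting scheme and checkpoint names are illustrative assumptions; the papers do not prescribe these exact values.

```python
import torch

def model_soup(state_dicts, weights=None):
    """Weighted average of parameters from several aligned checkpoints
    (e.g., reasoning- vs. chat-specialized). All state dicts must share keys/shapes."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Usage sketch (hypothetical checkpoint files):
# souped = model_soup([torch.load("reasoning.pt"), torch.load("chat.pt")])
```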
Nemotron-H demonstrates that O(1) memory per token, aggressive hybridization of Mamba-2/SA, and hardware-centric NAS can deliver models that fit emerging inference constraints without compromising competitive accuracy across standard LLM and reasoning benchmarks (Bercovich et al., 28 Nov 2024, NVIDIA et al., 4 Apr 2025, NVIDIA et al., 20 Aug 2025).