
Jet-Nemotron: Hybrid Language Models

Updated 25 August 2025
  • Jet-Nemotron is a hybrid language model architecture that interleaves full-attention, linear attention, and dynamic JetBlock modules for efficient inference and advanced reasoning.
  • The Post Neural Architecture Search (PostNAS) pipeline refines the attention stack via beam search and hardware-aware optimization, significantly reducing computational costs.
  • Empirical results show Jet-Nemotron-2B achieves higher MMLU accuracy and up to 53.6× decoding speedup, supporting applications in real-time conversational AI and long-context processing.

Jet-Nemotron is a family of hybrid-architecture LLMs developed using a novel Post Neural Architecture Search (PostNAS) pipeline. Designed for both high accuracy and substantially increased generation throughput, Jet-Nemotron offers a tightly integrated blend of full-attention, linear attention, and dynamic attention mechanisms. This hybrid design, coupled with an efficient post-training architecture search strategy, enables Jet-Nemotron models such as Jet-Nemotron-2B to outperform or match leading full-attention models on diverse benchmarks while advancing the state of efficient LLM inference (Gu et al., 21 Aug 2025).

1. Hybrid Model Architecture

Jet-Nemotron abandons the monolithic full-attention paradigm in favor of a hybrid stack in which attention types are strategically interleaved. The architecture distinguishes itself through:

  • Selective Retention of Full-Attention: Only a subset of layers retain traditional full-attention mechanisms—critical for global context aggregation, complex retrieval, and certain reasoning tasks.
  • Insertion of Linear Attention Blocks: Several full-attention layers are replaced by linear attention mechanisms, reducing time and memory complexity from $O(n^2)$ to $O(n)$ in the sequence length $n$.
  • Introduction of JetBlock: A proprietary dynamic convolution-based module, JetBlock, integrates the advantages of kernel-based linear attention with dynamically generated convolution kernels, omitting static convolutions on query/key projections. The JetBlock kernel generator is defined as:

$$\mathbf{k} = W_2 \,\operatorname{SiLU}\left(W_1 \mathbf{x}\right),$$

where $W_1$ and $W_2$ are learned projections, $\operatorname{SiLU}$ is the activation function, and $\mathbf{x}$ denotes the input features. A minimal code sketch of this kernel generator appears at the end of this section.

  • Sliding Window Attention (SWA): Utilized in tasks necessitating localized, softmax-based attention for pattern matching (e.g., multiple-choice evaluation).

This composite design allows Jet-Nemotron to minimize computational cost without sacrificing the functional expressivity required for advanced reasoning and retrieval benchmarks.
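To make the kernel-generator formula concrete, the following PyTorch-style sketch generates per-sequence depthwise convolution kernels from the input features and applies them causally to a value stream. It is a minimal illustration under stated assumptions (the bottleneck width, the mean-pooling over time, the kernel size, and the depthwise application are all illustrative choices), not the released JetBlock implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernelGenerator(nn.Module):
    """Sketch of a JetBlock-style kernel generator:
    k = W2 * SiLU(W1 * x), producing per-sample depthwise conv kernels."""

    def __init__(self, d_model: int, n_channels: int, kernel_size: int = 4):
        super().__init__()
        self.n_channels, self.kernel_size = n_channels, kernel_size
        # W1 / W2 from the formula; the bottleneck width is an assumption.
        self.w1 = nn.Linear(d_model, d_model // 4, bias=False)
        self.w2 = nn.Linear(d_model // 4, n_channels * kernel_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Mean-pool over time so one kernel set
        # is generated per sequence (a simplifying assumption).
        k = self.w2(F.silu(self.w1(x.mean(dim=1))))
        return k.view(-1, self.n_channels, self.kernel_size)

def apply_dynamic_conv(v: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """Causal depthwise convolution of values with per-sample dynamic kernels.
    v: (batch, seq_len, n_channels); kernels: (batch, n_channels, kernel_size)."""
    batch, seq_len, channels = v.shape
    k = kernels.shape[-1]
    v = F.pad(v.transpose(1, 2), (k - 1, 0))             # left-pad for causality
    v = v.reshape(1, batch * channels, seq_len + k - 1)  # fold batch into groups
    w = kernels.reshape(batch * channels, 1, k)
    out = F.conv1d(v, w, groups=batch * channels)
    return out.reshape(batch, channels, seq_len).transpose(1, 2)

x = torch.randn(2, 16, 128)                              # (batch, seq, d_model)
gen = DynamicKernelGenerator(d_model=128, n_channels=128)
y = apply_dynamic_conv(x, gen(x))                        # same shape as x
```

In the full JetBlock this dynamic convolution complements a kernel-based linear attention path, per the description above; only the generator and its application are sketched here.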

2. Post Neural Architecture Search (PostNAS)

The development of Jet-Nemotron leverages an architecture discovery pipeline that operates post hoc on a pre-trained full-attention backbone:

  • MLP Parameter Freezing: The search commences from an existing full-attention model with its MLP weights frozen, focusing the search solely on the attention stack. This reduces training cost and risk, given the high value of the mature MLP representations.
  • Four-Stage Pipeline:

    1. Full-Attention Layer Placement and Elimination: A once-for-all "super network" encodes all candidate architectures. Beam search, guided by benchmark-driven objectives (e.g., minimizing loss on MMLU), identifies an optimal full-attention/linear attention/SWA schedule.
    2. Linear Attention Block Selection: Several SOTA linear attention modules (e.g., RWKV7, RetNet, Mamba2, Gated DeltaNet) are evaluated using the genuine pre-training task to identify the module with maximal transfer accuracy. Gated DeltaNet yielded leading results, informing the design of JetBlock.
    3. JetBlock Design: The kernel generator module dynamically parameterizes convolutional kernels based on input activations, enhancing linear attention expressivity with minimal computational overhead.
    4. Hardware-Aware Hyperparameter Search: Rather than using parameter count as a proxy, throughput is directly optimized, notably considering the key-value (KV) cache size, a primary bottleneck for decoding speed. Hyperparameters (Q/K/V dimensions, number of heads) are optimized within a fixed KV cache budget for maximum device throughput; a minimal sketch appears below.

This pipeline dramatically reduces the experimental cost of model discovery while enabling hardware-constrained optimization.
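Stage 4 can be made concrete with a short sketch: estimate the per-token KV-cache footprint contributed by the retained full-attention layers for each candidate configuration, discard candidates that exceed a fixed cache budget, and rank the rest by measured decoding throughput. The candidate grid, the scope of the cache formula, and the placeholder throughput function below are illustrative assumptions; the actual PostNAS search benchmarks throughput on the target hardware.

```python
from itertools import product

def kv_cache_bytes_per_token(n_full_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-token KV-cache footprint of the retained full-attention layers
    (keys + values, fp16 by default). Linear-attention layers keep a
    constant-size state and are excluded from this estimate."""
    return 2 * n_full_layers * n_kv_heads * head_dim * bytes_per_elem

def search(candidates, cache_budget_bytes, measure_throughput):
    """Keep candidates under the KV-cache budget, then rank them by
    measured decoding throughput rather than by parameter count."""
    feasible = [c for c in candidates
                if kv_cache_bytes_per_token(*c) <= cache_budget_bytes]
    return max(feasible, key=lambda c: measure_throughput(*c))

# Hypothetical grid: (full-attention layers kept, KV heads, head dim).
grid = list(product([2, 3, 4], [2, 4, 8], [64, 128, 256]))

# Placeholder for a real on-device benchmark (tokens/s at a target context).
def toy_throughput(n_layers, n_kv_heads, head_dim):
    return 1.0 / kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim)

best = search(grid, cache_budget_bytes=4096, measure_throughput=toy_throughput)
print("best config (layers, kv_heads, head_dim):", best)
```

Ranking by measured tokens/s rather than by parameter count is what lets the search trade head count against head dimension under the same cache budget.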

3. Empirical Performance

Jet-Nemotron-2B demonstrates competitive or superior performance compared to recent models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2 across major language modeling and reasoning benchmarks, including MMLU, Mathematical Reasoning, Commonsense QA, and Retrieval:

Model | MMLU Accuracy | Decoding Throughput (tokens/s) | Prefill Speedup
Qwen3-1.7B-Base | 61 | – | –
Jet-Nemotron-2B | Higher* | 2,885 | 6.1×

*Jet-Nemotron-2B attains higher accuracy on MMLU and MMLU-Pro than DeepSeek-V3-Small and Moonlight, models with 15B total and 2.2B activated parameters.

  • On long-context tasks (tested up to 256K tokens on NVIDIA H100), Jet-Nemotron-2B achieves up to 53.6× decoding speedup versus Qwen3-1.7B-Base, while maintaining or surpassing baseline accuracy.

  • Performance gains are most pronounced on throughput-critical tasks, with accuracy retained or improved across math, retrieval, and commonsense queries.
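A plausible reading of the long-context speedup is that only the retained full-attention layers grow their KV cache with context length, while linear-attention and JetBlock layers carry constant-size state. The back-of-envelope comparison below uses illustrative layer counts and head dimensions, not the published configurations:

```python
def kv_cache_gib(n_full_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache held by the full-attention layers only (keys + values, fp16)."""
    return 2 * n_full_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

ctx = 256_000  # long-context setting from the evaluation

# Illustrative configs (not the published ones): a 28-layer full-attention
# baseline vs. a hybrid that keeps full attention in only 2 layers.
baseline = kv_cache_gib(28, 8, 128, ctx)
hybrid   = kv_cache_gib(2, 8, 128, ctx)

print(f"baseline KV cache: {baseline:.1f} GiB")   # ~27 GiB
print(f"hybrid KV cache:   {hybrid:.1f} GiB")     # ~2 GiB
print(f"reduction:         {baseline / hybrid:.0f}x")
```

Even this rough estimate shows why decoding gains widen as the context grows: the cache traffic per generated token shrinks roughly in proportion to the number of full-attention layers removed.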

4. Applications and Broader Implications

Jet-Nemotron's architectural advances yield practical advantages for several domains:

  • Real-time Conversational AI: Low latency and efficient context handling benefit dialogue agents deployed in interactive environments.
  • Long-Context Processing: Summarization, retrieval-augmented generation, and document-level contextualization are improved via the efficient hybrid attention scheme.
  • Structured Code Generation and Mathematical Reasoning: Strategic placement of full-attention layers ensures performance parity on complex reasoning and step-resolved tasks.
  • Model Democratization: The PostNAS paradigm enables the upcycling of mature models, reducing both the energy and economic cost associated with innovation in LLMs.

A plausible implication is the emergence of a new standard workflow in model development, shifting the research focus away from training-from-scratch toward principled, hardware-aware post-training architectural refinement.

5. Technical Challenges and Solution Strategies

Several challenges were encountered and resolved during Jet-Nemotron's development:

  • Pre-training Cost: Training new models from scratch was circumvented by leveraging pre-existing full-attention networks and freezing the MLP, allowing efficient post hoc exploration of the attention mechanism space.
  • Attention Layer Importance: Uniform replacement of full-attention with efficient blocks is often suboptimal; the super network and beam search provided task-conditioned scheduling of attention types, mitigating performance bottlenecks.
  • Linear Block Expressivity: Existing linear options failed to consistently match full-attention accuracy; JetBlock addressed this by integrating dynamic convolutions and removing redundant computations (such as static convolutions on queries and keys).
  • Inference Bottlenecks: Parameter count proved a poor surrogate for speed. Direct hardware-aware search prioritized limits such as KV cache footprint, yielding greater device-centric gains without regressing on accuracy.

6. Future Directions

These results suggest that future research may focus on advancing the expressivity and adaptability of linear attention mechanisms, integrating domain-specific attention scheduling, and further automating hardware-aware architecture refinement. The PostNAS workflow provides a scalable template for extending performant, resource-efficient transformer variants to language modeling and other sequence modeling domains. The hybrid stacking principle is likely extensible to vision-language and multimodal tasks, where diverse contextual dependencies coexist alongside stringent efficiency requirements.

References (1)
