Jet-Nemotron: Hybrid Language Models

Updated 25 August 2025
  • Jet-Nemotron is a hybrid language model architecture that interleaves full-attention, linear attention, and dynamic JetBlock modules for efficient inference and advanced reasoning.
  • The Post Neural Architecture Search (PostNAS) pipeline refines the attention stack via beam search and hardware-aware optimization, significantly reducing computational costs.
  • Empirical results show Jet-Nemotron-2B achieves higher MMLU accuracy and up to 53.6× decoding speedup, supporting applications in real-time conversational AI and long-context processing.

Jet-Nemotron is a family of hybrid-architecture LLMs developed using a novel Post Neural Architecture Search (PostNAS) pipeline. Designed for both high accuracy and substantially increased generation throughput, Jet-Nemotron offers a tightly integrated blend of full-attention, linear attention, and dynamic attention mechanisms. This hybrid design, coupled with an efficient post-training architecture search strategy, enables Jet-Nemotron models such as Jet-Nemotron-2B to outperform or match leading full-attention models on diverse benchmarks while advancing the state of efficient LLM inference (Gu et al., 21 Aug 2025).

1. Hybrid Model Architecture

Jet-Nemotron abandons the monolithic full-attention paradigm in favor of a hybrid stack in which attention types are strategically interleaved. The architecture distinguishes itself through:

  • Selective Retention of Full-Attention: Only a subset of layers retain traditional full-attention mechanisms—critical for global context aggregation, complex retrieval, and certain reasoning tasks.
  • Insertion of Linear Attention Blocks: Several full-attention layers are replaced by linear attention mechanisms, reducing attention time and memory complexity from $O(n^2)$ to $O(n)$ in the sequence length $n$.
  • Introduction of JetBlock: A proprietary dynamic convolution-based module, JetBlock, integrates the advantages of kernel-based linear attention with dynamically generated convolution kernels, omitting static convolutions on query/key projections. The JetBlock kernel generator is defined as:

$$\mathbf{k} = W_2\,\operatorname{SiLU}\!\left(W_1 \mathbf{x}\right),$$

where $W_1$ and $W_2$ are learned projections, $\operatorname{SiLU}$ is the activation function, and $\mathbf{x}$ denotes the input features; a minimal sketch of this generator is given below.

  • Sliding Window Attention (SWA): Utilized in tasks necessitating localized, softmax-based attention for pattern matching (e.g., multiple-choice evaluation).

This composite design allows Jet-Nemotron to minimize computational cost without sacrificing the functional expressivity required for advanced reasoning and retrieval benchmarks.
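
The following sketch shows how a kernel generator of this form could plug into a linear attention block. It is based only on the formula above: the class names, dimensions, the depthwise filtering of the value stream, and the simple non-causal linear attention are illustrative assumptions, not the released JetBlock implementation.

```python
# Hypothetical sketch of a JetBlock-style dynamic kernel generator.
# Only the rule k = W2 * SiLU(W1 x) comes from the description above;
# everything else (shapes, where the kernel is applied) is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicKernelGenerator(nn.Module):
    """Generates per-token convolution kernels from the input features."""

    def __init__(self, d_model: int, kernel_size: int = 4, hidden: int = 64):
        super().__init__()
        self.kernel_size = kernel_size
        self.w1 = nn.Linear(d_model, hidden, bias=False)      # W_1
        self.w2 = nn.Linear(hidden, kernel_size, bias=False)  # W_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> kernels: (batch, seq, kernel_size)
        return self.w2(F.silu(self.w1(x)))                    # k = W_2 SiLU(W_1 x)


class JetBlockSketch(nn.Module):
    """Toy linear-attention block whose value stream is filtered by dynamic kernels."""

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.kernel_gen = DynamicKernelGenerator(d_model, kernel_size)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Causal depthwise filtering of values with input-dependent kernels.
        kernels = self.kernel_gen(x)                           # (b, t, K)
        v_pad = F.pad(v, (0, 0, self.kernel_size - 1, 0))      # left-pad time axis
        v_win = v_pad.unfold(1, self.kernel_size, 1)           # (b, t, d, K)
        v = torch.einsum("btdk,btk->btd", v_win, kernels)
        # Simple non-causal linear attention for brevity; a production block
        # would use a causal / recurrent formulation instead.
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("btd,bte->bde", k, v)
        z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + 1e-6)
        y = torch.einsum("btd,bde,bt->bte", q, kv, z)
        return self.out(y)
```

In a real hybrid stack, blocks of this kind would replace most full-attention layers, while a small number of softmax-attention layers are retained at the positions selected by PostNAS.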

2. Post Neural Architecture Search (PostNAS)

The development of Jet-Nemotron leverages an architecture discovery pipeline that operates post hoc on a pre-trained full-attention backbone:

  • MLP Parameter Freezing: The search commences from an existing full-attention model, with MLP weights frozen, focusing search solely on the attention stack. This reduces training cost and risk, given the high value of mature MLP representations.
  • Four-Stage Pipeline:

    1. Full-Attention Layer Placement and Elimination: A once-for-all "super network" encodes all candidate architectures. Beam search, guided by benchmark-driven objectives (e.g., minimizing loss on MMLU), identifies an optimal schedule of full-attention, linear attention, and SWA layers; a simplified sketch of this search appears at the end of this section.
    2. Linear Attention Block Selection: Several state-of-the-art linear attention modules (e.g., RWKV7, RetNet, Mamba2, Gated DeltaNet) are evaluated on the original pre-training task to identify the module with the highest transfer accuracy. Gated DeltaNet yielded the strongest results, informing the design of JetBlock.
    3. JetBlock Design: The kernel generator module dynamically parameterizes convolutional kernels based on input activations, enhancing linear attention expressivity with minimal computational overhead.
    4. Hardware-Aware Hyperparameter Search: Rather than using parameter count as a proxy, throughput is directly optimized, notably considering the key-value (KV) cache size—a primary bottleneck for decoding speed. Hyperparameters (Q/K/V dim, number of heads) are optimized within a fixed KV cache budget for maximum device throughput.

This pipeline dramatically reduces the experimental cost of model discovery while enabling hardware-constrained optimization.
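
As a concrete illustration of the placement search in stage 1, the snippet below sketches a beam search over which layer indices keep full attention under a fixed budget. The scoring function, beam width, and layer count are placeholders: the actual pipeline trains a once-for-all super network and scores candidates on downstream benchmarks such as MMLU rather than a toy function.

```python
# Simplified beam search over full-attention layer placement (illustrative only).
from typing import Callable, FrozenSet, List, Tuple


def beam_search_placement(
    num_layers: int,
    full_attn_budget: int,
    score_fn: Callable[[FrozenSet[int]], float],  # higher is better, e.g. negative loss
    beam_width: int = 4,
) -> FrozenSet[int]:
    """Return the set of layer indices that should retain full attention."""
    beam: List[Tuple[float, FrozenSet[int]]] = [(score_fn(frozenset()), frozenset())]
    for _ in range(full_attn_budget):
        # Expand every beam entry by one additional full-attention layer.
        candidates = {}
        for _, chosen in beam:
            for layer in range(num_layers):
                if layer in chosen:
                    continue
                candidate = chosen | frozenset({layer})
                if candidate not in candidates:
                    candidates[candidate] = score_fn(candidate)
        # Keep only the top-scoring partial placements.
        beam = sorted(((s, c) for c, s in candidates.items()),
                      key=lambda sc: sc[0], reverse=True)[:beam_width]
    return max(beam, key=lambda sc: sc[0])[1]


# Toy usage: a fake score that favors full attention near the middle of a 28-layer stack.
if __name__ == "__main__":
    toy_score = lambda layers: -sum((l - 14) ** 2 for l in layers)
    best = beam_search_placement(num_layers=28, full_attn_budget=2, score_fn=toy_score)
    print(sorted(best))  # e.g., [13, 14] or [14, 15]
```

In the actual pipeline, the scoring step would correspond to evaluating a sub-network extracted from the frozen-MLP super network, so each candidate placement can be assessed without retraining from scratch.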

3. Empirical Performance

Jet-Nemotron-2B demonstrates competitive or superior performance compared to recent models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2 across major language modeling and reasoning benchmarks, including MMLU, Mathematical Reasoning, Commonsense QA, and Retrieval:

Model           | MMLU Accuracy | Decoding Throughput (tokens/s) | Prefill Speedup
Qwen3-1.7B-Base | 61            | –                              | –
Jet-Nemotron-2B | Higher*       | 2,885                          | 6.1×

*Jet-Nemotron-2B attains higher accuracy on MMLU and MMLU-Pro than DeepSeek-V3-Small and Moonlight, models with 15B total and 2.2B activated parameters.

  • On long-context tasks (tested up to 256K tokens on NVIDIA H100), Jet-Nemotron-2B achieves up to 53.6× decoding speedup versus Qwen3-1.7B-Base, while maintaining or surpassing baseline accuracy.

  • Performance gains are most pronounced on throughput-critical tasks, with accuracy retained or improved across math, retrieval, and commonsense tasks.

4. Applications and Broader Implications

Jet-Nemotron's architectural advances yield practical advantages for several domains:

  • Real-time Conversational AI: Low latency and efficient context handling benefit dialogue agents deployed in interactive environments.
  • Long-Context Processing: Summarization, retrieval-augmented generation, and document-level contextualization are improved via the efficient hybrid attention scheme.
  • Structured Code Generation and Mathematical Reasoning: Strategic placement of full-attention layers ensures performance parity on complex reasoning and multi-step tasks.
  • Model Democratization: The PostNAS paradigm enables the upcycling of mature models, reducing both the energy and economic cost associated with innovation in LLMs.

A plausible implication is the emergence of a new standard workflow in model development, shifting the research focus away from training-from-scratch toward principled, hardware-aware post-training architectural refinement.

5. Technical Challenges and Solution Strategies

Several challenges were encountered and resolved during Jet-Nemotron's development:

  • Pre-training Cost: Training new models from scratch was circumvented by leveraging pre-existing full-attention networks and freezing the MLP, allowing efficient post hoc exploration of the attention mechanism space.
  • Attention Layer Importance: Uniform replacement of full-attention with efficient blocks is often suboptimal; the super network and beam search provided task-conditioned scheduling of attention types, mitigating performance bottlenecks.
  • Linear Block Expressivity: Existing linear options failed to consistently match full-attention accuracy; JetBlock addressed this by integrating dynamic convolutions and removing redundant computations (such as static convolutions on queries and keys).
  • Inference Bottlenecks: Parameter count proved a poor surrogate for speed. Direct hardware-aware search instead prioritized limits such as the KV cache footprint (illustrated in the sketch below), yielding greater on-device gains without regressing on accuracy.
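
To make the KV-cache argument concrete, the helper below estimates cache size from the number of layers that keep full attention. All dimensions and layer counts are illustrative placeholders, not Jet-Nemotron's published configuration.

```python
# Back-of-the-envelope KV cache estimate: only layers with full attention need a
# cache that grows with sequence length, so cutting those layers cuts the footprint.
def kv_cache_bytes(
    full_attn_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,   # fp16 / bf16
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for the full-attention layers."""
    per_token = 2 * full_attn_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical comparison at a 256K-token context (placeholder dimensions):
dense = kv_cache_bytes(full_attn_layers=28, num_kv_heads=8, head_dim=128, seq_len=256_000)
hybrid = kv_cache_bytes(full_attn_layers=2, num_kv_heads=8, head_dim=128, seq_len=256_000)
print(f"dense: {dense / 2**30:.1f} GiB, hybrid: {hybrid / 2**30:.1f} GiB")
# With these placeholder numbers: roughly 27 GiB versus roughly 2 GiB.
```

Because sliding-window and linear-attention layers keep only a bounded state rather than a cache that grows with the sequence, shrinking the number of full-attention layers directly shrinks the memory traffic that dominates long-context decoding.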

6. Future Directions

These results suggest that future research may focus on advancing the expressivity and adaptability of linear attention mechanisms, integrating domain-specific attention scheduling, and further automating hardware-aware architecture refinement. The PostNAS workflow provides a scalable template for extending performant, resource-efficient transformer variants to language modeling and other sequence modeling domains. The hybrid stacking principle is likely extensible to vision-language and multimodal tasks, where diverse contextual dependencies coexist alongside stringent efficiency requirements.
