
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search (2508.15884v1)

Published 21 Aug 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We present Jet-Nemotron, a new family of hybrid-architecture LLMs, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.


Summary

  • The paper introduces the PostNAS pipeline, which optimizes full-attention placement and linear attention block selection using frozen pre-trained MLPs.
  • The paper achieves substantial efficiency gains, with up to a 53.6× generation throughput speedup and a significant reduction in KV cache size for long-context tasks.
  • The paper demonstrates that the novel JetBlock design, integrated within a hybrid architecture, maintains state-of-the-art accuracy across diverse benchmarks.

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Introduction and Motivation

Jet-Nemotron introduces a new family of hybrid-architecture LLMs that achieve state-of-the-art accuracy while delivering substantial improvements in generation throughput. The central innovation is the Post Neural Architecture Search (PostNAS) pipeline, which enables efficient architecture exploration by leveraging pre-trained full-attention models and freezing their MLP weights. This approach circumvents the prohibitive cost and risk of pre-training from scratch, allowing rapid and hardware-aware adaptation of attention mechanisms.

The motivation stems from the quadratic complexity of self-attention in Transformers, which impedes efficiency, especially for long-context tasks. Prior work on linear and hybrid attention models has improved throughput but typically at the expense of accuracy on challenging benchmarks. Jet-Nemotron addresses this gap by systematically optimizing the placement and design of attention blocks, achieving both high accuracy and efficiency (Figure 1).

Figure 1: Jet-Nemotron models outperform state-of-the-art efficient LLMs in both accuracy and generation throughput on NVIDIA H100 GPUs at 64K context length.
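
The cost gap motivating this design can be made concrete: full attention must cache keys and values for every past token, so its decode-time state grows linearly with context length (and attention FLOPs grow quadratically), while a linear-attention block carries a fixed-size recurrent state. A back-of-the-envelope sketch follows; the head counts and dimensions are illustrative assumptions, not the paper's configurations:

```python
def full_attention_kv_elems(seq_len, n_kv_heads, head_dim):
    """Elements cached per layer by full attention (K and V for every token)."""
    return 2 * seq_len * n_kv_heads * head_dim

def linear_attention_state_elems(n_heads, head_dim):
    """Elements of the fixed recurrent state per layer (a d_k x d_v matrix per head)."""
    return n_heads * head_dim * head_dim

# Hypothetical small-model shapes, chosen only to show the scaling behavior.
for T in (4096, 65536):
    kv = full_attention_kv_elems(T, n_kv_heads=8, head_dim=128)
    st = linear_attention_state_elems(n_heads=8, head_dim=128)
    print(f"T={T:6d}  full-attn KV elems/layer={kv:,}  linear state elems/layer={st:,}")
```

At 64K tokens the full-attention cache is three orders of magnitude larger than the fixed linear-attention state, which is why long-context throughput hinges on how many full-attention layers remain.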

The PostNAS pipeline comprises four key stages:

  1. Full Attention Placement and Elimination: Instead of uniform placement, PostNAS learns the optimal locations for full-attention layers using a once-for-all super network and beam search. This method reveals that only a small subset of layers is critical for specific tasks, and their importance varies across domains (e.g., MMLU vs. retrieval); see Figure 2.

    Figure 2: The PostNAS pipeline starts from a pre-trained full-attention model, freezes the MLP, and performs a coarse-to-fine search for efficient attention block designs.

    Figure 3: PostNAS trains a super network and uses beam search to identify optimal full-attention layer placement.

    Figure 4: (a) Layer-wise search objective values for Qwen2.5-1.5B, showing non-uniform importance. (b) PostNAS placement yields higher accuracy than uniform strategies.

  2. Linear Attention Block Selection: PostNAS evaluates multiple state-of-the-art linear attention blocks (RWKV7, RetNet, Mamba2, GLA, DeltaNet, Gated DeltaNet) in the context of the frozen-MLP setup. Gated DeltaNet is selected for its superior accuracy, attributed to its data-dependent gating and delta-rule mechanisms.
  3. New Attention Block Design (JetBlock): JetBlock introduces dynamic causal convolution kernels generated conditionally on the input, applied to the value tokens. This design removes redundant static convolutions on Q/K, streamlining computation and improving accuracy with minimal overhead.
  4. Hardware-Aware Architecture Search: Rather than optimizing for parameter count, PostNAS targets generation throughput directly. The key finding is that KV cache size, not parameter count, is the dominant factor for throughput, especially in long-context scenarios. By fixing cache size and searching over key/value dimensions and head numbers, Jet-Nemotron achieves higher accuracy without sacrificing efficiency (Figure 5).

    Figure 5: PostNAS delivers significant accuracy improvements across all benchmarks compared to the baseline.
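
The JetBlock idea from stage 3, generating convolution kernels from the input and applying them causally to the value tokens, can be sketched as a toy function. The shapes, the linear kernel generator, and the single scalar weight per tap below are simplifying assumptions made for illustration; the paper's actual block differs in its details:

```python
import numpy as np

def dynamic_causal_conv(v, x, w_gen):
    """Apply a per-position causal convolution to value tokens, with the
    convolution kernels predicted from the input x rather than stored as
    static weights (a toy stand-in for JetBlock's dynamic convolution).

    v: (T, d) value tokens; x: (T, d) input tokens;
    w_gen: (d, kernel_size) projection that generates one kernel per position.
    """
    T, d = v.shape
    kernels = x @ w_gen                  # (T, kernel_size): input-conditioned kernels
    out = np.zeros_like(v)
    for t in range(T):
        for k in range(kernels.shape[1]):
            if t - k >= 0:               # causal: mix only current and past values
                out[t] += kernels[t, k] * v[t - k]
    return out
```

Because the kernels are a function of the input, the mixing pattern adapts per token, which is the property the paper credits for accuracy gains after removing the redundant static convolutions on Q/K.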

Empirical Results

Jet-Nemotron models are evaluated on a comprehensive suite of benchmarks: MMLU(-Pro), mathematical reasoning, commonsense reasoning, retrieval, coding, and long-context tasks. The results demonstrate:

  • Jet-Nemotron-2B achieves higher accuracy than Qwen3-1.7B-Base on MMLU-Pro, with a 47× increase in generation throughput and a 47× reduction in KV cache size.
  • Jet-Nemotron-4B maintains throughput advantages even as model size increases, outperforming all full-attention models with less than 2B parameters.
  • On math tasks, Jet-Nemotron-2B surpasses Qwen3-1.7B-Base by 6.3 points in accuracy while being 47× faster.
  • On commonsense reasoning, Jet-Nemotron-2B achieves the highest average accuracy among all baselines.
  • On retrieval and coding tasks, Jet-Nemotron models consistently match or exceed the best full-attention and hybrid baselines.
  • For long-context tasks (up to 64K tokens), Jet-Nemotron-2B and 4B deliver competitive or superior accuracy with dramatically higher throughput (Figure 6).

    Figure 6: Jet-Nemotron-2B achieves up to 6.14× speedup in prefilling and 53.6× speedup in decoding compared to Qwen3-1.7B-Base across context lengths.

Implementation and Practical Considerations

Jet-Nemotron is constructed by adapting pre-trained Qwen2.5 models, replacing most attention layers with JetBlock, and strategically retaining a small number of full-attention and sliding window attention layers. The training process involves two stages: distillation with frozen MLPs, followed by full-model training with additional high-quality data. Throughput is measured using chunk-prefilling and optimized batch sizes on H100 GPUs.
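
The frozen-MLP setup in the first training stage amounts to masking the MLP parameters out of the optimizer update. A minimal sketch with toy parameters follows; the parameter names and the plain SGD update are assumptions for illustration, whereas the paper distills a full pre-trained transformer:

```python
import numpy as np

# Toy parameter store: MLP weights are frozen, attention weights are trainable.
params = {
    "attn.w": np.ones((2, 2)),
    "mlp.w":  np.ones((2, 2)),
}
frozen = {"mlp.w"}  # PostNAS keeps the pre-trained MLP weights fixed

def sgd_step(params, grads, lr=0.1):
    """Update only non-frozen parameters, mimicking the frozen-MLP stage.

    In the real pipeline the gradients would come from a distillation loss
    against the full-attention teacher; here they are supplied directly.
    """
    for name, g in grads.items():
        if name not in frozen:
            params[name] -= lr * g
    return params

grads = {"attn.w": np.ones((2, 2)), "mlp.w": np.ones((2, 2))}
sgd_step(params, grads)
```

Restricting updates to the attention blocks is what makes each candidate architecture cheap to evaluate: only a small fraction of the model's parameters ever needs training during the search.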

The hardware-aware search ensures that Jet-Nemotron models are not only efficient in terms of FLOPs but also optimized for real-world deployment on modern accelerators. The reduction in KV cache size enables larger batch sizes and higher parallelism, which is critical for serving long-context applications.
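
The KV-cache argument can be quantified with a simple per-sequence budget calculation. The layer counts, head shapes, and fp16 storage below are illustrative assumptions rather than Jet-Nemotron's actual configuration:

```python
def kv_cache_gib(n_full_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size in GiB for the full-attention layers
    (fp16/bf16 by default). Linear-attention layers contribute only a
    constant-size state and are ignored here."""
    elems = 2 * n_full_attn_layers * n_kv_heads * head_dim * seq_len  # K and V
    return elems * bytes_per_elem / 2**30

# Hypothetical configs: a 28-layer full-attention model vs. a hybrid that
# retains only 2 full-attention layers, both at 64K context.
full = kv_cache_gib(28, 8, 128, 64 * 1024)
hybrid = kv_cache_gib(2, 8, 128, 64 * 1024)
print(f"full: {full:.2f} GiB/seq, hybrid: {hybrid:.2f} GiB/seq, "
      f"ratio: {full / hybrid:.0f}x")
```

Under these assumed shapes the hybrid's cache is 14× smaller per sequence, so a fixed GPU memory budget fits proportionally more concurrent sequences, which is the batch-size and parallelism effect described above.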

Theoretical and Practical Implications

Jet-Nemotron demonstrates that post-training architecture adaptation is a viable and effective strategy for advancing both efficiency and accuracy in LLMs. The findings challenge the conventional reliance on parameter count as a proxy for efficiency and highlight the importance of KV cache optimization. The dynamic convolutional design in JetBlock suggests new directions for enhancing linear attention mechanisms.

The PostNAS framework provides a rapid testbed for architectural innovation, filtering out unpromising designs before committing to expensive pre-training. This paradigm can accelerate progress in efficient model design, especially for organizations with limited computational resources.

Future Directions

Potential future developments include:

  • Extending PostNAS to other model families and modalities (e.g., vision, multimodal).
  • Further optimizing kernel implementations for JetBlock to improve short-context throughput.
  • Exploring adaptive attention placement strategies conditioned on input or task.
  • Integrating PostNAS with automated data selection and curriculum learning for even more efficient adaptation.

Conclusion

Jet-Nemotron, enabled by PostNAS and JetBlock, sets a new standard for efficient language modeling by matching or exceeding the accuracy of leading full-attention models while delivering order-of-magnitude improvements in generation throughput. The work establishes post-training architecture search as a practical and theoretically sound approach for scalable, hardware-aware model design, with broad implications for the future of efficient AI systems.
