Native Hybrid Attention (NHA)
- Native Hybrid Attention (NHA) is a hybrid attention architecture that unifies linear RNN-based long-term memory with sliding window softmax for accurate short-term recall.
- It employs a unified softmax over concatenated key-value caches to dynamically allocate attention between compressed memory and recent tokens without extra parameters.
- Empirical evaluations show NHA achieves state-of-the-art recall and efficiency on commonsense, long-context, and recall-sensitive tasks, and integrates seamlessly with pretrained LLMs.
Native Hybrid Attention (NHA) is a hybrid attention architecture designed for efficient sequence modeling. It unifies linear recurrent state compression and softmax-based sliding window attention, integrating both intra-layer and inter-layer hybridization in a structurally uniform layer. NHA maintains long-term context in a fixed set of key-value slots updated via a linear RNN and augments this memory with short-term context from a sliding window of recent tokens. A single softmax attention operation over the concatenated hybrid cache yields per-token, per-head dynamic allocation between long-term and short-term information, without additional parameters or fusion networks. Modulation between purely linear and full Transformer behavior is achieved by tuning a single hyperparameter: the window size. NHA demonstrates state-of-the-art empirical performance on recall-intensive, commonsense, and long-context tasks, and can structurally hybridize with existing pretrained LLMs while providing substantial efficiency improvements (Du et al., 8 Oct 2025).
1. Unified Hybrid Layer Architecture
NHA layers encapsulate dual memory structures: a long-term memory composed of $m$ fixed slots updated through a gated linear recurrence, and a short-term memory realized by a sliding window that stores the last $w$ tokens exactly. At each time step $t$, the projected keys and values are integrated as follows:
- Long-term memory update via linear RNN: a gating mechanism controls the per-slot updates of the long-term keys and values, $K_t^{\text{long}}$ and $V_t^{\text{long}}$.
- Short-term memory: a buffer of the last $w$ keys and values, $K_t^{\text{short}} = [k_{t-w+1}, \dots, k_t]$ and $V_t^{\text{short}} = [v_{t-w+1}, \dots, v_t]$.
- Hybrid key-value (KV) cache: both memories are concatenated, $K_t^{H} = [K_t^{\text{long}}; K_t^{\text{short}}]$ and $V_t^{H} = [V_t^{\text{long}}; V_t^{\text{short}}]$.
A unified softmax attention is applied over $(K_t^{H}, V_t^{H})$ for each query, creating a context-dependent allocation between the two memories. This avoids independently tuned fusion parameters and enables smooth context blending per token and head (Du et al., 8 Oct 2025).
Inter-layer hybridization is achieved by uniformly applying the same mechanism at each layer, with the window size $w$ as the sole varying hyperparameter. Varying $w$ from $0$ (pure linear RNN) to the sequence length $N$ (full softmax attention) enables interpolation across the efficiency-recall tradeoff without any architectural changes or parameter increases.
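The layer described above can be sketched as a single-head decoding step. This is a minimal illustration, not the authors' implementation: the class name `HybridCache`, the specific gate form (a convex blend of each slot with the new key or value), the choice to write every token into the long-term slots, and the $1/\sqrt{d}$ scaling are all assumptions made for the example.

```python
import numpy as np
from collections import deque


def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()


class HybridCache:
    """Toy single-head cache: m gated slots plus a window of the last w tokens."""

    def __init__(self, m, w, d):
        self.d = d
        self.K_long = np.zeros((m, d))  # long-term key slots
        self.V_long = np.zeros((m, d))  # long-term value slots
        self.window = deque(maxlen=w)   # exact (k, v) pairs; oldest is evicted

    def step(self, q, k, v, alpha):
        # ASSUMED gate form: alpha in (0, 1)^m decides how much of each slot
        # is kept versus overwritten by the new key/value.
        a = alpha[:, None]
        self.K_long = a * self.K_long + (1.0 - a) * k[None, :]
        self.V_long = a * self.V_long + (1.0 - a) * v[None, :]

        # Short-term memory: append the exact key/value to the sliding window.
        self.window.append((k, v))
        ks = [kv[0] for kv in self.window]
        vs = [kv[1] for kv in self.window]
        K_short = np.stack(ks) if ks else np.zeros((0, self.d))
        V_short = np.stack(vs) if vs else np.zeros((0, self.d))

        # Hybrid cache: concatenate both memories, then one softmax over all.
        K = np.concatenate([self.K_long, K_short], axis=0)
        V = np.concatenate([self.V_long, V_short], axis=0)
        weights = softmax(K @ q / np.sqrt(self.d))
        return weights @ V, weights


rng = np.random.default_rng(0)
cache = HybridCache(m=4, w=3, d=8)
for _ in range(10):
    q, k, v = rng.normal(size=(3, 8))
    alpha = rng.uniform(0.5, 1.0, size=4)  # stand-in for a learned gate
    out, weights = cache.step(q, k, v, alpha)
```

After the loop, `weights` has $m + w = 7$ entries: the first four are the mass placed on long-term slots and the last three the mass on recent tokens. No separate fusion parameter appears anywhere in the step.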
2. Complexity and Scaling Properties
NHA achieves near-linear runtime and memory scaling with sequence length $N$:
- Full softmax attention: $O(N^2)$ time per layer, $O(N)$ memory for the KV cache.
- Linear RNN attention: $O(N)$ time, $O(1)$ memory; lacks precise short-term recall.
- NHA: combined complexity of $O(N(m + w))$ time and $O(m + w)$ memory, which is near-linear because both the number of long-term slots ($m$) and the window size ($w$) are much smaller than $N$.
The total per-token, per-layer compute is dominated by the softmax computation over $m + w$ elements, a cost that stays fixed as the context grows, so NHA outpaces FlashAttention-based full attention on long inputs. Empirically, NHA matches or exceeds the speed of other linear hybrid approaches while retaining precision in recall-intensive regimes (Du et al., 8 Oct 2025).
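To make the scaling argument concrete, the sketch below counts query-key score computations for one head over a causal sequence of length $N$. The function names and the example sizes ($N = 8192$, $m = 64$, $w = 512$) are illustrative assumptions, not values taken from the paper, and the count ignores constant factors and the cost of the slot update itself.

```python
def full_attention_scores(n):
    # Causal full attention: token t attends to t keys (itself included).
    return n * (n + 1) // 2


def hybrid_attention_scores(n, m, w):
    # Hybrid cache: token t attends to m slots plus at most w recent tokens.
    return sum(m + min(t, w) for t in range(1, n + 1))


n, m, w = 8192, 64, 512  # hypothetical sizes for illustration only
full = full_attention_scores(n)
hybrid = hybrid_attention_scores(n, m, w)
print(f"full: {full}, hybrid: {hybrid}, ratio: {full / hybrid:.1f}x")
```

Once $t$ exceeds $w$, each new token costs exactly $m + w$ scores regardless of how long the sequence has become, which is where the near-linear total comes from.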
3. Unified Softmax and Fusion Mechanism
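The core of NHA is a single softmax across the hybrid set of $m + w$ keys and values, that is, the $m$ long-term slots concatenated with the $w$ window entries. Writing $K_t^{H}$ and $V_t^{H}$ for the concatenated keys and values at step $t$, the output for query $q_t$ takes the standard scaled dot-product form, with $d$ the head dimension:
$$o_t = \mathrm{softmax}\!\left(\frac{q_t {K_t^{H}}^{\top}}{\sqrt{d}}\right) V_t^{H}$$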
This enables the model, for each query and attention head, to allocate focus dynamically between compressed long-term history and exact recent tokens. No additional fusion network or weighting parameters are required. Empirical ablations confirm that this unified operation is critical: replacing it with fixed or learned weighted-sum fusions results in substantial losses in recall accuracy (from 38.60 to 34.06/33.59 avg recall; Table 7) (Du et al., 8 Oct 2025).
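One way to see why no fusion parameter is needed: a softmax over concatenated scores is algebraically the same as mixing two separate softmax outputs with a gate equal to the share of attention mass that falls on the long-term slots. The sketch below checks this identity numerically; the function names and random inputs are illustrative assumptions.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def unified_attention(s_long, s_short, V_long, V_short):
    # One softmax over the concatenated scores of both memories.
    p = softmax(np.concatenate([s_long, s_short]))
    return p @ np.concatenate([V_long, V_short], axis=0)


def implied_gate(s_long, s_short):
    # Share of the unified softmax mass that lands on long-term slots.
    p = softmax(np.concatenate([s_long, s_short]))
    return p[: len(s_long)].sum()


def gated_fusion(s_long, s_short, V_long, V_short, g):
    # Two separate softmaxes mixed by a scalar gate g.
    return g * (softmax(s_long) @ V_long) + (1.0 - g) * (softmax(s_short) @ V_short)


rng = np.random.default_rng(1)
s_long, s_short = rng.normal(size=4), rng.normal(size=3)
V_long, V_short = rng.normal(size=(4, 8)), rng.normal(size=(3, 8))

g = implied_gate(s_long, s_short)
same = np.allclose(
    unified_attention(s_long, s_short, V_long, V_short),
    gated_fusion(s_long, s_short, V_long, V_short, g),
)
```

Here `same` is `True` for the query-dependent gate `g`, whereas a fixed gate such as `0.5` would generally give a different output. The gate changes with every query's scores, which is the per-token, per-head allocation that a fixed weighted sum cannot reproduce.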
4. Empirical Performance and Ablation Studies
NHA exhibits leading accuracy and efficiency profiles on a range of benchmarks:
- Recall-sensitive QA tasks: On datasets such as FDA, SWDE, SQuAD, NQ, TriviaQA, and DROP, NHA surpasses other hybrids, attaining 38.60 (340M params) and 45.58 (1.3B params) average recall, outperforming the next-best hybrids.
- Commonsense reasoning: On WikiText, LAMBADA, ARC-E, ARC-C, PIQA, HellaSwag, and WinoGrande, NHA achieves the lowest perplexities and the highest mean accuracies (Table 1).
- Long-context extrapolation: RULER benchmarks reveal superior scaling from 1K to 8K tokens relative to both Transformers and hybrid baselines.
- Ablation with respect to memory types: Removing long-term slots reduces recall from 38.60 to 29.58; removal of short-term drops recall to 36.97.
- Ablation of fusion strategies and memory parameters: Best observed recall is achieved at an intermediate window size $w$; a very small $w$ reduces short-term precision, and a very large $w$ converges to full softmax behavior (Appendix A.4).
These results establish that both memory components and the unconstrained per-query blending enabled by the unified softmax are essential to NHA's empirical efficacy (Du et al., 8 Oct 2025).
5. Interpolation Across Attention Regimes
A defining property of NHA is its continuous interpolation capability. By varying the window size $w$, NHA emulates:
- Linear RNN attention: $w = 0$.
- Full Transformer softmax: $w = N$ (where $N$ is the sequence length).
Intermediate settings allow for fine-grained selection of the efficiency-recall tradeoff. Critically, this is achieved with no change in model parameters or architecture (weight sharing across layers), permitting application to pretrained models and diverse memory-constrained scenarios. This property differentiates NHA from prior hybrids requiring bespoke parameterizations or multi-branch fusion layers (Du et al., 8 Oct 2025).
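The two endpoints can be illustrated by listing which token positions are held exactly in the short-term window at a given step. This is a small sketch with a hypothetical helper name, using 0-indexed positions; it tracks only the exact window and says nothing about the long-term slots.

```python
def exact_positions(t, w):
    # Positions kept verbatim at step t (0-indexed) for window size w.
    return list(range(max(0, t - w + 1), t + 1))


n = 6  # toy sequence length
linear_like = exact_positions(n - 1, 0)  # w = 0: no exact tokens, slots only
windowed = exact_positions(n - 1, 3)     # 0 < w < N: the last 3 tokens
full_like = exact_positions(n - 1, n)    # w = N: every token kept exactly
```

With `w = 0` the list is empty, so the layer can only read compressed slots; with `w = n` the window covers the entire prefix, as in full causal attention. Nothing other than `w` changes between the three calls.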
6. Integration with Pretrained LLMs
NHA can be directly integrated with open-source pretrained LLMs. In experiments replacing all but four Transformer layers with NHA layers (appropriately tuning $w$ per block), models such as Llama-3-8B and Qwen2.5-7B retained original accuracy on benchmarks (e.g., ARC-E: 80.76 for NHA-Llama versus 80.30 for full Llama) while realizing 20–40% improvements in inference latency and memory at input lengths of 1K–4K tokens. This demonstrates that NHA layers can substitute for full attention layers in large-scale autoregressive models with minimal finetuning, yielding both performance and scalability benefits. Larger checkpoints such as Qwen3-30B-A3B show comparable trends, indicating robust scalability of NHA hybridization (Du et al., 8 Oct 2025).
7. Summary and Research Context
Native Hybrid Attention combines linear and softmax-based attention paradigms into a structurally uniform, parameter-efficient layer. Its hybrid memory structure and unified attention operation deliver near-linear complexity, strong recall, and flexible matching to a spectrum of application requirements through adjustment of a single window-size hyperparameter. The architecture is empirically validated to outperform state-of-the-art hybrid and full-attention baselines on recall-intensive, commonsense, and extrapolation tasks, and integrates successfully with pretrained LLMs, offering significant practical efficiency gains. NHA thus represents a unified framework for efficient sequence modeling, balancing the tradeoffs of context compression and local precision in modern neural network attention mechanisms (Du et al., 8 Oct 2025).