
Hierarchical Global Attention (HGA)

Published 29 Jun 2026 in cs.LG and cs.AI | (2606.30709v1)

Abstract: Hierarchical Global Attention (HGA) is a drop-in replacement for dense causal attention in pretrained long-context transformers. HGA preserves the original checkpoint parameters: the pretrained $W_Q$, $W_K$, $W_V$, and $W_O$ projections remain unchanged, no calibration parameters are introduced, and no retraining is required. Applied to Qwen3-30B-A3B-Instruct-2507-FP8 on a single RTX 5090 (32GB), the patched model runs out of the box at a 64K-token context, where token-level K/V storage is not feasible on this hardware. Unlike previous sparse-attention methods, HGA performs hierarchical two-level routing. It first retrieves relevant chunks using compact RoPE-aware summaries and then refines the selection by routing only the most relevant groups before performing exact token-level attention. This hierarchical retrieval significantly reduces the number of fetched tokens while preserving exact attention over the retrieved token set, making RAM- and NVMe-backed storage practical. The full historical token K/V resides in host RAM or NVMe storage, while only a small routed working set is transferred to GPU memory during attention. Consequently, GPU memory consumption depends primarily on model weights and the routed working set rather than on the total context length. Across all tested context lengths (4K–64K tokens), routed attention remains within approximately $0.01$--$0.02$ nats of dense attention while the sparsity used is just about 3%. These results suggest that the approximation introduced by hierarchical routing is small, and that the remaining quality gap is likely dominated by long-context positional encoding rather than by the routing algorithm itself.

Summary

  • The paper presents a hierarchical two-stage routing mechanism that partitions long sequences into chunks and groups to enable efficient exact-token attention with minimal loss gap.
  • It achieves comparable performance to dense attention at 12.5% to 1.9% sparsity, delivering speedups of up to 2.72× in training and 2.43× in inference on constrained hardware.
  • The approach preserves original model parameters and checkpoint compatibility while addressing memory bottlenecks and positional encoding limitations in long-context transformers.

Hierarchical Global Attention: A Drop-In Exact-Token Routing Approach for Long-Context Transformers

Overview and Motivation

Hierarchical Global Attention (HGA) introduces a hierarchical two-stage routing mechanism aimed at overcoming the memory and computational bottlenecks of dense attention in pretrained long-context transformers. The central challenge addressed is the infeasibility of storing the full key/value (K/V) cache for long sequences (e.g., 32K–64K tokens) in GPU memory, particularly for quantized LLMs where model weights alone nearly exhaust typical hardware capabilities. HGA facilitates context lengths far exceeding previously practical limits without requiring retraining, fine-tuning, or modification of existing model parameters, thus enabling direct deployment on pretrained models without model-specific calibration.

Methodology

Hierarchical Chunk-and-Group Routing

HGA partitions the input sequence into fixed-size chunks (e.g., 64 tokens), which are further subdivided into groups. It employs a two-level content-based routing strategy:

  1. Chunk-Level Routing: Chunk summaries, computed using RoPE-aware projections from the original model, represent each chunk. Queries score these summaries to select the most relevant historical chunks, which may reside in CPU or NVMe storage.
  2. Group-Level Routing: Within selected chunks, group summaries further refine which token subsets to retrieve, ensuring efficient yet precise narrowing of the attention window.

This two-level hierarchy ensures that, although most of the context may live outside accelerator memory, only a small, highly relevant set of K/V tokens is fetched to the GPU for exact softmax attention over that set.
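The chunk-then-group selection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `two_level_route`, the group size of 16, the top-k budgets, and the use of plain key means as summaries are all assumptions, and causal masking and RoPE handling are left out.

```python
import numpy as np

def two_level_route(q, keys, chunk_size=64, group_size=16,
                    top_chunks=2, top_groups=2):
    """Return sorted token indices chosen by chunk-then-group routing.

    q: (d,) query vector; keys: (n, d) with n divisible by chunk_size.
    Summaries are plain means of keys (an illustrative assumption).
    """
    n, d = keys.shape
    groups_per_chunk = chunk_size // group_size
    # Group summaries, shape (num_chunks, groups_per_chunk, d).
    group_sum = keys.reshape(
        n // chunk_size, groups_per_chunk, group_size, d
    ).mean(axis=2)
    # Chunk summaries: mean over equally sized groups equals mean over tokens.
    chunk_sum = group_sum.mean(axis=1)

    # Stage 1: score chunk summaries and keep the best chunks.
    chunk_scores = chunk_sum @ q
    chosen_chunks = np.argsort(chunk_scores)[::-1][:top_chunks]

    # Stage 2: inside each chosen chunk, keep the best groups.
    selected = []
    for c in chosen_chunks:
        group_scores = group_sum[c] @ q
        chosen_groups = np.argsort(group_scores)[::-1][:top_groups]
        for g in chosen_groups:
            start = c * chunk_size + g * group_size
            selected.extend(range(start, start + group_size))
    return np.array(sorted(selected))
```

Only the returned token indices would need to be fetched for exact attention; with the defaults above that is at most 2 × 2 × 16 = 64 tokens regardless of the sequence length.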

Routing Summaries and Compatibility

Summaries used for routing are not trainable parameters; rather, they are direct functions of projected keys (means/sums with mixed RoPE handling for low- and high-frequency dimensions). The approach requires no calibration parameters and leaves all original $W_Q$, $W_K$, $W_V$, and $W_O$ projections, as well as normalization layers, unaltered.
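To make the "no trainable parameters" point concrete, the sketch below builds chunk summaries as plain means of keys produced by a frozen key projection. The helper name `chunk_summaries` and the shapes are hypothetical, and the mixed RoPE handling mentioned above is deliberately not modeled, since its details are not given here.

```python
import numpy as np

def chunk_summaries(hidden, W_K, chunk_size=64):
    """Compute per-chunk mean-key summaries from a frozen key projection.

    hidden: (n, d_model) hidden states; W_K: (d_model, d_head), unchanged.
    Returns (keys, summaries) with summaries of shape (n // chunk_size, d_head).
    No parameters are learned: the summary is a fixed function of the keys.
    """
    keys = hidden @ W_K
    n, d = keys.shape
    summaries = keys.reshape(n // chunk_size, chunk_size, d).mean(axis=1)
    return keys, summaries
```

For a mean summary, the routing score `summaries @ q` equals the average of the per-token logits `keys @ q` within each chunk, by linearity of the dot product. That gives some intuition for why such a summary can act as a cheap proxy for chunk relevance.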

Deterministic & Content-Based Visibility

Besides routed chunks/groups, HGA deterministically includes fixed "sink" chunks (the start of the sequence), a sliding window of recent local chunks, and the chunk currently being processed. This guarantees baseline coverage of causal and local context, much as static sparse attention patterns do.
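The union of deterministic and routed visibility can be written as a small set computation. In this sketch the function name `visible_chunks` and the default counts for sink and local chunks are illustrative assumptions, not values from the paper.

```python
def visible_chunks(current, routed, num_sink=1, local_window=2):
    """Return the sorted chunk ids visible to queries in chunk `current`.

    routed: chunk ids proposed by the content-based router.
    Sink, local, and current chunks are always included; routed chunks
    are kept only if they lie in the past (causality).
    """
    # Sink chunks at the start of the sequence (never beyond `current`).
    sinks = set(range(min(num_sink, current + 1)))
    # Sliding window of recent chunks, including the current one.
    local = set(range(max(0, current - local_window), current + 1))
    # Content-based picks, restricted to causal history.
    causal_routed = {c for c in routed if c < current}
    return sorted(sinks | local | causal_routed)

print(visible_chunks(10, [3, 7, 12]))  # [0, 3, 7, 8, 9, 10]
```

Chunk 12 is dropped in the example because it lies in the future of chunk 10.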

Storage and System Implications

The implementation supports tiered K/V cache management: "hot" (always-resident) chunks and summaries on the GPU, a "warm" LRU cache for frequently accessed chunks, and "cold" historical K/V storage in CPU memory or secondary storage. As a result, the GPU memory footprint depends on the model weights and the routed working set, and is decoupled from the total context length.
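A toy version of the hot/warm/cold lookup path is sketched below. The class name `TieredKVCache`, the dictionary standing in for host RAM or NVMe, and the `cold_fetches` counter are hypothetical; a real system would move tensors between devices rather than Python objects.

```python
from collections import OrderedDict

class TieredKVCache:
    """Hot (pinned) + warm (LRU) + cold (backing store) chunk cache."""

    def __init__(self, cold_store, hot_ids, warm_capacity):
        self.cold = cold_store                      # stand-in for RAM/NVMe
        self.hot = {c: cold_store[c] for c in hot_ids}  # always resident
        self.warm = OrderedDict()                   # LRU order: oldest first
        self.warm_capacity = warm_capacity
        self.cold_fetches = 0                       # counts slow-path loads

    def get(self, chunk_id):
        if chunk_id in self.hot:
            return self.hot[chunk_id]
        if chunk_id in self.warm:
            self.warm.move_to_end(chunk_id)         # mark as recently used
            return self.warm[chunk_id]
        kv = self.cold[chunk_id]                    # slow path
        self.cold_fetches += 1
        self.warm[chunk_id] = kv
        if len(self.warm) > self.warm_capacity:
            self.warm.popitem(last=False)           # evict least recently used
        return kv
```

Resident memory in this sketch is bounded by the number of hot chunks plus `warm_capacity`, independent of how many chunks the cold store holds.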

Results

Loss Gap and Sparsity Analysis

Experimental results demonstrate that at context lengths up to 64K, HGA achieves validation losses within approximately 0.01–0.02 nats of dense attention, using as little as 3–12% of the token pairs for attention. The loss gap does not scale rapidly with increasing sequence length, indicating robustness of the hierarchical routing design.

  • For Qwen3-30B-A3B-Instruct-2507-FP8 at 32K tokens, the out-of-the-box HGA loss gap is <0.01 nats at 12.5% sparsity.
  • For 40M SmallLM, direct weight copy yields a +0.018 nat gap at 8K tokens.
  • In needle-in-a-haystack evaluations at 64K tokens, HGA achieves 100% retrieval accuracy with only 1.9% sparsity.

Speed and Scalability

HGA enables major throughput improvements:

  • For a 40M model, HGA offers 2.72× speedup in training and 2.43× speedup in inference over the dense baseline at 12K tokens.
  • The system supports large FP8-quantized models (e.g., Qwen3-30B) at 32K-64K contexts on mainstream 32GB GPUs, which is infeasible using dense attention.

Fine-tuning and Correctness

Fine-tuning with HGA incurs only a small quality penalty (a loss gap of approximately 0.015 nats) and preserves the compatibility of dense and HGA-trained checkpoints, indicating that hierarchical routing does not degrade the underlying model representations. Correctness checks validate that, with full coverage, routed attention is numerically equivalent to dense SDPA, which supports the soundness of the hierarchical router.
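The full-coverage equivalence check can be illustrated for a single query in NumPy. This is a sketch of the idea only: the function names are hypothetical, it uses a hand-written softmax rather than a fused SDPA kernel, and it omits batching, heads, and causal masks.

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def dense_attention(q, K, V):
    """Single-query scaled dot-product attention over all keys."""
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

def routed_attention(q, K, V, idx):
    """Exact attention restricted to the routed token indices `idx`."""
    return dense_attention(q, K[idx], V[idx])

rng = np.random.default_rng(0)
n, d = 256, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# With every token selected, routing reduces to dense attention.
full = routed_attention(q, K, V, np.arange(n))
assert np.allclose(full, dense_attention(q, K, V))
```

With a proper subset of indices the output generally differs from the dense result, and that difference is the approximation the reported loss gaps measure.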

Interaction with Positional Encoding

A significant empirical insight is that the remaining validation loss gap arises predominantly from the limitations of positional encoding over long contexts (e.g., RoPE extrapolation artifacts), not from routing sparsity itself. Modifications such as RoPE index wrapping reduce this gap, highlighting the importance of positional representations in the sparse attention regime.

Implications and Theoretical Considerations

HGA's ability to decouple memory usage from context length, while maintaining nearly full-quality retrieval and attention, carries implications for broad deployment of pretrained LLMs on constrained hardware. The method’s strict adherence to checkpoint compatibility positions it as a systems-level solution, synergistic with techniques such as YaRN for context extension and downstream task adaptation via fine-tuning.

From a theoretical perspective, the results imply that substantial computational and memory savings are attainable in attention without sacrificing model fidelity, provided that learned content-relevance and positional alignment are well preserved. The two-level routing scheme, leveraging inherent redundancy within local sequence segments, dramatically reduces the effective size of the token working set for attention.

Limitations and Future Directions

Limitations include the risk of missing relevant tokens that the content-based router does not select, constraints imposed by the coverage of chunk/group summaries, and the current evaluation’s focus on a limited set of retrieval and language modeling tasks. Notably, HGA’s performance on contexts exceeding typical pretraining ranges, on tasks with unusual retrieval requirements, and in combination with context extension methods like YaRN deserves comprehensive study.

Potential avenues for future research include:

  • Optimizing the interaction between hierarchical routing and extended positional encodings,
  • Implementing adaptive or uncertainty-aware route budgets,
  • Investigating lightweight trainable routing summaries for tighter token selection,
  • Extending systematic evaluation to broader benchmarks, including in-depth memory-bandwidth and latency analyses,
  • Scaling to ultra-long contexts (128K tokens and beyond).

Conclusion

Hierarchical Global Attention provides an effective and pragmatic mechanism for deploying long-context transformers at scale without retraining or model recalibration. By architecting an exact-token, hierarchical content-based routing system that is fully compatible with original attention projections, HGA substantially narrows the gap between dense and sparse attention. The residual limitations are shown to be primarily positional-encoding related, rather than stemming from the routing methodology itself. HGA thereby delineates a promising trajectory for efficient, scalable transformer deployment on contemporary hardware, and serves as a foundational abstraction for future advancements in memory-efficient long-context inference.
