Hierarchical Global Attention: Scaling Transformers to 64K Tokens
Hierarchical Global Attention (HGA) introduces a two-stage routing mechanism that enables pretrained transformers to handle context lengths of 32K to 64K tokens on standard GPUs without retraining. By partitioning sequences into chunks and groups, then selecting only the most relevant 3-12% of tokens for exact attention, HGA achieves near-dense quality while fitting massive contexts into constrained hardware. The approach demonstrates that transformer memory bottlenecks can be overcome through intelligent content-based routing rather than model redesign.
Script
A 30-billion-parameter language model running on a single 32-gigabyte GPU, processing 64,000 tokens at once. Until now, that was impossible because the attention mechanism's memory demands scaled with the square of the sequence length, exhausting hardware long before reaching useful context sizes.
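To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a naive implementation that materializes the full score matrix for a single attention head in 16-bit precision; the function name and the 2-byte element size are illustrative assumptions, not details from the source.

```python
def attention_score_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one head's full seq_len x seq_len score matrix.

    Assumes the matrix is materialized in full (naive attention) and that
    each score takes bytes_per_element bytes (2 for fp16, an assumption).
    """
    return seq_len * seq_len * bytes_per_element


# Quadrupling the sequence length multiplies the memory by sixteen.
for n in (4_000, 16_000, 64_000):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>6} tokens -> {gib:6.2f} GiB per head")
```

Under these assumptions, a single head at 64,000 tokens already needs about 8.2 billion bytes for its scores alone, before counting weights, activations, or other heads.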
Hierarchical Global Attention solves this by organizing tokens into 64-token chunks, then routing attention in two stages. First, chunk summaries help queries identify which historical chunks matter. Then, within the chosen chunks, group summaries narrow the search further to specific token subsets.
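The two-stage routing described above can be sketched for a single query as follows. This is a minimal illustration, not the method's actual implementation: the source does not say how summaries are built or scored, so the sketch assumes mean-pooled key summaries, dot-product scoring, and top-k selection. The group size of 16, the top-k counts, and the names `route` and `sparse_attention` are all hypothetical.

```python
import numpy as np

CHUNK = 64   # chunk size stated in the text
GROUP = 16   # assumed group size (not given in the text)


def route(query, keys, top_chunks=4, top_groups=2):
    """Return sorted indices of the tokens selected for exact attention.

    query: (d,) vector; keys: (n, d) array of past keys, n divisible by CHUNK.
    """
    n, d = keys.shape
    assert n % CHUNK == 0, "sketch assumes the history is a whole number of chunks"
    groups_per_chunk = CHUNK // GROUP

    # Group summaries (assumed: mean of the keys in each group).
    group_summ = keys.reshape(n // CHUNK, groups_per_chunk, GROUP, d).mean(axis=2)
    # Chunk summaries (assumed: mean of the chunk's group summaries).
    chunk_summ = group_summ.mean(axis=1)

    # Stage 1: keep the chunks whose summaries score highest against the query.
    chunk_scores = chunk_summ @ query
    k_c = min(top_chunks, len(chunk_scores))
    chosen_chunks = np.argsort(chunk_scores)[-k_c:]

    # Stage 2: inside each chosen chunk, keep the best-scoring groups.
    selected = []
    k_g = min(top_groups, groups_per_chunk)
    for c in chosen_chunks:
        group_scores = group_summ[c] @ query
        for g in np.argsort(group_scores)[-k_g:]:
            start = c * CHUNK + g * GROUP
            selected.extend(range(start, start + GROUP))
    return np.array(sorted(selected))


def sparse_attention(query, keys, values, **route_kwargs):
    """Exact softmax attention restricted to the routed tokens."""
    idx = route(query, keys, **route_kwargs)
    scores = keys[idx] @ query / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[idx], idx
```

Because the summaries in this sketch are plain averages, the router adds no trainable parameters. With the default settings it selects 4 chunks times 2 groups times 16 tokens, so exact attention touches 128 tokens regardless of how long the history is.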
Only the selected tokens are fetched to GPU memory for exact attention computation. The rest live in CPU memory or even secondary storage, with a tiered cache managing what stays hot. This decouples GPU memory usage from total context length entirely.
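One way to picture the tiered cache is a small hot tier sitting in front of a larger cold store. The sketch below is a toy model under stated assumptions: the source does not name an eviction policy, so least-recently-used eviction is assumed here, and plain Python dictionaries stand in for GPU and CPU memory. The class name `TieredChunkCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class TieredChunkCache:
    """Toy two-tier cache: a small hot tier (standing in for GPU memory)
    in front of a large cold store (standing in for CPU memory or disk)."""

    def __init__(self, cold_store, hot_capacity):
        self.cold = cold_store        # chunk_id -> chunk data, holds everything
        self.hot = OrderedDict()      # ordered from least to most recently used
        self.capacity = hot_capacity
        self.hits = 0
        self.misses = 0

    def fetch(self, chunk_id):
        """Return a chunk, promoting it to the hot tier if it was cold."""
        if chunk_id in self.hot:
            self.hot.move_to_end(chunk_id)    # mark as most recently used
            self.hits += 1
            return self.hot[chunk_id]
        self.misses += 1
        data = self.cold[chunk_id]            # simulated host-to-device copy
        self.hot[chunk_id] = data
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False)      # evict the least recently used chunk
        return data


cache = TieredChunkCache({i: f"chunk-{i}" for i in range(10)}, hot_capacity=2)
for chunk_id in (0, 1, 0, 2):
    cache.fetch(chunk_id)
print(list(cache.hot), cache.hits, cache.misses)
```

The hot tier never grows beyond `hot_capacity` entries however many chunks the cold store holds, which is the sense in which the fast tier's footprint is independent of total context length.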
At 32,000 tokens, HGA achieves validation loss within 0.01 nats of dense attention while using only 12% of token pairs. At 64,000 tokens in needle-in-a-haystack tests, it reaches 100% retrieval accuracy at just 1.9% sparsity. The method requires no retraining and leaves all original model weights untouched.
The remaining quality gap traces not to the routing itself, but to positional encoding struggling over long distances. HGA checkpoints remain fully compatible with dense attention, and the router uses no trainable parameters. It is purely a systems-level solution that respects the original model's learned representations.
Hierarchical Global Attention demonstrates that memory bottlenecks in transformers are not fundamental limits, but design choices we can route around.