HSA-UltraLong: Ultra-Long Context LLM
- HSA-UltraLong is an 8B-parameter Mixture-of-Experts LLM that leverages Hierarchical Sparse Attention for efficient ultra-long context processing.
- It replaces quadratic attention with a two-level sparsity mechanism, significantly reducing computational cost while enabling accurate retrieval from 16M tokens.
- The architecture integrates standard and MoE layers with a multi-phase curriculum, ensuring robust in-domain performance and effective reasoning over extended contexts.
HSA-UltraLong is an 8B-parameter Mixture-of-Experts (MoE) decoder architecture for LLMs, integrating the Hierarchical Sparse Attention (HSA) mechanism to yield efficient ultra-long context modeling up to 16 million tokens. Developed as a foundation for “Machines that Can Remember,” HSA-UltraLong achieves near-linear compute scaling with respect to context length while preserving strong in-domain performance and high-accuracy retrieval from ultra-long contexts (Hu et al., 28 Nov 2025).
1. Hierarchical Sparse Attention (HSA) Mechanism
The HSA mechanism supersedes conventional quadratic attention by enforcing a two-level sparsity pattern, addressing the core requirements of ultra-long context modeling: sparsity, random-access flexibility, and length generalization. Contexts of length $L$ are partitioned into non-overlapping fixed-size chunks of $S$ tokens each. Each chunk $i$ is associated with:
- A learned landmark key vector $\mathbf{l}_i \in \mathbb{R}^{d}$,
- A per-chunk KV cache of shape $S \times h \times d_h$ (with $h$ attention heads and head dimension $d_h$).
For each decoding position $t$, a retrieval query $\mathbf{q}^{\text{ret}}_t$ computes scores $s_{t,i} = \mathbf{q}^{\text{ret}\,\top}_t \mathbf{l}_i / \sqrt{d}$ by dot product with the landmark keys of all preceding chunks, normalized by $\sqrt{d}$. The top-$k$ scoring chunk indices $\mathcal{T}_t = \operatorname{top-}k_i\,(s_{t,i})$ are selected, providing random-access selection.
Intra-chunk multi-head attention is performed between the token's attention queries and the KV cache of each selected chunk $i \in \mathcal{T}_t$, producing per-chunk outputs $\mathbf{o}_{t,i}$. These outputs are fused by retrieval-score-based weights $w_{t,i} = \operatorname{softmax}_{i \in \mathcal{T}_t}(s_{t,i})$, yielding the final token output $\mathbf{o}_t = \sum_{i \in \mathcal{T}_t} w_{t,i}\,\mathbf{o}_{t,i}$.
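This two-level procedure can be sketched as follows. The PyTorch function below is a minimal, single-position illustration; the function name, tensor shapes, and the exact normalization are assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn.functional as F

def hsa_attend(q_ret, q_attn, landmarks, chunk_k, chunk_v, top_k):
    """Two-level HSA for a single decoding position (illustrative sketch).

    q_ret:     (d,)            retrieval query used only for chunk selection
    q_attn:    (h, d_h)        per-head query for intra-chunk attention
    landmarks: (C, d)          one learned landmark key per preceding chunk
    chunk_k:   (C, S, h, d_h)  per-chunk key cache
    chunk_v:   (C, S, h, d_h)  per-chunk value cache
    """
    d = q_ret.shape[-1]
    d_h = chunk_k.shape[-1]

    # Level 1: score every preceding chunk via its landmark key, keep the top-k.
    scores = landmarks @ q_ret / d ** 0.5                     # (C,)
    top_scores, top_idx = scores.topk(min(top_k, scores.numel()))
    weights = F.softmax(top_scores, dim=-1)                   # fusion weights over selected chunks

    # Level 2: standard multi-head attention restricted to each selected chunk.
    per_chunk_out = []
    for idx in top_idx.tolist():
        k, v = chunk_k[idx], chunk_v[idx]                     # (S, h, d_h) each
        att = torch.einsum('hd,shd->hs', q_attn, k) / d_h ** 0.5
        att = F.softmax(att, dim=-1)                          # attention over the S tokens in the chunk
        per_chunk_out.append(torch.einsum('hs,shd->hd', att, v))

    # Fuse per-chunk outputs with the retrieval-score weights.
    out = torch.stack(per_chunk_out)                          # (k, h, d_h)
    return (weights[:, None, None] * out).sum(dim=0)          # (h, d_h)
```

The released kernels batch and fuse this computation across tokens and heads; the explicit per-chunk loop above is only for readability.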
The HSA construction ensures that each token attends only to $k$ chunks (sparsity), can randomly access arbitrary historical information (random-access flexibility), and, with NoPE positional encoding, generalizes retrieval behavior across context lengths during inference (length generalization).
Computational complexity per token is reduced from $O(L)$ under full attention to approximately $O(L/S + kS)$ under HSA (landmark scoring over the $L/S$ preceding chunks plus attention within the $k$ selected chunks of size $S$), which yields near-linear compute scaling with respect to context length.
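As a purely illustrative back-of-the-envelope comparison (the chunk size $S = 128$ and $k = 16$ below are hypothetical values, not configuration reported in the paper):

```latex
% Per-token cost at L = 16M tokens, under assumed S = 128 and k = 16:
\begin{aligned}
\text{Full attention:} \quad & O(L) \;=\; 1.68\times 10^{7}\ \text{key--value interactions} \\
\text{HSA:} \quad & O\!\left(\tfrac{L}{S} + kS\right) \;=\; \tfrac{1.68\times 10^{7}}{128} + 16\cdot 128 \;\approx\; 1.3\times 10^{5}
\end{aligned}
```

Under these assumed values, HSA performs roughly two orders of magnitude fewer key-value interactions per token at 16M context.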
2. Model Architecture and Integration
HSA-UltraLong is built as an $N$-layer decoder divided into two sequential halves:
- Lower decoder (first $N/2$ layers): Standard Transformer blocks with Sliding-Window Attention (SWA), using window sizes up to 4K tokens and Rotary Positional Embedding (RoPE). MLPs are dense.
- Upper decoder (remaining $N/2$ layers): Organized into groups, each consisting of:
  - One hybrid layer with both SWA and an HSA module (HSA operates with NoPE),
  - Followed by additional SWA-only layers (a schematic layer layout is sketched after this list).
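The layout can be summarized schematically as below; the layer count and group size used in the example are hypothetical, and the configuration class is invented for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LayerSpec:
    """One decoder layer in an HSA-UltraLong-style stack (schematic, not the released config)."""
    use_swa: bool = True    # sliding-window attention is present in every layer
    use_hsa: bool = False   # HSA module only in the first layer of each upper-half group
    moe_mlp: bool = False   # MoE MLP in layers 2 and higher, dense MLP otherwise

def build_layout(n_layers: int, group_size: int) -> List[LayerSpec]:
    """Lower half: SWA-only blocks. Upper half: groups of one SWA+HSA hybrid layer
    followed by (group_size - 1) SWA-only layers."""
    half = n_layers // 2
    layout = []
    for i in range(n_layers):
        upper = i >= half
        is_hybrid = upper and (i - half) % group_size == 0
        layout.append(LayerSpec(use_swa=True, use_hsa=is_hybrid, moe_mlp=(i >= 1)))
    return layout

# Example: a 32-layer stack with upper-half groups of 4 layers (hypothetical sizes).
layout = build_layout(n_layers=32, group_size=4)
```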
At the midpoint (layer $N/2$), chunk hidden states are bi-encoded (with a prepended [CLS] token) to produce chunk embeddings and landmark vectors $\mathbf{l}_i$, which are reused as chunk representations in all HSA layers.
MLP blocks in layers 2 and higher are replaced by MoE blocks following the Ling-2.0 / DeepSeek-V3 protocol, with 64 experts per layer and 4 experts activated per token (top-4 gating); the expert dimension is halved to maintain the overall parameter budget (8B total, with only 1B active per token). A shared "global expert" operates across all tokens to stabilize training. Token-expert allocation is balanced with an auxiliary-loss-free strategy.
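A minimal sketch of this routing follows; the expert MLP structure, class name, and omission of the load-balancing machinery are simplifications for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    """64 routed experts, top-4 gating, plus one always-on shared 'global expert' (sketch)."""

    def __init__(self, d_model: int, d_expert: int, n_experts: int = 64, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Expert dimension is halved relative to a dense MLP to hold the parameter budget.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); gate over all experts, keep the top-4 per token.
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the selected 4

        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e                        # tokens whose slot-th choice is expert e
                routed[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        # Shared global expert processes every token (stabilizes training).
        return routed + self.shared(x)
```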
3. Training Regimen and Data Pipeline
Training proceeds in a multi-phase curriculum:
- Phase 1: Pretraining on 10T tokens (multi-domain, deduplicated)
- Web 50%, Code 14.4%, Math 12%, Code-NLP 5.6%, Reasoning 5%, Multilingual 4%, Books 2%, Wikipedia 1.5%, Others 5.5%
- Dense variant: 4T tokens, MoE variant: 8T tokens
- Phase 2: Long-text mid-training
- 175B tokens (sequence length 32K)
- Phase 3: Reasoning-focused training
- 400B tokens
Supervised fine-tuning (SFT) uses 8K context data per Wu et al. (2025).
Curricular schedule (summarized as a configuration sketch after the list):
- Warm-up: 16K windows, SWA=512, large HSA top-$k$; synthetic RULER retrieval tasks introduced until in-window retrieval is reliable.
- LM pretraining: 16K window, SWA=4K, reduced HSA top-$k$.
- Long-context adaptation: 32K window, increased HSA top-$k$ for improved long-range retrieval.
- Annealing: on high-quality data, 32K window.
- SFT: 8K window.
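The same schedule, collected as an illustrative configuration list; the field names are invented here, and entries the source only describes qualitatively (or not at all) are left as strings or `None` rather than guessed.

```python
# Illustrative curriculum configuration; HSA top-k is described only qualitatively
# in the source, and SWA widths beyond the first two phases are not specified.
CURRICULUM = [
    {"phase": "warm-up",      "window": 16_384, "swa": 512,   "hsa_top_k": "large",
     "data": "synthetic RULER retrieval tasks until in-window retrieval is reliable"},
    {"phase": "lm-pretrain",  "window": 16_384, "swa": 4_096, "hsa_top_k": "reduced",
     "data": "multi-domain pretraining corpus"},
    {"phase": "long-context", "window": 32_768, "swa": None,  "hsa_top_k": "increased",
     "data": "long-text mid-training"},
    {"phase": "annealing",    "window": 32_768, "swa": None,  "hsa_top_k": "increased",
     "data": "high-quality data"},
    {"phase": "sft",          "window": 8_192,  "swa": None,  "hsa_top_k": "increased",
     "data": "supervised fine-tuning per Wu et al. (2025)"},
]
```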
The optimizer is AdamW with weight decay 0.01, with learning rates and batch sizes set separately for the MoE and dense pretraining runs. The learning rate follows a three-stage schedule: linear warm-up, a constant plateau, and cosine decay (during SFT).
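For concreteness, a sketch of such a warm-up / constant / cosine schedule; the function and its arguments are hypothetical, and no specific learning-rate values or step counts from the paper are assumed.

```python
import math

def lr_at_step(step: int, peak_lr: float, warmup_steps: int,
               constant_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Linear warm-up -> constant plateau -> cosine decay (the decay phase applied during SFT)."""
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)          # linear warm-up
    if step < warmup_steps + constant_steps:
        return peak_lr                                        # constant plateau
    progress = (step - warmup_steps - constant_steps) / max(decay_steps, 1)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```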
4. Empirical Evaluation and Key Results
Performance is validated on an array of in-domain and ultra-long context benchmarks:
- In-domain (32K): MMLU, CMMLU, C-Eval, ARC-C, AGIEval, HellaSwag, PIQA, BBH, GSM8K, MATH, CMATH, MATH-500, OlympiadBench, HumanEval, HumanEval+, MBPP, MBPP+, CRUX-O, IFEval
- Ultra-long (up to 16M): Needle-in-a-Haystack variants, Variable Tracking in RULER
Selected Quantitative Results
| Model Variant | In-domain AVG (%) | Ultra-long MQ-NIAH @16M (%) | RULER Variable Tracking @16M (%) |
|---|---|---|---|
| HSA-UL-MoE 8B/1B | 63.1 | >95 | ~85 |
| TRM-MoE baseline | 56.6 | N/A | N/A |
| Qwen2.5/0.5B | 41.1 | N/A | N/A |
| Qwen3/0.6B | 48.4 | N/A | N/A |
| Dense-0.5B | N/A | 0 (@1M MQ-NIAH) | ~50 |
In small-scale experiments (0.5B-parameter models trained on PG19), SWA+HSA with a self-copy warm-up closes the gap in last-4K perplexity (16.5 vs. 16.8 for the full-attention baseline) and yields MQ-NIAH accuracy of 93% at both 64K and 1M tokens, compared to 5% and 0% for the standard BaseLM.
Ultra-long retrieval performance exceeds 90% on Single-NIAH and >95% on Multi-Query NIAH at 16M tokens, maintained at scale after mid-training and annealing. Variable Tracking shows HSA-UL-MoE sustaining ~85% at 16M tokens, while dense models drop to ~50%.
On computational efficiency, HSA kernels surpass FlashAttention-3 in speed for contexts of 32K tokens and beyond, though they lag at contexts below 8K.
5. Empirical Insights and Open Challenges
Empirical findings indicate that initial warm-up is critical to teaching HSA short-range retrieval prior to training SWA on large windows. A notable “seesaw” interaction arises: excessively wide SWA windows attenuate gradient signal for HSA retrieval, impeding extrapolation to longer contexts. Effective training demands early inclusion of long contiguous passages (>32K) in data.
Model scaling improves the reasoning+retrieval composite, with MoE-8B outperforming smaller dense models.
Open Problems
- Supervised fine-tuning can degrade extrapolation if SWA windows are overly large; this seesaw effect highlights a delicate balance in architectural scheduling.
- HSA’s current implementation demands a 16:1 query/KV head ratio, imposing bottlenecks; kernel-level optimization is required.
- At short contexts (<8K), custom HSA kernels are slower than FlashAttention-3, necessitating further engineering.
- Promising directions include adaptive chunk sizing, dynamic selection, and hybrid routing (combining learned and data-driven mechanisms) to advance scalable infinite-length memory.
6. Significance and Context
HSA-UltraLong integrates end-to-end sparse retrieval, a hierarchical attention structure, and large-scale MoE, addressing the computational and functional challenges of ultra-long context LLMs. The model’s approach—hierarchical chunk retrieval followed by intra-chunk attention and output fusion—effectively replaces quadratic attention overhead with a mechanism characterized by sparsity, differentiable random access, and capacity for length generalization.
This architecture provides a practical and principled solution to memory- and retrieval-centric AI, enabling LLMs to scale memory footprint and context reasoning to unprecedented lengths without sacrificing in-domain performance or computational tractability. Its open challenges suggest further avenues for scalable memory in deep learning, including optimization at both kernel and curriculum levels (Hu et al., 28 Nov 2025).