- The paper introduces HyLo, a hybrid architecture that upcycles pretrained Transformers by interleaving multi-head latent attention with efficient linear sequence modules.
- It leverages staged distillation and long-context fine-tuning to extend support up to 2M tokens while maintaining robust performance on established benchmarks.
- Empirical analysis reveals that HyLo models achieve superior long-context generalization and memory efficiency compared to pure Transformer models.
Long-Context Aware Upcycling: Hybrid LLM Scaling via HyLo
Motivation and Problem Setting
The rapid growth of both model size and context length in LLMs has made training and inference prohibitively expensive, especially as real-world applications demand ever-larger context windows. Recent hybrid architectures that combine Transformer-based attention with efficient sequence modules (e.g., state space models or linear attention) show promise for long-sequence modeling, but prior approaches mostly require expensive pretraining from scratch. Existing upcycling work converts pretrained Transformers into hybrid formats, but it predominantly focuses on short-context preservation and often underexplores systematic context-length extension and practical deployment challenges.
HyLo Architecture and Upcycling Methodology
HyLo ("HYbrid LOng-context") addresses these limitations through an upcycling-centric recipe that converts pretrained Transformer LLMs into hybrid long-context models without discarding learned representations. The architecture interleaves Multi-Head Latent Attention (MLA) layers with efficient linear sequence modules, either Mamba-2 (M2) or Gated DeltaNet (GDN). MLA layers retain attention-based expressivity while reducing KV-cache cost through low-rank projections, and the linear modules provide subquadratic memory and compute complexity in sequence length.
Importantly, HyLo generalizes across both Llama and Qwen backbones and supports both M2 and GDN as the linear block. The MLA-to-linear ratio flexibly tunes the speed-accuracy tradeoff and determines runtime KV-cache utilization.
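The effect of the MLA-to-linear ratio on cache usage can be illustrated with a small sketch. The interleaving rule below (one MLA layer after every few linear layers), the function names, and the 16-layer example are illustrative assumptions, not the layer placement reported for HyLo.

```python
def build_layer_pattern(num_layers: int, linear_per_attention: int,
                        linear_kind: str = "GDN") -> list[str]:
    """Return a layer-type list with one MLA layer after every
    `linear_per_attention` linear layers (hypothetical interleaving rule)."""
    if linear_kind not in ("M2", "GDN"):
        raise ValueError("linear_kind must be 'M2' or 'GDN'")
    period = linear_per_attention + 1
    return ["MLA" if (i + 1) % period == 0 else linear_kind
            for i in range(num_layers)]


def kv_cached_layers(pattern: list[str]) -> int:
    """Count layers whose cache grows with sequence length.

    Only MLA layers keep a per-token KV cache; the linear modules
    keep a fixed-size recurrent state instead.
    """
    return sum(1 for kind in pattern if kind == "MLA")


# Example: 16 layers with 3 linear layers per MLA layer (assumed values).
pattern = build_layer_pattern(num_layers=16, linear_per_attention=3)
print(pattern)
print(f"{kv_cached_layers(pattern)} of {len(pattern)} layers hold a per-token cache")
```

Raising `linear_per_attention` lowers the number of layers that cache per-token state, which is the lever the ratio provides; how accuracy changes along that axis is an empirical question the sketch does not address.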
The initialization strategy builds on prior SVD-based decompositions of attention weights for MLA and M2, with novel adaptations for GDN, so that modules can be replaced and parameters transferred directly from the Transformer teacher.
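As a rough illustration of what an SVD-based initialization can look like, consider factoring a pretrained projection matrix into a down-projection and an up-projection of rank r. The symbols W, r, W_up, and W_down are notation introduced here for illustration, and the even split of the singular values is one common convention; the exact decomposition used for HyLo may differ.

```latex
W = U \Sigma V^{\top} \;\approx\; U_r \Sigma_r V_r^{\top},
\qquad
W_{\mathrm{up}} = U_r \Sigma_r^{1/2},
\quad
W_{\mathrm{down}} = \Sigma_r^{1/2} V_r^{\top},
\qquad
W_{\mathrm{up}} W_{\mathrm{down}} = U_r \Sigma_r V_r^{\top}.
```

Here U_r, Σ_r, and V_r keep the r largest singular values and their singular vectors. By the Eckart-Young theorem, U_r Σ_r V_r^T is the best rank-r approximation of W in Frobenius norm, so the factored layer starts close to the teacher's projection. For a hidden state h, only the r-dimensional latent c = W_down h would need to be cached, which is the source of the KV-cache saving under this assumed setup.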
Long-Context Training and Distillation Pipeline
HyLo's training recipe is staged:
- Stage I: Enhanced Intermediate-Layer Distillation (Enhanced-ILD). Pure MLA, M2, or GDN models receive per-layer distillation, augmented to also align token-mixer outputs, which strengthens the transfer from the full-attention teacher to the token-level representations of the hybrid.
- Stage II: Long-Context Supervised Fine-Tuning. The hybrids assembled from Stage I are fine-tuned on progressively lengthened contexts (8K up to 64K tokens and beyond), with Kullback-Leibler (KL) distillation from the full-attention teacher. This step is critical for robust long-context generalization.
- Memory-Efficient Knowledge Distillation. To handle the severe memory bottleneck (e.g., storing logits at 64K+ contexts), a suite of optimizations is applied: chunked KL over the sequence, Triton-fused cross-entropy kernels, logit-free loss via hidden-state KL, activation checkpointing, mixed precision, and aggressive parameter sharding with FSDP. This enables a 32× context extension (2K→64K) with large teachers (up to 8B) on commodity multi-GPU setups.
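The chunked-KL idea from the list above can be sketched in plain Python. This is a toy under stated assumptions: `make_head`, `token_kl`, and `chunked_kl` are hypothetical names, the hidden states and LM heads are tiny random lists, and the KL direction (teacher to student) is assumed. A real implementation would run on GPU tensors and would also need to free chunk logits in the backward pass, for example through activation checkpointing; it does not cover the logit-free hidden-state variant.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def make_head(weight):
    """Toy LM head: maps a hidden vector to vocabulary logits."""
    def head(hidden):
        return [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    return head


def chunked_kl(teacher_hidden, student_hidden, teacher_head, student_head, chunk_size):
    """Mean per-token KL, building logits for only `chunk_size` positions at a time."""
    total, n = 0.0, len(teacher_hidden)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Logits exist only for this chunk and are dropped after it is reduced.
        t_logits = [teacher_head(h) for h in teacher_hidden[start:stop]]
        s_logits = [student_head(h) for h in student_hidden[start:stop]]
        total += sum(token_kl(t, s) for t, s in zip(t_logits, s_logits))
    return total / n


# Toy sizes (assumed): hidden size 4, vocabulary 6, sequence length 10.
random.seed(0)
hidden_dim, vocab, seq_len = 4, 6, 10
rand_matrix = lambda rows, cols: [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
teacher_head = make_head(rand_matrix(vocab, hidden_dim))
student_head = make_head(rand_matrix(vocab, hidden_dim))
teacher_hidden = rand_matrix(seq_len, hidden_dim)
student_hidden = rand_matrix(seq_len, hidden_dim)

full = chunked_kl(teacher_hidden, student_hidden, teacher_head, student_head, seq_len)
chunked = chunked_kl(teacher_hidden, student_hidden, teacher_head, student_head, 3)
print(math.isclose(full, chunked, rel_tol=1e-9))
```

The loss value does not depend on the chunk size; only the peak number of logits held at once does, which is why chunking helps when the full sequence-by-vocabulary logit matrix would not fit in memory.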
Deployment: vLLM Integration and Inference Efficiency
HyLo is integrated into the vLLM inference framework, with key runtime adaptations:
- Unified execution across attention (MLA) and linear (Mamba/GDN) layers, including dynamic per-layer cache management.
- Support for compressed latent KV caches and layer/embedding heterogeneity not previously available in vLLM serving.
- These changes result in practical serving for up to 2M-token context windows with a 90%+ reduction in KV cache memory footprint relative to pure Transformers.
Empirical results show that Llama-based baselines exhaust GPU memory above 64K tokens, while HyLo hybrids sustain both prefill and decode throughput, extending direct support to 2M-token contexts with minimal added latency.
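A back-of-the-envelope estimate shows how a KV-cache reduction of this size can arise. All numbers below are illustrative assumptions (a 16-layer model with 8 KV heads of dimension 64, 4 MLA layers in the hybrid, a 128-dimensional latent plus a 32-dimensional positional key, 2-byte values); they are not HyLo's reported configuration, and the fixed-size state of the linear layers is left out because it does not grow with sequence length.

```python
def full_attention_kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys and values in every layer: 2 * kv_heads * head_dim values per token per layer."""
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_value


def hybrid_kv_bytes(tokens, mla_layers, latent_dim, rope_dim, bytes_per_value=2):
    """Only MLA layers cache per-token state: one compressed latent plus a small
    positional key per token (assumed cache layout)."""
    return tokens * mla_layers * (latent_dim + rope_dim) * bytes_per_value


tokens = 2_000_000  # "2M" context, taken here as two million tokens

full = full_attention_kv_bytes(tokens, layers=16, kv_heads=8, head_dim=64)
hybrid = hybrid_kv_bytes(tokens, mla_layers=4, latent_dim=128, rope_dim=32)
reduction = 1 - hybrid / full

print(f"full attention: {full / 1e9:.1f} GB")
print(f"hybrid (MLA layers only): {hybrid / 1e9:.1f} GB")
print(f"reduction: {reduction:.1%}")
```

Two factors multiply in this estimate: fewer layers hold a per-token cache, and each of those layers stores a smaller vector per token. Either factor alone gives a smaller saving than the two combined.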
Experimental Analysis
Short- and Long-Context Metrics. Across Llama-3.2-1B/3B and Qwen3-1.7B models, HyLo variants maintain short-context performance competitive with pure-Transformer and leading upcycled baselines on benchmarks such as ARC, HellaSwag, and GSM8K. The drop in commonsense reasoning performance after longer-context training is minimal.
Long-context generalization is consistently stronger. On the RULER benchmark, HyLo achieves higher accuracy at 32K/64K context and sustains performance as length increases, substantially outperforming JetNemotron and Zebra-Llama at similar model scales. Notably, HyLo-Qwen-1.7B, trained on only 10B tokens, outperforms JetNemotron (trained on 400B tokens) on GSM8K and RULER-64K.
Ablation Studies.
- Enhanced-ILD Distillation: Provides stable gains on both commonsense and mathematical reasoning benchmarks compared to standard ILD.
- Long-context training (rather than only position interpolation) is necessary for performance retention at 64K+ tokens.
- Knowledge distillation from larger teachers further boosts long-context extrapolation, with efficacy scaling with teacher size.
- Architecture variants: NoPE and learnable gated attention, which are effective when applied during pretraining, do not carry their benefits over to the upcycled hybrid setting.
Implications and Future Prospects
HyLo demonstrates that model upcycling is not limited to short-context retention: it can be made explicitly long-context aware, combining cost-effectiveness with the scalability and efficiency needed for production-scale LLM deployments. The hybrid design offers flexible control over the compute/memory trade-off, and the staged training protocol extends the operating range to multi-million-token windows on commodity hardware. These methodological and systems advances suggest the following implications:
- In research, explicit long-context evaluation and optimization should become standard for LLM upcycling frameworks. The benefit of staged long-context-aware distillation over simple contrastive/pseudoperplexity metrics is clear.
- In practice, the ability to serve multi-million-token contexts without OOM is critical for document understanding, multi-hop retrieval, and reasoning at web scale, enabling new application classes previously unattainable with pure Transformers.
- Architectural co-design between attention, SSM, and other efficient sequence modules will likely see continued development. The hybrid framework is amenable to future inclusion of retrieval-augmented modules, dynamic layer selection, and adaptive memory usage.
- Distillation efficiency—especially for extremely long contexts—remains a bottleneck. Solutions incorporating teacher sampling, further logit-free training, or highly parallelized memory management will be active areas of exploration.
Conclusion
HyLo presents a modular, memory-efficient upcycling paradigm for transforming pretrained Transformer LLMs into high-performing, scalable hybrids with explicit long-context awareness. The recipe brings together careful architecture design (MLA + Mamba2/GDN), initialization, staged distillation, and engineering solutions for long-sequence inference. These contributions collectively enable LLM serving and reasoning capabilities at scale and with context lengths unachievable by standard Transformer architectures, narrowing the gap between research LLMs and production-grade, long-context deployments (2604.24715).