Efficiently Scaling Dynamic Sparse Attention in Distributed Training
Determine algorithmic and system-level methods to scale dynamic sparse attention efficiently in distributed training of Transformer-based large language models.
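In this setting, "dynamic sparse attention" generally means that each query attends only to an input-dependent subset of keys (for example, the top-k key blocks ranked by a cheap relevance estimate) rather than to a fixed sparsity pattern. The sketch below is a minimal single-device illustration of that idea; the block size, the pooling-based scoring rule, and the function name are illustrative assumptions, not the method of the cited paper, and causal masking as well as all distributed concerns (sharding keys/values across devices, balancing uneven block loads, overlapping communication with compute) are omitted.

```python
# Minimal sketch of dynamic (content-dependent) block-sparse attention.
# Illustrative assumptions, not the method of the cited paper: relevance is
# estimated from mean-pooled key blocks, and each query block keeps only its
# top-k key blocks. Causal masking and all distributed concerns are omitted.
import torch
import torch.nn.functional as F


def dynamic_block_sparse_attention(q, k, v, block_size=64, topk_blocks=4):
    """q, k, v: [batch, heads, seq_len, head_dim]; seq_len divisible by block_size."""
    b, h, n, d = q.shape
    nb = n // block_size                                          # number of blocks

    # Cheap relevance estimate: mean-pool each query/key block, score block pairs.
    q_blk = q.reshape(b, h, nb, block_size, d).mean(dim=3)        # [b, h, nb, d]
    k_blk = k.reshape(b, h, nb, block_size, d).mean(dim=3)        # [b, h, nb, d]
    blk_scores = torch.einsum("bhid,bhjd->bhij", q_blk, k_blk)    # [b, h, nb, nb]

    # Dynamic sparsity: the kept key blocks depend on the input, not on a fixed pattern.
    top_idx = blk_scores.topk(topk_blocks, dim=-1).indices        # [b, h, nb, topk]

    kv = k.reshape(b, h, nb, block_size, d)
    vv = v.reshape(b, h, nb, block_size, d)
    out = torch.zeros_like(q)
    for qi in range(nb):                                          # one query block at a time
        idx = top_idx[:, :, qi]                                   # [b, h, topk]
        gather = idx[..., None, None].expand(-1, -1, -1, block_size, d)
        k_sel = kv.gather(2, gather).reshape(b, h, -1, d)         # [b, h, topk*block, d]
        v_sel = vv.gather(2, gather).reshape(b, h, -1, d)

        q_i = q[:, :, qi * block_size:(qi + 1) * block_size]      # [b, h, block, d]
        attn = torch.einsum("bhqd,bhkd->bhqk", q_i, k_sel) / d ** 0.5
        out[:, :, qi * block_size:(qi + 1) * block_size] = torch.einsum(
            "bhqk,bhkd->bhqd", F.softmax(attn, dim=-1), v_sel)
    return out


if __name__ == "__main__":
    q, k, v = (torch.randn(1, 2, 512, 32) for _ in range(3))
    print(dynamic_block_sparse_attention(q, k, v).shape)          # torch.Size([1, 2, 512, 32])
```

Because the selected blocks differ per query block and per input, the work assigned to each device changes from step to step; this input-dependent imbalance, together with the communication needed to fetch remote key/value blocks, is what makes scaling such attention in distributed training difficult.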
References
However, scaling dynamic sparse attention efficiently in distributed training remains an open problem.
— MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
(arXiv:2510.18830, Li et al., 21 Oct 2025), in Related Work: Efficiency Enhancement for Long-Context LLMs