
Efficiently Scaling Dynamic Sparse Attention in Distributed Training

Determine algorithmic and system-level methods to scale dynamic sparse attention efficiently in distributed training of Transformer-based large language models.


Background

Dynamic sparse attention has demonstrated substantial efficiency gains for long-context processing, especially at inference time, and recent work has begun to explore its use during pretraining. However, when moving to distributed training, particularly with context parallelism and ring-attention-style communication, practical obstacles such as worker imbalance and communication overhead make efficient scaling difficult to achieve.
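
To make the worker-imbalance issue concrete, the following is a minimal, illustrative sketch rather than the paper's MTraining implementation: it simulates a dynamic block-sparse causal attention mask, shards the query blocks contiguously across context-parallel workers, and measures how unevenly the active attention blocks land on each worker. The function name, the uniform sparsity model, and the naive contiguous sharding are assumptions introduced purely for illustration.

```python
import numpy as np

def per_worker_block_counts(num_blocks, num_workers, keep_prob=0.1, seed=0):
    """Count active (query-block, key-block) pairs per worker under a naive
    contiguous sharding of query blocks (hypothetical setup for illustration)."""
    rng = np.random.default_rng(seed)
    # Stand-in for a dynamic sparsity pattern: each query block attends to a
    # random subset of earlier key blocks, plus itself (causal, block-sparse).
    mask = np.tril(rng.random((num_blocks, num_blocks)) < keep_prob)
    np.fill_diagonal(mask, True)
    shard = num_blocks // num_workers
    return [int(mask[w * shard:(w + 1) * shard].sum()) for w in range(num_workers)]

counts = per_worker_block_counts(num_blocks=512, num_workers=8)
print("active blocks per worker:", counts)
print("imbalance (max / mean): %.2f" % (max(counts) * len(counts) / sum(counts)))
```

Even with a uniform sparsity rate, the causal structure alone gives later shards far more active blocks than earlier ones, so a ring-attention schedule that assumes equal per-worker work leaves some workers idle while others lag; dynamic, content-dependent sparsity makes this imbalance harder to predict ahead of time.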

The paper positions MTraining as a co-designed algorithm–system approach to address these issues, but explicitly notes that efficiently scaling dynamic sparse attention in distributed training remains unresolved in the literature, motivating the presented techniques and empirical study.

References

However, scaling dynamic sparse attention efficiently in distributed training remains an open problem.

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training (arXiv:2510.18830, Li et al., 21 Oct 2025), in Related Work, "Efficiency Enhancement for Long-Context LLMs"