Combine Sequence Parallelism with Expert Parallelism for Sparse (MoE) Models

Develop an approach that integrates Ulysses Sequence Parallelism (SP) with Expert Parallelism (EP) for inference on sparse mixture-of-experts (MoE) transformer models, in order to further improve the performance and scalability of these sparse models.
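To make the target concrete, the sketch below simulates on a single device the data movement that combining Ulysses SP with EP would involve in one MoE transformer layer: an all-to-all that trades sequence shards for head shards around attention, followed by token-to-expert routing whose dispatch would itself be an all-to-all under EP. This is a minimal illustration under assumptions; the variable names, tensor shapes, and one-expert-per-rank layout are not the paper's design or any library's API.

```python
# Hypothetical single-process simulation of Ulysses SP composed with EP.
# All names and shapes are illustrative assumptions.
import torch

sp = 4                          # sequence-parallel degree (reused as the EP degree here)
seq, heads, head_dim = 32, 8, 16
hidden = heads * head_dim       # 128
num_experts = sp                # assume one expert per simulated rank

# Each simulated rank r holds a sequence shard: x[r] has shape [seq // sp, hidden].
x = torch.randn(sp, seq // sp, hidden)

# --- Ulysses SP around attention --------------------------------------------
# All-to-all #1: trade sequence shards for head shards, so every rank sees the
# full sequence but only heads // sp of the attention heads.
x5 = x.view(sp, seq // sp, sp, heads // sp, head_dim)       # split hidden into head groups
x_heads = x5.permute(2, 0, 1, 3, 4).reshape(sp, seq, heads // sp, head_dim)

# (full-sequence attention over the local heads would run here)

# All-to-all #2: trade head shards back for sequence shards.
x_back = (x_heads.view(sp, sp, seq // sp, heads // sp, head_dim)
                 .permute(1, 2, 0, 3, 4)
                 .reshape(sp, seq // sp, hidden))
assert torch.allclose(x_back, x)    # the round trip is lossless

# --- EP for the MoE FFN -------------------------------------------------------
# Route each token of the local shard to an expert; under EP, an all-to-all
# would dispatch tokens to the rank owning that expert and combine the results.
router_logits = torch.randn(sp, seq // sp, num_experts)
expert_id = router_logits.argmax(dim=-1)                     # [rank, local tokens]
for owner in range(sp):
    tokens_for_owner = x_back[expert_id == owner]            # tokens bound for rank `owner`
    # real system: all-to-all dispatch -> expert FFN on `owner` -> all-to-all combine

print(x_back.shape)  # sequence shards restored: torch.Size([4, 8, 128])
```

Part of what the future-work item leaves open is how these two all-to-all exchanges should be scheduled and whether SP and EP share one communication group or use orthogonal groups.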

Background

The paper introduces Shift Parallelism, which dynamically switches between Tensor Parallelism (TP) and Sequence Parallelism (SP) to achieve low latency and high throughput for dynamic inference workloads. While the work generalizes SP for inference and demonstrates benefits on dense and some sparse (MoE) models, it highlights that sparse models present additional opportunities when considering expert parallelism (EP).
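As a point of reference for the kind of dynamic switching involved, the following is a minimal sketch of a TP-versus-SP decision driven by how many tokens are in flight; the threshold, names, and return type are illustrative assumptions rather than the paper's actual policy.

```python
# Illustrative switching heuristic; the token threshold and mode names are
# assumptions for illustration, not the paper's policy or API.
from dataclasses import dataclass

@dataclass
class ParallelismDecision:
    mode: str     # "TP" or "SP"
    degree: int   # number of GPUs the chosen dimension is sharded across

def choose_parallelism(num_running_tokens: int, num_gpus: int,
                       sp_threshold_tokens: int = 4096) -> ParallelismDecision:
    """Pick TP for small, latency-sensitive batches and SP once enough tokens
    are in flight to keep all GPUs busy along the sequence dimension."""
    if num_running_tokens >= sp_threshold_tokens:
        return ParallelismDecision(mode="SP", degree=num_gpus)   # throughput-oriented
    return ParallelismDecision(mode="TP", degree=num_gpus)       # latency-oriented

# Example: a small interactive batch stays on TP; a large prefill burst shifts to SP.
print(choose_parallelism(num_running_tokens=256, num_gpus=8))
print(choose_parallelism(num_running_tokens=16384, num_gpus=8))
```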

In the Limitations and Future Work section, the authors explicitly note the absence of prior work that combines SP with EP and state that this integration is left for future work, identifying it as a concrete open direction to further optimize sparse models.

References

Specifically, there is no prior work that combines SP with EP to further optimize sparse models, which we will leave as future work.

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads (Hidayetoglu et al., 20 Sep 2025, arXiv:2509.16495), Limitations and Future Work.