Experimental validation of scaling PP/TP for very large models and CP for very long sequences

Experimentally validate the expected benefits of further increasing pipeline parallelism (PP) or tensor parallelism (TP) for extremely large models, and of increasing context parallelism (CP) for very long sequences, quantifying the impact of each on Model FLOPs Utilization (MFU), throughput, and memory usage in large-scale settings.
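
Quantifying MFU in such experiments typically reduces to comparing achieved training FLOPs against aggregate hardware peak. The sketch below is a minimal illustration of that calculation, assuming the common 6N-per-token approximation plus a 12·L·h·s self-attention term (forward and backward); it is not the paper's measurement code, and the constants in the usage example (GPU peak, throughput, model shape) are hypothetical.

```python
def model_flops_utilization(
    tokens_per_second: float,   # measured end-to-end training throughput
    n_params: float,            # total parameter count N
    seq_len: int,               # training sequence length s
    n_layers: int,              # number of layers L
    hidden_size: int,           # hidden dimension h
    n_gpus: int,
    peak_flops_per_gpu: float,  # dense peak, e.g. ~989e12 for H100 BF16
) -> float:
    """Rough MFU: achieved model FLOPs per second / aggregate peak FLOPs.

    Uses the common approximation of 6*N training FLOPs per token for the
    dense parameters plus 12*L*h*s for self-attention (forward + backward).
    Mamba-style layers would need a different per-token FLOP count.
    """
    flops_per_token = 6 * n_params + 12 * n_layers * hidden_size * seq_len
    achieved_flops = tokens_per_second * flops_per_token
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


# Hypothetical example: a 7B model trained at 520k tokens/s on 64 GPUs.
mfu = model_flops_utilization(520_000, 7e9, 4096, 32, 4096, 64, 989e12)
print(f"MFU ~ {mfu:.2%}")  # ~40%
```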

Background

The empirical study identifies the best- and worst-performing hybrid parallelism strategies for LLaMA and Mamba at the 1B and 7B scales. The authors suggest that larger degrees of model parallelism (PP/TP) should help for extremely large models, and that higher context parallelism (CP) should help for very long sequences. However, they explicitly state that, due to resource limitations, these scenarios have not been experimentally validated and are left for future work.
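
The intuition behind these expectations can be illustrated with a back-of-envelope per-GPU memory estimate under Megatron-style sharding, where model states divide across TP×PP ranks and activations divide across TP (hidden dimension), PP (layers per stage), and CP (sequence dimension). The sketch below is an assumption-laden approximation, not the paper's model: the 18 bytes/parameter and 34 bytes/token/hidden-unit constants are rough mixed-precision defaults, and it ignores data-parallel/ZeRO sharding, pipeline warm-up buffers, attention-score memory, and fragmentation.

```python
def per_gpu_memory_gib(
    n_params: float,
    micro_batch: int,
    seq_len: int,
    hidden_size: int,
    n_layers: int,
    tp: int,
    pp: int,
    cp: int,
    bytes_per_param: float = 18.0,        # BF16 weights + grads + FP32 Adam states
    act_bytes_per_token_unit: float = 34.0,  # rough full-activation cost in BF16
) -> float:
    """Back-of-envelope per-GPU memory under a TP/PP/CP hybrid.

    Model states shard across tp*pp ranks; activations shard across tp
    (hidden dim), pp (layers per stage), and cp (sequence dim). Indicative
    only: communication buffers and ZeRO-style sharding are not modeled.
    """
    model_states = n_params * bytes_per_param / (tp * pp)
    local_layers = n_layers / pp
    local_tokens = micro_batch * seq_len / cp
    activations = (
        local_tokens * hidden_size * local_layers * act_bytes_per_token_unit / tp
    )
    return (model_states + activations) / 2**30


# Very large model, moderate context: model states dominate, so TP/PP helps.
print(per_gpu_memory_gib(70e9, 1, 8_192, 8_192, 80, tp=8, pp=8, cp=1))
# Long context: per-GPU activations shrink roughly linearly with cp.
print(per_gpu_memory_gib(7e9, 1, 262_144, 4_096, 32, tp=4, pp=1, cp=16))
```

Under these assumptions, the first configuration is bounded mainly by sharded model states (favoring deeper TP/PP), while the second would be dominated by activation memory without CP, which is exactly the regime the proposed experiments would need to validate empirically.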

References

Further increases in the degree of model parallelism (PP/TP) are expected to benefit extremely large models. Increasing context parallelism (CP) is expected to be advantageous for very long sequences; however, due to resource limitations, experimental validation of these scenarios is left for future work.

Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide (2602.09109 - Amer et al., 9 Feb 2026) in Section 6.4, Summary of Empirical Insights (end of section)