Papers
Topics
Authors
Recent
2000 character limit reached

ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training (2510.00606v1)

Published 1 Oct 2025 in cs.DC

Abstract: Large-scale LLM pretraining today spans $10{5}$--$10{6}$ accelerators, making failures commonplace and elasticity no longer optional. We posit that an elastic-native training system must simultaneously ensure (i) Parameter Consistency, (ii) low Mean Time to Recovery (MTTR), (iii) high post-change Throughput, and (iv) Computation Consistency. This objective set not has never been jointly attained by prior work. To achieve these goals, we present ElasWave, which provides per-step fault tolerance via multi-dimensional scheduling across Graph, Dataflow, Frequency, and Random Number Generation. ElasWave resizes and reshards micro-batch workloads while preserving the global batch size and gradient scale; it performs online pipeline resharding with asynchronous parameter migration, interleaving ZeRO partitions so recovery reduces to disjoint rank-to-rank transfers. It further uses DVFS to absorb pipeline bubbles and reshards RNG to keep consistent computations. A dynamic communicator enables in-place communication group edits, while per-step in-memory snapshots support online verification and redistribution. We evaluated ElasWave on 96 NPUs and benchmarked against state-of-the-art baselines: throughput improves by $1.35\times$ over ReCycle and $1.60\times$ over TorchFT; communicator recovery completes within one second (up to $82\times/3.6\times$ faster than full/partial rebuilds); migration MTTR drops by as much as $51\%$; and convergence deviation is reduced by approximately $78\%$.

Summary

We haven't generated a summary for this paper yet.

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.