Consistent timeout coordination in dynamic, fault-tolerant distributed training
Develop a principled, system-wide timeout coordination mechanism for the prime framework’s dynamic distributed training, which relies on parallel TCP stores and replica groups. The mechanism should keep timeout behavior consistent across all process groups and components during partial node failures and dynamic world-size changes, preventing desynchronization and undefined behavior.
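To make "consistent timeout behavior" concrete, the following is a minimal sketch of one possible direction, not the prime framework's implementation: every component timeout is derived from a single base value, so that inner operations always time out before the outer operations that contain them. The names `TimeoutPlan`, `derive_timeouts`, and `is_consistent`, the four layers, the strict ordering between them, and the default safety factor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPlan:
    """Hypothetical per-layer timeouts in seconds (illustrative, not from prime)."""
    store_op: float     # one key-value store get/set
    heartbeat: float    # silence window after which a peer is declared dead
    collective: float   # one collective operation inside a process group
    reconfigure: float  # rebuilding process groups after a world-size change


def derive_timeouts(base_rtt_s: float, safety_factor: float = 4.0) -> TimeoutPlan:
    """Derive all timeouts from one measured base latency.

    Assumption: each layer waits `safety_factor` times longer than the layer
    it contains, so an inner timeout always fires before an outer one.
    """
    if base_rtt_s <= 0:
        raise ValueError("base_rtt_s must be positive")
    if safety_factor <= 1:
        raise ValueError("safety_factor must be greater than 1")
    store_op = base_rtt_s * safety_factor
    heartbeat = store_op * safety_factor
    collective = heartbeat * safety_factor
    reconfigure = collective * safety_factor
    return TimeoutPlan(store_op, heartbeat, collective, reconfigure)


def is_consistent(plan: TimeoutPlan) -> bool:
    """Check the assumed nesting invariant: inner timeouts are strictly shorter."""
    return 0 < plan.store_op < plan.heartbeat < plan.collective < plan.reconfigure


plan = derive_timeouts(base_rtt_s=0.5)
print(plan, is_consistent(plan))
```

Deriving every value from one input means a change in observed latency updates all layers together, and `is_consistent` can flag hand-tuned values that break the assumed ordering. Whether such an ordering is sufficient under partial failures and dynamic world-size changes is exactly what this problem leaves open.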
References
Current timeout values are set empirically based on observed network latencies and failure detection windows, but maintaining consistent timeout behavior across all components remains an open challenge.
— INTELLECT-1 Technical Report
(2412.01152 - Jaghouar et al., 2 Dec 2024) in Subsubsection 'Retries, Timeouts and Edge Cases' under 'Fault Tolerance and Dynamic Node Management' in Section 2 'Prime Framework: Enabling Scalable Decentralized Training'