Consistent timeout coordination in dynamic, fault-tolerant distributed training

Develop a principled, system-wide timeout coordination mechanism for the prime framework's dynamic distributed training, which uses parallel TCP stores and replica groups. The mechanism should ensure consistent timeout behavior across all process groups and components during partial node failures and dynamic world-size changes, preventing desynchronization and undefined behavior.

Background

The prime framework supports fault-tolerant data-parallel training over unreliable internet connections, with nodes joining and leaving dynamically. To reduce congestion, the system employs multiple parallel TCP stores, each serving a replica group, but this architectural choice complicates global synchronization.

When failures affect only a subset of replica groups, processes on the same physical node can become desynchronized, leading to misaligned timeout windows and potentially undefined behavior. The authors currently set timeout values empirically and identify maintaining consistent timeout behavior across components as an unresolved issue.
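The misalignment described above can be illustrated with a small sketch. It is not the prime framework's implementation: `TimeoutPolicy`, the scaling factors, and both helper functions are hypothetical names and values chosen for illustration. The sketch shows two ideas that one possible coordination scheme might combine: deriving every component timeout from a single base value so that the timeouts stay ordered, and replacing per-group relative timers with one absolute deadline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    """Hypothetical policy object; not part of the prime framework's API."""

    base_s: float  # assumed budget for one TCP store operation, in seconds

    @property
    def store_op_s(self) -> float:
        return self.base_s

    @property
    def collective_s(self) -> float:
        # Assumed factor: a collective may wait on several store operations.
        return 4 * self.base_s

    @property
    def failure_detection_s(self) -> float:
        # Assumed factor: declare a peer failed only after a collective
        # has had time to hit its own timeout.
        return 2 * self.collective_s


def independent_expiries(start_times_s, timeout_s):
    """Each replica group starts its own relative timer when it begins waiting."""
    return [start + timeout_s for start in start_times_s]


def shared_expiries(start_times_s, timeout_s):
    """All replica groups adopt one absolute deadline, set by the earliest waiter."""
    deadline = min(start_times_s) + timeout_s
    return [deadline for _ in start_times_s]


policy = TimeoutPolicy(base_s=2.5)

# Two replica groups on one node; the second starts waiting 3 s late
# because a failure in its group forced a retry.
starts = [0.0, 3.0]
independent = independent_expiries(starts, policy.collective_s)
shared = shared_expiries(starts, policy.collective_s)

print(independent)  # [10.0, 13.0]: timeout windows are misaligned by 3 s
print(shared)       # [10.0, 10.0]: both groups give up at the same moment
```

The shared deadline assumes that processes on the same physical node can read a common clock. Extending the idea across nodes would require either bounded clock skew or a deadline negotiated through one of the stores, and the sketch does not address either.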

References

Current timeout values are set empirically based on observed network latencies and failure detection windows, but maintaining consistent timeout behavior across all components remains an open challenge.

— INTELLECT-1 Technical Report (2412.01152 - Jaghouar et al., 2 Dec 2024) in Subsubsection 'Retries, Timeouts and Edge Cases' under 'Fault Tolerance and Dynamic Node Management' in Section 2 'Prime Framework: Enabling Scalable Decentralized Training'
