
FlexDeMo: Decoupled Momentum Optimization for Hybrid Sharded Data Parallel Training (2502.06728v2)

Published 10 Feb 2025 in cs.LG and cs.AI

Abstract: Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to exchange only the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, when considering larger models that do not fit on a single accelerator, the exchange of gradient information and the integration of DeMo need to be reconsidered. Here, we propose employing a hybrid sharded data parallel training strategy, FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication bandwidth requirements are reduced by synchronizing only fast-moving components instead of the full gradients. This effectively combines previous hybrid sharded strategies with the advantages of decoupled momentum. Our experimental results show that FlexDeMo is on par with hybrid sharded data parallel training employing AdamW and full gradient synchronization in terms of validation loss, demonstrating its viability. Furthermore, FlexDeMo achieves improved training speed compared to full gradient synchronization across nodes. In a bandwidth-constrained 2-node setup, FlexDeMo reaches desired levels of validation loss faster than hybrid sharded data parallel training with full gradient synchronization.
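
To make the mechanism concrete, below is a minimal, single-tensor sketch of the decoupled-momentum idea the abstract describes. It is not the authors' implementation: the top-k magnitude selection stands in for DeMo's fast-component extraction, the helper name demo_style_step and all hyperparameters are illustrative, and the inter-node exchange that FlexDeMo restricts to the fast components is only indicated in a comment.

import torch

def demo_style_step(param, grad, momentum, lr=1e-3, beta=0.9, k_frac=0.01):
    """Locally accumulate momentum, split off its 'fast' part for inter-node
    synchronization, and keep the slow residual on the local accelerator."""
    # Accumulate the gradient into the locally kept momentum buffer.
    momentum.mul_(beta).add_(grad)

    # Pick the "fast-moving" components. Here: the largest-magnitude momentum
    # entries, a simplification of the extraction used by DeMo.
    flat = momentum.view(-1)
    k = max(1, int(k_frac * flat.numel()))
    idx = flat.abs().topk(k).indices
    fast = torch.zeros_like(flat)
    fast[idx] = flat[idx]

    # Decouple: remove the transmitted part so only the slow residual keeps
    # accumulating locally and never crosses the inter-node link.
    flat[idx] = 0.0

    # In a multi-node run, `fast` is what would be synchronized across nodes,
    # e.g. torch.distributed.all_reduce(fast, group=inter_node_group) followed
    # by division by the number of nodes (assumed distributed setup).
    param.add_(fast.view_as(param), alpha=-lr)
    return fast.view_as(param)

# Toy usage on a single parameter tensor:
p = torch.randn(1024)
g = torch.randn(1024)
m = torch.zeros_like(p)
demo_style_step(p, g, m)

In an actual FlexDeMo run, parameters would additionally be fully sharded within each node (hybrid sharded data parallelism), so this sketch covers only the inter-node synchronization side of the method.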
