MuLoCo: Muon is a practical inner optimizer for DiLoCo (2505.23725v1)

Published 29 May 2025 in cs.LG

Abstract: DiLoCo is a powerful framework for training LLMs under networking constraints with advantages for increasing parallelism and accelerator utilization in data center settings. Despite significantly reducing communication frequency, however, DiLoCo's communication steps still involve all-reducing a complete copy of the model's parameters. While existing works have explored ways to reduce communication in DiLoCo, the role of error feedback accumulators and the effect of the inner-optimizer on compressibility remain under-explored. In this work, we investigate the effectiveness of standard compression methods including Top-k sparsification and quantization for reducing the communication overhead of DiLoCo when paired with two local optimizers (AdamW and Muon). Our experiments pre-training decoder-only transformer language models (LMs) reveal that leveraging Muon as the inner optimizer for DiLoCo along with an error-feedback accumulator allows to aggressively compress the communicated delta to 2-bits with next to no performance degradation. Crucially, MuLoCo (Muon inner optimizer DiLoCo) significantly outperforms DiLoCo while communicating 8X less and having identical memory complexity.

This paper, "MuLoCo: Muon is a practical inner optimizer for DiLoCo" (Thérien et al., 29 May 2025 ), investigates methods to reduce the significant communication overhead that remains in the DiLoCo framework for training LLMs. DiLoCo [douillard2023diloco] is a communication-efficient distributed training algorithm that allows workers to perform multiple local optimization steps using an "inner optimizer" before synchronizing their model updates via an "outer optimizer". While this reduces communication frequency compared to standard data-parallel training (where communication happens every step), the data size communicated remains large (a full copy of model parameters or their delta).

The authors explore reducing the size of the communicated data in DiLoCo by applying standard compression techniques: Top-k sparsification and quantization. A key focus is the interaction between the choice of the inner optimizer and the effectiveness of compression, especially when combined with error feedback [karimireddy2019error]. Error feedback is a technique where the information lost during compression is accumulated locally on each worker and added back to the next update before compression, which helps maintain convergence quality.
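
For concreteness, here is a minimal sketch of such an error-feedback wrapper; the function and argument names are placeholders and the compressor is left abstract, so this is illustrative rather than the paper's code:

```python
def compress_with_error_feedback(delta, error_acc, compress_fn):
    """Generic error-feedback wrapper (illustrative sketch; names are placeholders).

    delta       : local parameter delta a worker wants to communicate
    error_acc   : per-worker buffer holding compression error from earlier rounds
    compress_fn : any lossy compressor, e.g. quantization or top-k sparsification
    """
    corrected = delta + error_acc         # add back information lost previously
    compressed = compress_fn(corrected)   # lossy compression of the corrected delta
    new_error = corrected - compressed    # what the compressor dropped this round
    return compressed, new_error
```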

The paper proposes MuLoCo, which adapts the DiLoCo framework by using Muon [jordan2024muon] as the inner optimizer instead of the standard AdamW. Muon is a recently proposed optimizer that has shown competitive or superior performance compared to AdamW in LLM pre-training. The hypothesis is that the different structure of Muon's updates, which involve orthogonalization, might make them more amenable to compression.
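
To illustrate what "orthogonalization" means here, the sketch below applies the classical cubic Newton-Schulz iteration to a 2D update matrix; the public Muon implementation uses a tuned higher-order polynomial and additional details (momentum handling, per-layer treatment), so treat this only as a rough sketch of the idea:

```python
import torch

def newton_schulz_orthogonalize(update, steps=5, eps=1e-7):
    """Drive the singular values of a 2D update matrix toward 1 (illustrative sketch)."""
    x = update / (update.norm() + eps)     # scale so every singular value is <= 1
    for _ in range(steps):
        # applies f(s) = 1.5*s - 0.5*s**3 to each singular value of x
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x
```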

The proposed MuLoCo algorithm follows the general structure of DiLoCo (a minimal code sketch follows the list):

  1. Workers perform H local optimization steps using the Muon inner optimizer.
  2. Each worker computes the delta (difference) between the model parameters before and after the local steps.
  3. (Modification) If error feedback is enabled, the previous error accumulator is added to the delta.
  4. (Modification) The (potentially error-corrected) delta is compressed using quantization or Top-k sparsification.
  5. (Modification) If error feedback is enabled, the difference between the original and compressed delta is stored in the error accumulator.
  6. The compressed deltas are all-reduced across workers (e.g., averaged).
  7. The outer optimizer (SGD with Nesterov momentum, following standard DiLoCo practice) updates the global model parameters based on the averaged compressed delta.

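The loop below is a compact, single-process sketch of one such round, assuming flattened parameter tensors; the worker attributes (params, error_acc, local_step), the simulated all-reduce, and the pseudo-gradient trick for the outer step are illustrative choices, not the authors' implementation. Here outer_opt would be torch.optim.SGD([global_params], lr=..., momentum=..., nesterov=True), following standard DiLoCo practice.

```python
import torch

def muloco_round(global_params, workers, outer_opt, compress_fn, H=30):
    """One MuLoCo communication round (illustrative sketch, not the paper's code)."""
    compressed_deltas = []
    for w in workers:
        w.params.data.copy_(global_params.data)         # start from the global model
        for _ in range(H):                               # 1. H local Muon steps
            w.local_step()
        delta = w.params.data - global_params.data       # 2. local parameter delta
        corrected = delta + w.error_acc                   # 3. add error-feedback accumulator
        compressed = compress_fn(corrected)               # 4. quantize or top-k sparsify
        w.error_acc = corrected - compressed              # 5. store the compression error
        compressed_deltas.append(compressed)

    avg_delta = torch.stack(compressed_deltas).mean(dim=0)  # 6. all-reduce (simulated as a mean)

    # 7. outer SGD-with-Nesterov step, treating the negative averaged delta as a pseudo-gradient
    outer_opt.zero_grad()
    global_params.grad = -avg_delta
    outer_opt.step()
```
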
The empirical evaluation is conducted by pre-training a 220M parameter transformer LLM on the FineWeb-EDU dataset [lozhkov2024fineweb-edu] using 8 workers and 30 local steps per communication round. The paper compares:

  • Standard DiLoCo (AdamW inner optimizer) vs. MuLoCo (Muon inner optimizer).
  • Variants with and without error feedback.
  • Different compression levels (varying Top-k percentage; 2-bit, 4-bit, and 8-bit quantization).
  • Comparison against communication-heavy data-parallel training baselines using AdamW and Muon.

Key Findings and Practical Implications:

  1. Inner Optimizer Choice Matters: Without compression, MuLoCo converges faster and to a lower loss than standard DiLoCo, matching the performance of the communication-heavy Muon data-parallel baseline (while communicating H=30 times less). This suggests Muon is a highly effective inner optimizer for DiLoCo.
  2. Error Feedback is Crucial for Compression: Error feedback consistently improves the performance of both MuLoCo and DiLoCo when compression is applied. Without error feedback, performance degrades significantly with increasing compression.
  3. Muon Enables Aggressive Quantization: MuLoCo demonstrates superior resilience to aggressive quantization compared to DiLoCo. Specifically, using 2-bit quantization combined with error feedback in MuLoCo achieves performance comparable to the uncompressed (16-bit float) baseline. In contrast, DiLoCo experiences a notable performance degradation at 2-bit quantization even with error feedback.
  4. Significant Communication Savings: The most compelling result is that MuLoCo with 2-bit quantization and error feedback achieves a lower final loss than standard (uncompressed) AdamW-DiLoCo while communicating 8 times less data per communication step (a back-of-envelope calculation follows this list).
  5. Memory Efficiency: MuLoCo with error feedback has memory complexity identical to standard AdamW DiLoCo without error feedback: Muon's single momentum buffer plus the error-feedback accumulator amounts to roughly 2x the parameter count per worker, matching AdamW's two moment buffers. Adding error feedback to AdamW-DiLoCo, by contrast, raises its overhead to about 3x the parameter count (Table 1).

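To see where the 8x figure in finding 4 comes from, a back-of-envelope calculation, assuming the uncompressed delta is sent as 16-bit floats (the baseline above):

```python
# Communication volume per worker per round for the 220M-parameter model.
n_params = 220e6
uncompressed_mb = 16 * n_params / 8 / 1e6   # ~440 MB with 16-bit floats
quantized_mb = 2 * n_params / 8 / 1e6       # ~55 MB with 2-bit quantization
print(uncompressed_mb / quantized_mb)       # -> 8.0, the reported 8x reduction
```
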
Implementation Considerations:

  • Optimizer Implementation: Requires implementing or integrating Muon, which differs from standard AdamW. Muon updates involve an orthogonalization step (e.g., via the Newton-Schulz iteration), which may have different computational characteristics than AdamW's element-wise operations.
  • Compression Implementation: Quantization requires efficiently mapping delta entries to a small set of discrete values and encoding them (e.g., using an offset and a scale). Top-k sparsification involves identifying the largest-magnitude elements and communicating their values and indices (see the sketch after this list).
  • Error Feedback Accumulator: Each worker needs to maintain an additional accumulator the size of the model parameters to store the accumulated compression error. This adds memory overhead but is shown to be essential for effective compression.
  • Communication Protocol: The communication step needs to handle the compressed delta, including potentially variable sizes (for Top-k) or fixed-size low-bit representations (for quantization). The all-reduce operation must work correctly on these compressed representations.
  • Hyperparameter Tuning: Beyond standard optimizer hyperparameters, tuning involves the number of local steps (H), compression parameters (e.g., k for Top-k, the number of bits for quantization), and the error feedback coefficient (β). The paper provides some tuned values for their specific setup (Table 3, Table 4).
  • Trade-offs: While communication is significantly reduced, the computational cost per communication round increases due to more local steps. The choice of H involves balancing local computation time against communication time. The overhead of compression/decompression is typically small compared to the time spent on local gradient computation.
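
To make the compression bullet above concrete, here is a minimal sketch of both schemes operating on a single flat tensor; the paper's exact encodings (and how Top-k indices are transmitted) may differ:

```python
import torch

def uniform_quantize(x, bits=2):
    """Uniform affine quantization to 2**bits levels (offset + scale), then decode.

    Generic sketch: values are mapped to the nearest of 2**bits evenly spaced levels
    spanning [x.min(), x.max()] and returned as dequantized floats so the (simulated)
    all-reduce can stay in floating point.
    """
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-12) / levels
    codes = torch.clamp(torch.round((x - lo) / scale), 0, levels)   # integer codes to transmit
    return codes * scale + lo                                       # decoded values

def top_k_sparsify(x, k_fraction=0.1):
    """Keep the k largest-magnitude entries of x; zero out the rest (values + indices sent)."""
    k = max(1, int(k_fraction * x.numel()))
    flat = x.flatten()
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(x)
```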

In summary, MuLoCo is presented as a practical advancement for communication-efficient LLM training. By combining the Muon optimizer with delta compression and error feedback within the DiLoCo framework, it achieves substantial communication savings (8x less data communicated at 2-bit quantization) while maintaining or improving convergence performance and having comparable memory requirements to existing DiLoCo variants. This makes it a promising approach for training large models in distributed environments with limited network bandwidth.

Authors (4)
  1. Benjamin Thérien (12 papers)
  2. Xiaolong Huang (29 papers)
  3. Irina Rish (85 papers)
  4. Eugene Belilovsky (68 papers)