Block-Periodic Orthogonalization (MuonBP)
- Block-Periodic Orthogonalization (MuonBP) is a communication-efficient method that applies local blockwise orthogonalization with periodic global reorthogonalization to minimize synchronization overhead.
- It uses distinct learning rates for block and global steps, matched to the corresponding spectral norms, so that convergence tracks that of full global orthogonalization in large-scale distributed training.
- MuonBP achieves up to an 8% throughput gain over full global orthogonalization without compromising model quality, making it ideal for optimizing billion-parameter neural networks.
Block-Periodic Orthogonalization (MuonBP) is a communication-efficient optimization technique for large-scale distributed deep learning, in which orthogonalization is applied independently to parameter matrix shards on each device for most iterations, and periodically a full global orthogonalization is performed. This procedure, closely studied in recent works, balances the throughput advantages of blockwise (local) operations with the numerical benefits of global orthogonalization, offering both practical and theoretical advances in training large neural networks under model-parallel settings (Khaled et al., 19 Oct 2025).
1. Motivation and Background
Gradient orthogonalization has become a valuable tool for enhancing the data efficiency of gradient-based optimization. The Muon optimizer, introduced by Jordan, Jin et al. (2024), applies a form of gradient orthogonalization derived from a Non-Euclidean Trust Region framework, outperforming coordinate-wise optimizers (e.g., AdamW) on large-scale LLM training in terms of convergence and final model quality. However, for model-parallel architectures where parameter matrices (and accordingly gradients and optimizer state) are distributed across devices, the naive full-matrix orthogonalization step incurs significant communication overhead: gradients from all shards must be gathered and orthogonalized, then scattered back, leading to a throughput reduction of 5–10% compared to Adam/AdamW (Khaled et al., 19 Oct 2025).
MuonBP is introduced to mitigate this bottleneck. The main idea is to align the orthogonalization granularity with the tensor-parallel or data-parallel device layout: for most optimization steps, each device orthogonalizes only its local block (or shard) of the gradient or optimizer buffer. Global orthogonalization, which requires communication, is performed only every $p$ iterations, where the period $p$ is typically a small integer. This approach leverages prior work on block (and block-periodic) orthogonalization in the context of distributed QR and Gram-Schmidt methods, particularly the communication-efficient block variants with periodic reorthogonalization (Carson et al., 2 May 2024, Carson et al., 19 Aug 2024).
2. Core Algorithmic Framework
Let $G_t$ denote the gradient matrix (or, in momentum-based variants, the momentum buffer $M_t$) at step $t$. In model-parallel environments, $G_t$ is partitioned across $N$ devices, such that device $i$ holds a block $G_t^{(i)}$.
MuonBP alternates between two types of steps:
- Blockwise Step ($t \bmod p \neq 0$): Each device independently orthogonalizes its local block $G_t^{(i)}$ without gathering data from other devices; this uses a local polynomial or iterative procedure (e.g., Newton–Schulz, SVD-based, or fixed-step polynomial mapping).
- Global Step ($t \bmod p = 0$): All devices gather and concatenate their blocks $G_t^{(i)}$ to form the full matrix $G_t$, perform global orthogonalization across the aggregated gradient, then redistribute the resulting blocks back to their respective devices.
Mathematically, the orthogonalization operator can be represented (informally) as
$$\mathrm{Orth}(G) = G\,(G^\top G)^{-1/2} = U V^\top \quad \text{for } G = U \Sigma V^\top,$$
where the matrix square root and inversion are usually approximated via iterative or polynomial procedures for computational tractability (Khaled et al., 19 Oct 2025).
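For concreteness, the sketch below shows one such iterative approximation: a plain cubic Newton–Schulz iteration that drives all singular values toward 1 and converges to the polar factor $UV^\top$. This is an illustrative stand-in, not the paper's implementation; Muon-style optimizers use a tuned odd-polynomial iteration, and the normalization and step count here are assumptions.

```python
import torch

def newton_schulz_orth(G: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Approximate Orth(G) = G (G^T G)^{-1/2} with a cubic Newton-Schulz iteration.

    Illustrative sketch only: production Muon variants use a tuned odd polynomial;
    this is the textbook cubic.
    """
    # Scale so every singular value lies in (0, sqrt(3)), the cubic's convergence region.
    # The Frobenius norm upper-bounds the spectral norm, so this scaling is always safe.
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X  # X_{k+1} = (3 X_k - X_k X_k^T X_k) / 2
    return X.T if transposed else X
```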
To preserve convergence, MuonBP employs two distinct step sizes: one for blockwise steps, scaled to the per-block norm, and one for global steps, scaled to the operator norm of the full matrix, both tuned according to theory-derived RMS scaling (Khaled et al., 19 Oct 2025). Hyperparameter transfer from full Muon to MuonBP principally involves adjusting these two learning rates.
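The resulting per-matrix update can then be sketched as below. This is a hedged illustration of the block/global alternation under an assumed row-wise sharding of each parameter matrix; the function and argument names (`muonbp_step`, `lr_block`, `lr_global`, `period`) and the default values are illustrative, not the reference implementation, and `newton_schulz_orth` refers to the sketch above.

```python
import torch
import torch.distributed as dist

def muonbp_step(param_shard, momentum_shard, grad_shard, step,
                lr_block, lr_global, beta=0.95, period=4):
    """One MuonBP-style update for this rank's shard (illustrative sketch).

    Assumes each rank holds a contiguous row-block of the full parameter matrix;
    the `beta` and `period` defaults are placeholders, not values from the paper.
    """
    # Local momentum accumulation (no communication).
    momentum_shard.mul_(beta).add_(grad_shard)

    if step % period != 0:
        # Block step: orthogonalize only the local shard, apply the block step size.
        update = newton_schulz_orth(momentum_shard)
        param_shard.add_(update, alpha=-lr_block)
    else:
        # Global step: gather all shards, orthogonalize the full matrix, keep our slice.
        world = dist.get_world_size()
        shards = [torch.empty_like(momentum_shard) for _ in range(world)]
        dist.all_gather(shards, momentum_shard)
        full_update = newton_schulz_orth(torch.cat(shards, dim=0))
        rows = momentum_shard.shape[0]
        start = dist.get_rank() * rows
        param_shard.add_(full_update[start:start + rows], alpha=-lr_global)
```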
3. Theoretical Analysis and Performance Guarantees
The convergence behavior of MuonBP is captured by a bound interpolating between full Muon (global orthogonalization at every step) and blockwise-only versions (BlockMuon). Let $p$ be the period between global steps. The key result is that, under standard stochastic optimization assumptions, MuonBP achieves data efficiency (iteration complexity, final loss, perplexity) comparable to full Muon, provided both the block and global step sizes are matched to the corresponding spectral norms (Khaled et al., 19 Oct 2025).
The convergence theorem ("Convergence of MuonBP") demonstrates that the overall rate depends on a harmonic average of smoothness constants derived from the block and global parameter partitions. The communication–efficiency trade-off is explicit: larger $p$ increases throughput but can degrade the convergence bound; smaller $p$ ensures closer alignment with the global scheme but increases communication overhead.
Practically, MuonBP achieves an 8% increase in throughput compared to full Muon with no loss in model quality when training an 8B parameter model across eight-way tensor parallelism with ZeRO optimizer state sharding (Khaled et al., 19 Oct 2025).
4. Communication, Implementation, and Scaling
Blockwise orthogonalization is performed entirely locally, without any inter-device communication, making it highly scalable and suitable for large device-count training. Only at global synchronization steps is communication required. The block boundaries are aligned with the device’s parameter or gradient partition; for example, in column-parallel linear layers each local matrix is treated as an atomic block.
Implementation details:
- On each iteration, the local momentum buffer is updated with the local gradient shard, e.g., $M_t^{(i)} = \beta M_{t-1}^{(i)} + G_t^{(i)}$ in the standard Muon momentum form.
- For block steps, $M_t^{(i)}$ is orthogonalized on each device independently.
- For full steps, the shards $M_t^{(i)}$ from all devices are gathered, a global orthogonalization is performed (using an iterative procedure such as Newton–Schulz), and the result is redistributed.
- This methodology is compatible with Megatron-LM, FSDP, and ZeRO-based distributed training frameworks.
When increasing model size or device count, MuonBP's amortized communication cost per update is effectively reduced by a factor of $p$ compared to baseline Muon.
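As a back-of-the-envelope illustration (assuming only the gather/scatter frequency changes and ignoring overlap with computation), if one global orthogonalization moves $C_{\text{global}}$ bytes for a given parameter matrix, the amortized volume per optimizer step is roughly
$$\bar{C}_{\text{comm}} \approx \frac{C_{\text{global}}}{p}.$$
The end-to-end throughput gain is smaller than this factor (about 8% in the cited 8B-parameter run) because orthogonalization communication is only one component of the total step time.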
5. Numerical and Empirical Properties
Experiments on diverse architectures and model sizes (from 160M to 8B parameters) confirm that MuonBP:
- Achieves model training curves and final perplexity essentially indistinguishable from full Muon,
- Is at least as iteration-efficient (data-efficient) as full Muon,
- Offers per-iteration wall-clock throughput similar to AdamW, with a throughput gain of approximately 8% over baseline Muon for large models (Khaled et al., 19 Oct 2025).
Block orthogonalization alone (without periodic global steps) can eventually lag in convergence, as local blocks drift apart in their update geometry. Periodic global steps correct this drift, which explains the observed trade-off between period length and performance.
6. Depth from Related Block-Orthogonalization Analyses
Block-Periodic Orthogonalization sits at the intersection of communication-avoiding block Gram–Schmidt algorithms and modern scalable deep learning optimizers. Contemporary work demonstrates that for block classical Gram–Schmidt, periodic global reorthogonalization or use of a strong "intraorthogonalization" subroutine is crucial for maintaining numerical stability as the number of synchronization points is reduced (Carson et al., 2 May 2024, Carson et al., 19 Aug 2024). In the MuonBP setting, this theoretical insight translates to the requirement for regular global orthogonalization to prevent accumulation of blockwise numerical error and conditioning-induced loss of orthogonality.
Studies on scaling laws for iterative orthogonalization (Selvaraj, 6 May 2025) reveal that as the size of the matrices increases, the singular values of random gradient matrices contract, which can affect the efficacy of the polynomial orthogonalization on blocks. This necessitates careful design of the blockwise orthogonalizer, adjustment of polynomial parameters, or increased iteration count to ensure all directions (even those with small singular values) are adequately corrected—especially for very large blocks, common in current LLM training.
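A hedged illustration of this effect, using i.i.d. Gaussian matrices as a stand-in for gradient blocks and normalizing by the Frobenius norm as in the Newton–Schulz sketch above: the normalized singular values shrink as the matrix grows, so a fixed polynomial budget corrects small-singular-value directions less completely.

```python
import torch

# Stand-in experiment, not from the cited work: Frobenius-normalized Gaussian
# matrices have singular values that shrink with dimension, so a fixed number
# of polynomial iterations leaves the smallest directions under-corrected.
for n in (128, 512, 2048):
    G = torch.randn(n, n)
    s = torch.linalg.svdvals(G / G.norm())
    print(f"n={n:4d}  max sigma={s.max():.4f}  min sigma={s.min():.2e}")
```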
7. Practical Recommendation and Limitations
MuonBP strikes an effective balance between communication minimization and robust convergence in distributed optimization. Adoption requires setting an appropriate period $p$ (empirically, a small fixed period is robust across diverse models), using separate learning rates for block and full steps (tuned via RMS scaling), and ensuring the block partition matches the device's parameter sharding. The theoretical framework indicates that, for ill-conditioned parameter spaces or poorly chosen polynomial orthogonalization, increasing the frequency of global steps may be required.
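These knobs can be summarized in a hypothetical configuration sketch; every name and value below is illustrative only, not an API or setting from the cited work.

```python
# Hypothetical MuonBP configuration; names and placeholder values are illustrative only.
muonbp_config = {
    "period": 4,        # p: steps between global orthogonalizations (placeholder)
    "lr_block": 1e-2,   # step size for local block steps, RMS-scaled (placeholder)
    "lr_global": 1e-2,  # step size for global steps, operator-norm scaled (placeholder)
    "momentum": 0.95,   # Muon-style momentum coefficient (placeholder)
}
```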
Potential limitations include loss of orthogonality within blocks in the absence of full steps, and the need to retune polynomial orthogonalizer parameters for very large models to mitigate shrinking singular values, as indicated by random matrix theory and scaling law analyses (Selvaraj, 6 May 2025).
In summary, Block-Periodic Orthogonalization (MuonBP) provides a scalable and theoretically grounded framework for modern deep learning optimization with strong empirical and analytical support for its use in billion-parameter scale training (Khaled et al., 19 Oct 2025, Carson et al., 2 May 2024, Carson et al., 19 Aug 2024, Selvaraj, 6 May 2025).