MACRO: Msign-Aligned Riemannian Optimization
- The paper presents the MACRO framework that enforces geometry-aware constraints on weight matrices, ensuring provable convergence for nonconvex, stochastic optimization.
- MACRO integrates the msign operator to extract steepest directions via tangent space projection, unifying techniques like RMSNorm and decoupled weight decay.
- Empirical evaluations demonstrate that MACRO achieves stability and competitive perplexity in LLM pre-training through a single-loop, geometry-centric update.
The Msign-Aligned Constrained Riemannian Optimizer (MACRO) is a provably convergent, single-loop optimization framework designed for stochastic, nonconvex optimization problems with explicit manifold constraints, specifically motivated by LLM pre-training. MACRO systematically unifies and subsumes heuristic stabilization techniques—such as explicit normalization layers (e.g., RMSNorm) and decoupled weight decay—by enforcing geometry-aware constraints on weight matrices. It achieves stability and competitive pre-training perplexity while rigorously guaranteeing exact Riemannian optimization, offering both theoretical and practical advancements over conventional methods (An et al., 6 May 2026).
1. Problem Formulation and Constraint Geometry
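MACRO addresses the optimization problem $\min_{W \in \mathcal{M}} f(W)$, where $W$ is a weight matrix, $f$ is the expected loss, and $\mathcal{M}$ is a constraint manifold. Common choices for $\mathcal{M}$ include:
- Frobenius sphere: $\{W : \|W\|_F = R\}$
- Spectral sphere: $\{W : \|W\|_2 = R\}$
- Oblique manifold: fixed per-row or per-column $\ell_2$-norm
For any $W \in \mathcal{M}$, the tangent space $T_W\mathcal{M}$ is defined as $\{\Xi : \langle N_W, \Xi \rangle = 0\}$, with $N_W$ reflecting the constraint (e.g., $N_W = W$ for Frobenius). The projection onto the tangent space (Riemannian gradient) for the Frobenius sphere is
$$P_W(G) = G - \frac{\langle W, G \rangle}{\|W\|_F^2}\, W,$$
and for the spectral sphere, given $(u_1, v_1)$ the top singular vectors,
$$P_W(G) = G - (u_1^\top G v_1)\, u_1 v_1^\top.$$
Retracting back to the manifold after an update $\Xi$,
$$W^{+} = \mathrm{Retr}_W(\Xi),$$
which, in practice, is $R\,(W + \Xi)/\|W + \Xi\|_F$ for the Frobenius sphere and $R\,(W + \Xi)/\|W + \Xi\|_2$ (approximate) for the spectral sphere (An et al., 6 May 2026).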
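The tangent projections and norm-rescaling retractions described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the radius argument `R` are hypothetical, the spectral projection assumes a simple (non-repeated) top singular value, and the spectral norm is computed exactly by SVD rather than approximated.

```python
import numpy as np

def project_frobenius(W, G):
    """Remove the component of G along W (the normal direction on the Frobenius sphere)."""
    return G - (np.sum(W * G) / np.sum(W * W)) * W

def project_spectral(W, G):
    """Remove the component of G along u1 v1^T; assumes the top singular value is simple."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    u1, v1 = U[:, 0], Vt[0, :]
    return G - (u1 @ G @ v1) * np.outer(u1, v1)

def retract_frobenius(W, Xi, R):
    """Rescale W + Xi back to Frobenius norm R."""
    Y = W + Xi
    return R * Y / np.linalg.norm(Y)  # default matrix norm is Frobenius

def retract_spectral(W, Xi, R):
    """Rescale W + Xi back to spectral norm R (exact norm via SVD in this sketch)."""
    Y = W + Xi
    return R * Y / np.linalg.norm(Y, 2)

rng = np.random.default_rng(0)
R = 2.0                                   # hypothetical radius
W = rng.standard_normal((4, 3))
W = R * W / np.linalg.norm(W)             # start on the Frobenius sphere
G = rng.standard_normal((4, 3))           # stand-in for a gradient

Xi = project_frobenius(W, G)
W_next = retract_frobenius(W, -0.1 * Xi, R)
print(np.sum(W * Xi))                     # close to 0: Xi is tangent at W
print(np.linalg.norm(W_next))             # close to R: back on the sphere
```

Both retractions only rescale the updated matrix, which is what keeps the per-step cost low compared with an iterative projection.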
2. Msign-Aligned Update Rule
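The core innovation of MACRO lies in its use of the matrix-sign (msign) operator, which is defined as $\mathrm{msign}(M) = U V^\top$ for the SVD $M = U \Sigma V^\top$. The msign operator solves the linear minimization oracle (LMO) over the unit spectral-norm ball:
$$-\mathrm{msign}(M) \in \arg\min_{\|D\|_2 \le 1} \langle M, D \rangle.$$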
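A short numerical check can make the LMO property concrete. The sketch below computes msign from a thin SVD and verifies that its inner product with the input equals the nuclear norm, which no matrix of unit spectral norm can exceed. The helper name `msign` and the test sizes are illustrative choices; practical implementations often replace the SVD with an iterative approximation, which is not shown here.

```python
import numpy as np

def msign(M):
    """Matrix sign via thin SVD: M = U S V^T  ->  U V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))
D = msign(M)

best = np.sum(M * D)                      # <M, msign(M)>
nuclear = np.linalg.norm(M, "nuc")        # sum of singular values of M

# Compare against random competitors with unit spectral norm.
smallest_gap = np.inf
for _ in range(100):
    C = rng.standard_normal((5, 3))
    C /= np.linalg.norm(C, 2)
    smallest_gap = min(smallest_gap, best - np.sum(M * C))

print(best, nuclear)                      # the two values agree
print(smallest_gap)                       # non-negative: no competitor does better
```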
The optimization step at iteration $t$ proceeds as follows:
- Project the momentum onto the tangent space.
- Extract the steepest direction via the LMO (msign).
- Normalize and scale the update direction.
- Take the descent step, then retract to the manifold.
This procedure yields a Riemannian steepest-descent step aligned via the msign operator and combined with efficient, explicit retraction, thus performing a true single-loop geometric update (An et al., 6 May 2026).
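The four-step update can be sketched for the Frobenius sphere as below. This is an illustration under assumptions rather than the paper's exact rule: the normalization (rescaling the msign direction to Frobenius norm `R`, so that `eta` acts as a relative step size) is a choice made for this sketch, and `macro_like_step` is a hypothetical name.

```python
import numpy as np

def msign(M):
    """Matrix sign via thin SVD: M = U S V^T  ->  U V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def macro_like_step(W, M, eta, R):
    """One projected, msign-aligned step on the Frobenius sphere of radius R."""
    # 1. Project the momentum onto the tangent space at W.
    M_tan = M - (np.sum(W * M) / np.sum(W * W)) * W
    # 2. Extract the steepest direction with msign.
    D = msign(M_tan)
    # 3. Normalize and scale (assumption: Frobenius norm R, so eta is relative).
    D = R * D / np.linalg.norm(D)
    # 4. Descent step followed by retraction (rescale back to norm R).
    Y = W - eta * D
    return R * Y / np.linalg.norm(Y)

rng = np.random.default_rng(0)
R = 1.0
W = rng.standard_normal((6, 4))
W = R * W / np.linalg.norm(W)
M = rng.standard_normal((6, 4))           # stand-in for a momentum buffer

W_next = macro_like_step(W, M, eta=0.05, R=R)
print(np.linalg.norm(W_next))             # close to R: the constraint holds after the step
```

Because the constraint is restored by a single rescaling, the step needs no inner loop.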
3. Algorithmic Structure and Hyperparameters
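Pseudocode for MACRO defines the following steps per iteration:
- Sample a stochastic gradient.
- Update the momentum buffer.
- Project the momentum onto the tangent space.
- Apply msign to the projected momentum.
- Normalize and scale the resulting direction.
- Take the descent step and retract to the manifold.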
Optimal theoretical hyperparameter settings include 5, 6, and 7 selected via activation-control theory (detailed below). Batch sizes are typically 64–128. The single-loop structure provides practical efficiency, avoiding the need for double-loop exact projections as in some prior Riemannian solvers (An et al., 6 May 2026).
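To show how a single-loop structure of this kind looks end to end, the sketch below runs a MACRO-like loop on a toy constrained least-squares problem. Everything specific here is an assumption made for illustration: the toy objective, the exponential-moving-average momentum, the `eta0 / sqrt(t + 1)` schedule, the values of `beta`, `eta0`, and `R`, and the normalization of the direction. None of these are taken from the paper.

```python
import numpy as np

def msign(M):
    """Matrix sign via thin SVD: M = U S V^T  ->  U V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def run_macro_like(steps=300, R=1.0, beta=0.9, eta0=0.2, seed=0):
    """Single-loop, msign-aligned descent on the Frobenius sphere (toy problem)."""
    rng = np.random.default_rng(seed)
    m, n, N = 8, 6, 256
    W_star = rng.standard_normal((m, n))
    W_star *= R / np.linalg.norm(W_star)           # target lies on the sphere
    X = rng.standard_normal((n, N))
    Y = W_star @ X
    W = rng.standard_normal((m, n))
    W *= R / np.linalg.norm(W)                     # feasible initialization
    M = np.zeros_like(W)
    losses = []
    for t in range(steps):
        idx = rng.choice(N, size=64, replace=False)     # minibatch indices
        Xb, Yb = X[:, idx], Y[:, idx]
        G = (W @ Xb - Yb) @ Xb.T / idx.size             # stochastic gradient
        M = beta * M + (1.0 - beta) * G                 # momentum (EMA form, assumed)
        M_tan = M - (np.sum(W * M) / np.sum(W * W)) * W # tangent projection
        D = msign(M_tan)                                # steepest direction
        D *= R / np.linalg.norm(D)                      # normalization (assumed)
        eta = eta0 / np.sqrt(t + 1.0)                   # decaying step (assumed)
        W = W - eta * D                                 # descent
        W *= R / np.linalg.norm(W)                      # retraction
        losses.append(0.5 * np.mean(np.sum((W @ X - Y) ** 2, axis=0)))
    return W, losses

W_final, losses = run_macro_like()
print(np.linalg.norm(W_final))    # stays at the radius R
print(losses[0], losses[-1])      # full-data loss after the first and last step
```

Each iteration performs one projection, one msign, and one rescaling, with no inner solver.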
4. Theoretical Guarantees
Under the assumptions that:
- (A1) 8 is a compact 9 manifold,
- (A2) $f$ is lower-bounded and $L$-smooth, and the stochastic gradients are unbiased with variance $\sigma^2$,
it is established that, with suitable 3 and learning-rate schedule 4, the convergence rate satisfies: 5 This rate matches the optimal for stochastic, nonconvex, constrained problems. The proof leverages the retraction-smoothness lemma and momentum-variance control, with the msign operator ensuring that descent occurs along the direction of steepest Riemannian reduction (An et al., 6 May 2026).
5. Mechanistic Interpretations
Activation Scale Control: For a linear layer 6 and RMS7, manifold constraints guarantee bounded activations:
- Spectral sphere: choosing 8 ensures 9.
- Frobenius sphere: 0 ensures 1 is bounded, yielding RMS at 2. Hence, set 3, 4, 5 per activation-bound theory.
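The inequality behind such activation bounds can be checked numerically. For a linear layer $y = Wx$ with $W$ of shape $d_\text{out} \times d_\text{in}$ and $\mathrm{RMS}(v) = \|v\|_2 / \sqrt{\dim v}$, a spectral-norm bound $\|W\|_2 \le R$ gives $\mathrm{RMS}(y) \le R \sqrt{d_\text{in}/d_\text{out}}\, \mathrm{RMS}(x)$, and the same bound follows from $\|W\|_F \le R$ because $\|W\|_2 \le \|W\|_F$. The sketch below verifies this; the dimensions and the radius are arbitrary illustrative values, not the paper's recommended settings.

```python
import numpy as np

def rms(v):
    """Root-mean-square of a vector."""
    return np.linalg.norm(v) / np.sqrt(v.size)

rng = np.random.default_rng(0)
d_in, d_out, R = 64, 32, 1.5              # illustrative values only

W = rng.standard_normal((d_out, d_in))
W_spec = R * W / np.linalg.norm(W, 2)     # on the spectral sphere of radius R
W_fro = R * W / np.linalg.norm(W)         # on the Frobenius sphere of radius R

x = rng.standard_normal(d_in)
bound = R * np.sqrt(d_in / d_out) * rms(x)

print(rms(W_spec @ x), bound)             # spectral constraint: within the bound
print(rms(W_fro @ x), bound)              # Frobenius constraint: within the bound, usually far below it
```

The Frobenius constraint is the looser of the two for the same radius, which is why its radius is typically chosen differently.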
Interplay with RMSNorm: In transformer architectures, RMSNorm learns a per-layer scale 6. As 7 increases, 8 shrinks to keep post-norm activation constant. If all learnable RMSNorms are removed, standard optimizers diverge, but MACRO remains stable at standard learning rates (3e-3–1e-2), demonstrating the sufficiency of explicit manifold constraints for scale control.
Interaction with Weight Decay: Conventional decoupled weight decay heuristically enforces:
- Relative learning rate 9,
- Rotational equilibrium 0 const.
MACRO enforces these exactly from initialization:
- Relative-LR: 1
- Frobenius rotational equilibrium: rotational angle 2
- Spectral rotational equilibrium: 3 where the spectral gap modulates rotation.
Thus, MACRO subsumes the functionality of both RMSNorm and weight decay via explicit geometry (An et al., 6 May 2026).
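The fixed-rotation behaviour on the Frobenius sphere can be illustrated with a small experiment. If the update is exactly tangent and has Frobenius norm equal to `eta` times the radius, the angle between consecutive iterates is `arctan(eta)` regardless of the gradient's magnitude. The sketch below is a simplification made for illustration: it uses the normalized tangent gradient directly rather than an msign direction (the msign of a tangent matrix is not itself tangent in general), and the names are hypothetical.

```python
import numpy as np

def rotate_once(W, G, eta, R):
    """Tangent step of relative length eta on the Frobenius sphere, then retraction."""
    Xi = G - (np.sum(W * G) / np.sum(W * W)) * W    # tangent component of G
    Xi = eta * R * Xi / np.linalg.norm(Xi)          # fixed relative step length
    Y = W - Xi
    W_next = R * Y / np.linalg.norm(Y)
    cos = np.sum(W * W_next) / (np.linalg.norm(W) * np.linalg.norm(W_next))
    return W_next, np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
R, eta = 3.0, 0.05
W = rng.standard_normal((6, 4))
W = R * W / np.linalg.norm(W)

angles = []
for scale in (1e-3, 1.0, 1e3):                      # gradient magnitude varies widely
    G = scale * rng.standard_normal((6, 4))
    W, angle = rotate_once(W, G, eta, R)
    angles.append(angle)

print(angles)                                       # all equal to arctan(eta)
print(np.arctan(eta))
```

The relative step `||Xi||_F / ||W||_F` equals `eta` at every iteration by construction, so no warm-up toward an equilibrium is needed.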
6. Empirical Evaluation
Evaluations on Qwen3-like architectures (RoPE + GQA + SwiGLU) at 120M, 330M, and 1B parameters, trained on OpenWebText with token budgets of 3.7B/8.9B/50B, show that MACRO achieves validation perplexities competitive with or superior to baselines, including Muon, MuonH-fro/spec, SSO, and FSO. For example:
| Model | Muon | MuonH-fro | MuonH-spec | SSO | FSO | MACRO-fro | MACRO-spec |
|---|---|---|---|---|---|---|---|
| 120M | 3.019 | 3.007 | 3.019 | 3.011 | 3.001 | 3.005 | 3.017 |
| 330M | 2.736 | 2.717 | 2.716 | 2.712 | 2.726 | 2.718 | 2.714 |
| 1B | 2.473 | 2.468 | 2.464 | -- | -- | 2.467 | 2.461 |
In normalization-free settings (removing learnable RMSNorm at 330M, LR=4), Muon diverges, while MACRO remains stable with validation losses near 2.76 (fro) and 2.74 (spec).
Further, gradient norms under MACRO decay smoothly (by approximately 30× over training), without the late-stage blowup seen with AdamW plus weight decay. Zero-shot μP transfer shows that optimal learning rates remain stable under MACRO as width is scaled. Tangent-space projection residuals remain low (5–6), and performance is comparable to double-loop methods despite its single-loop efficiency (An et al., 6 May 2026).
7. Implementation Insights and Recommendations
MACRO is best suited to scenarios requiring rigorous, geometry-aware constraints for stability in deep or large LLM pre-training, especially when minimal tuning of RMSNorm and weight decay is desired. Implementation guidelines include:
- Choose 7 via activation-bound theory: 8, 9 with 0.
- Set alignment 1 (explore 2).
- Use 3 or 0.9 for momentum.
- Approximate spectral retraction by normalizing with 4.
- Monitor gradient norms and NaN incidence.
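One common way to approximate the spectral norm needed for a spectral retraction is power iteration. Whether MACRO uses this particular estimator is an assumption of this sketch; the function name, iteration count, and matrix sizes are illustrative. The estimate never exceeds the true top singular value, so a matrix rescaled with it can end up with spectral norm slightly above the target radius.

```python
import numpy as np

def spectral_norm_power_iter(W, n_iter=20, seed=0):
    """Estimate the top singular value of W by alternating power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)               # u and v are unit vectors, so this is <= sigma_1

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
R = 1.0                                   # hypothetical radius

est = spectral_norm_power_iter(W)
exact = np.linalg.norm(W, 2)
W_retracted = R * W / est                 # approximate spectral retraction

print(est, exact)                         # estimate is at most the exact value
print(np.linalg.norm(W_retracted, 2))     # at least R, by the estimate's one-sided error
print(np.isfinite(W_retracted).all())     # simple NaN/Inf check on the result
```

Convergence slows when the top two singular values are close, which is the regime where an approximate retraction is least accurate.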
Possible pitfalls include setting the constraint radius too small (leading to under-capacity) or too large (leading to loss of scale control). In addition, approximate spectral retractions can violate strict compactness in rare cases of nearly repeated singular values. The tangent-space projection step is inexpensive, and omitting it forfeits the Riemannian guarantees.
In summary, MACRO delivers a unified, geometry-centric approach to LLM pre-training optimization, obviating the need for extensive heuristic tuning of normalization and weight regularization, and achieves strong stability and competitive perplexity in practice (An et al., 6 May 2026).