MACRO: Msign-Aligned Riemannian Optimization

Updated 11 May 2026
  • The paper presents the MACRO framework that enforces geometry-aware constraints on weight matrices, ensuring provable convergence for nonconvex, stochastic optimization.
  • MACRO integrates the msign operator to extract steepest directions via tangent space projection, unifying techniques like RMSNorm and decoupled weight decay.
  • Empirical evaluations demonstrate that MACRO achieves stability and competitive perplexity in LLM pre-training through a single-loop, geometry-centric update.

The Msign-Aligned Constrained Riemannian Optimizer (MACRO) is a provably convergent, single-loop optimization framework designed for stochastic, nonconvex optimization problems with explicit manifold constraints, specifically motivated by LLM pre-training. MACRO systematically unifies and subsumes heuristic stabilization techniques—such as explicit normalization layers (e.g., RMSNorm) and decoupled weight decay—by enforcing geometry-aware constraints on weight matrices. It achieves stability and competitive pre-training perplexity while rigorously guaranteeing exact Riemannian optimization, offering both theoretical and practical advancements over conventional methods (An et al., 6 May 2026).

1. Problem Formulation and Constraint Geometry

MACRO addresses the optimization problem

$$\text{minimize} \quad \mathcal{L}(W) = \mathbb{E}_{\xi \sim D}[\ell(W;\xi)] \quad \text{subject to} \quad W \in \mathcal{M},$$

where $W \in \mathbb{R}^{n \times m}$ is a weight matrix, $\mathcal{L}$ is the expected loss, and $\mathcal{M}$ is a constraint manifold. Common choices for $\mathcal{M}$ include:

  • Frobenius sphere: $\mathcal{M}_F(R)=\{W : \|W\|_F = R\}$
  • Spectral sphere: $\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}$
  • Oblique manifold: per-row or per-column $\ell_2$-norm $= R$

For any $W \in \mathcal{M}$, the tangent space $T_W\mathcal{M}$ is defined as $\{\xi \in \mathbb{R}^{n \times m} : \langle N(W), \xi \rangle = 0\}$, with the normal direction $N(W)$ reflecting the constraint (e.g., $N(W) = W$ for Frobenius). The projection onto the tangent space (Riemannian gradient) for the Frobenius sphere is

$$P_W(G) = G - \frac{\langle W, G \rangle}{\|W\|_F^2}\, W$$

and for the spectral sphere, given $u_1, v_1$ the top singular vectors of $W$,

$$P_W(G) = G - (u_1^\top G\, v_1)\, u_1 v_1^\top.$$

Retracting back to the manifold after an update $\Delta$,

$$W^{+} = \mathrm{Retr}_W(\Delta),$$

which, in practice, is $R\,(W+\Delta)/\|W+\Delta\|_F$ for the Frobenius sphere and $R\,(W+\Delta)/\|W+\Delta\|_2$ (approximate) for the spectral sphere (An et al., 6 May 2026).
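The tangent-space projections and normalization-based retractions described above can be checked numerically. The following is a minimal NumPy sketch, assuming the Euclidean inner product and a simple (non-repeated) top singular value; the function names are hypothetical and not taken from the paper.

```python
import numpy as np

def project_frobenius(W, G):
    """Remove the component of G along W (tangent space of the Frobenius sphere)."""
    return G - (np.sum(W * G) / np.sum(W * W)) * W

def project_spectral(W, G):
    """Remove the component of G along u1 v1^T (assumes a simple top singular value)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    u1, v1 = U[:, 0], Vt[0, :]
    return G - (u1 @ G @ v1) * np.outer(u1, v1)

def retract_frobenius(W, R):
    """Rescale W so that its Frobenius norm equals R."""
    return R * W / np.linalg.norm(W)

def retract_spectral(W, R):
    """Rescale W so that its largest singular value equals R."""
    return R * W / np.linalg.norm(W, ord=2)

rng = np.random.default_rng(0)
R = 2.0
G = rng.standard_normal((4, 3))

# Frobenius sphere: project, take a small step, retract.
W = retract_frobenius(rng.standard_normal((4, 3)), R)
xi = project_frobenius(W, G)
frob_inner = float(np.sum(W * xi))            # ~0: xi is tangent at W
W_next = retract_frobenius(W - 0.1 * xi, R)

# Spectral sphere: the projected gradient has no u1 v1^T component.
Ws = retract_spectral(rng.standard_normal((4, 3)), R)
xs = project_spectral(Ws, G)
U, _, Vt = np.linalg.svd(Ws, full_matrices=False)
spec_inner = float(U[:, 0] @ xs @ Vt[0, :])   # ~0: xs is tangent at Ws
```

Both projections subtract a single rank-one (or W-aligned) component. The spectral version here uses a full SVD for clarity rather than speed.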

2. Msign-Aligned Update Rule

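The core innovation of MACRO lies in its use of the matrix-sign (msign) operator, which is defined as $\operatorname{msign}(G) = UV^\top$ for the SVD $G = U\Sigma V^\top$. Up to sign, the msign operator solves the linear minimization oracle (LMO) over the spectral-norm unit ball:

$$-\operatorname{msign}(G) \in \arg\min_{\|X\|_2 \le 1} \langle G, X \rangle.$$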
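As a small illustration of the msign operator and its oracle property, the sketch below computes msign through a thin SVD and checks two standard facts: its inner product with the input equals the nuclear norm, and no other matrix in the spectral-norm unit ball does better. Using an exact SVD is an assumption made for clarity; the source does not say how msign is evaluated in practice.

```python
import numpy as np

def msign(G):
    """Matrix sign of G: U @ V^T from the thin SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
D = msign(G)

nuclear_norm = float(np.linalg.svd(G, compute_uv=False).sum())
inner = float(np.sum(G * D))                  # <G, msign(G)> equals the nuclear norm
spec_norm = float(np.linalg.norm(D, ord=2))   # msign(G) lies on the spectral unit ball

# Any other matrix with spectral norm 1 gives a smaller (or equal) inner product.
X = rng.standard_normal((5, 3))
X /= np.linalg.norm(X, ord=2)
competitor = float(np.sum(G * X))
```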

The optimization step at iteration $t$ proceeds as follows:

  1. Project momentum onto tangent space: $\tilde{M}_t = P_{W_t}(M_t)$
  2. Extract steepest direction via LMO: $D_t = \operatorname{msign}(\tilde{M}_t)$
  3. Normalize and scale update direction, giving $\hat{D}_t$
  4. Descent plus retraction: $W_{t+1} = \mathrm{Retr}_{W_t}(-\eta_t \hat{D}_t)$

This procedure yields a Riemannian steepest-descent step aligned via the msign operator and combined with efficient, explicit retraction, thus performing a true single-loop geometric update (An et al., 6 May 2026).
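The four steps can be sketched on the Frobenius sphere as follows. This is an illustrative sketch, not the paper's exact rule: the normalization in step 3 (unit Frobenius norm, then a step of relative size `lr`), the exponential-moving-average momentum, the toy quadratic loss, and the name `macro_like_step` are all assumptions introduced here.

```python
import numpy as np

def msign(G):
    """Matrix sign: U @ V^T from the thin SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def macro_like_step(W, M, R, lr):
    """One illustrative step on the Frobenius sphere of radius R."""
    M_tan = M - (np.sum(W * M) / np.sum(W * W)) * W   # 1. project momentum onto the tangent space
    D = msign(M_tan)                                  # 2. steepest direction via the LMO
    D = D / np.linalg.norm(D)                         # 3. ASSUMED normalization to unit Frobenius norm
    W_new = W - lr * R * D                            # 4a. descent with relative step size lr
    return R * W_new / np.linalg.norm(W_new)          # 4b. retraction back onto the sphere

rng = np.random.default_rng(0)
R, lr, beta = 1.0, 0.02, 0.9
T = rng.standard_normal((4, 3)); T *= R / np.linalg.norm(T)   # toy target on the sphere
W = rng.standard_normal((4, 3)); W *= R / np.linalg.norm(W)   # starting point on the sphere

def loss(W):
    return 0.5 * float(np.sum((W - T) ** 2))

loss_start = loss(W)
M = np.zeros_like(W)
for _ in range(10):
    G = W - T                        # gradient of the toy loss
    M = beta * M + (1 - beta) * G    # EMA momentum (assumed form)
    W = macro_like_step(W, M, R, lr)
loss_end = loss(W)
final_norm = float(np.linalg.norm(W))
```

Because the retraction is applied inside every step, the iterate stays on the sphere without an inner projection loop, which is the single-loop property the text describes.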

3. Algorithmic Structure and Hyperparameters

Pseudocode for MACRO defines the following steps per iteration:

  1. L\mathcal{L}9
  2. M\mathcal{M}0
  3. M\mathcal{M}1
  4. M\mathcal{M}2
  5. M\mathcal{M}3
  6. M\mathcal{M}4

Optimal theoretical hyperparameter settings include M\mathcal{M}5, M\mathcal{M}6, and M\mathcal{M}7 selected via activation-control theory (detailed below). Batch sizes are typically 64–128. The single-loop structure provides practical efficiency, avoiding the need for double-loop exact projections as in some prior Riemannian solvers (An et al., 6 May 2026).

4. Theoretical Guarantees

Under the assumptions that:

  • (A1) M\mathcal{M}8 is a compact M\mathcal{M}9 manifold,
  • (A2) $\mathcal{L}$ is lower-bounded and $L$-smooth, and the stochastic gradients are unbiased with variance bounded by $\sigma^2$,

it is established that, with suitable M\mathcal{M}3 and learning-rate schedule M\mathcal{M}4, the convergence rate satisfies: M\mathcal{M}5 This rate matches the optimal for stochastic, nonconvex, constrained problems. The proof leverages the retraction-smoothness lemma and momentum-variance control, with the msign operator ensuring that descent occurs along the direction of steepest Riemannian reduction (An et al., 6 May 2026).

5. Mechanistic Interpretations

Activation Scale Control: For a linear layer M\mathcal{M}6 and RMSM\mathcal{M}7, manifold constraints guarantee bounded activations:

  • Spectral sphere: choosing M\mathcal{M}8 ensures M\mathcal{M}9.
  • Frobenius sphere: MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}0 ensures MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}1 is bounded, yielding RMS at MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}2. Hence, set MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}3, MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}4, MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}5 per activation-bound theory.
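The spectral-sphere bound follows from $\|Wx\|_2 \le \|W\|_2 \|x\|_2$: for $W \in \mathbb{R}^{d_{out} \times d_{in}}$ with $\|W\|_2 = R$, this gives $\mathrm{RMS}(Wx) \le R\sqrt{d_{in}/d_{out}}\,\mathrm{RMS}(x)$. The sketch below checks this numerically. The radius $R = \sqrt{d_{out}/d_{in}}$ is an assumption chosen here so that the bound equals 1; the radius prescribed in the source is not recoverable from the text above.

```python
import numpy as np

def rms(v, axis=0):
    """Root-mean-square along the given axis."""
    return np.sqrt(np.mean(v ** 2, axis=axis))

d_in, d_out = 64, 256
R = np.sqrt(d_out / d_in)                 # ASSUMED radius: makes the RMS bound equal to 1
rng = np.random.default_rng(0)

# Random weight matrix rescaled onto the spectral sphere of radius R.
W = rng.standard_normal((d_out, d_in))
W *= R / np.linalg.norm(W, ord=2)

# 1000 random inputs (columns), each normalized to RMS 1.
X = rng.standard_normal((d_in, 1000))
X /= rms(X)

Y = W @ X
worst_rms = float(rms(Y).max())           # never exceeds 1 under this choice of R
```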

Interplay with RMSNorm: In transformer architectures, RMSNorm learns a per-layer scale MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}6. As MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}7 increases, MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}8 shrinks to keep post-norm activation constant. If all learnable RMSNorms are removed, standard optimizers diverge, but MACRO remains stable at standard learning rates (3e-3–1e-2), demonstrating the sufficiency of explicit manifold constraints for scale control.

Interaction with Weight Decay: Conventional decoupled weight decay heuristically enforces:

  • Relative learning rate MF(R)={W:∥W∥F=R}\mathcal{M}_F(R)=\{W : \|W\|_F = R\}9,
  • Rotational equilibrium MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}0 const.

MACRO enforces these exactly from initialization:

  • Relative-LR: MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}1
  • Frobenius rotational equilibrium: rotational angle MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}2
  • Spectral rotational equilibrium: MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}3 where the spectral gap modulates rotation.

Thus, MACRO subsumes the functionality of both RMSNorm and weight decay via explicit geometry (An et al., 6 May 2026).
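To make the rotational-equilibrium claim concrete for the Frobenius sphere: a tangent update of Frobenius norm $s$ followed by renormalization rotates $W$ by exactly $\arctan(s/R)$, whatever the gradient is. The sketch below verifies this elementary geometric fact. It is not the paper's own expression, and `rotate_once` and the fixed update norm `step` are hypothetical choices made for the example.

```python
import numpy as np

def rotate_once(W, G, R, step):
    """Tangent update of fixed Frobenius norm `step`, then retraction to radius R."""
    xi = G - (np.sum(W * G) / np.sum(W * W)) * W   # tangent component of G
    xi *= step / np.linalg.norm(xi)                # fix the update norm
    W_new = W - xi
    return R * W_new / np.linalg.norm(W_new)

rng = np.random.default_rng(0)
R, step = 3.0, 0.15
W = rng.standard_normal((5, 4))
W *= R / np.linalg.norm(W)

angles = []
for _ in range(5):
    W_next = rotate_once(W, rng.standard_normal(W.shape), R, step)
    cos = np.sum(W * W_next) / R ** 2
    angles.append(float(np.arccos(np.clip(cos, -1.0, 1.0))))
    W = W_next

predicted = float(np.arctan(step / R))   # same angle at every step
relative_update = step / R               # ||update|| / ||W|| is constant as well
```

The angle and the relative update size are fixed from the first step onward, which is the sense in which the constraint holds these quantities exactly rather than letting them settle over training.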

6. Empirical Evaluation

Evaluations on Qwen3-like architectures (RoPE+GQA+SwiGLU) at 120M, 330M, and 1B parameters, trained on OpenWebText with token budgets of 3.7B, 8.9B, and 50B respectively, show that MACRO achieves validation perplexities competitive with or superior to baselines, including Muon, MuonH-fro/spec, SSO, and FSO. For example:

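| Model | Muon | MuonH-fro | MuonH-spec | SSO | FSO | MACRO-fro | MACRO-spec |
|-------|-------|-----------|------------|-------|-------|-----------|------------|
| 120M | 3.019 | 3.007 | 3.019 | 3.011 | 3.001 | 3.005 | 3.017 |
| 330M | 2.736 | 2.717 | 2.716 | 2.712 | 2.726 | 2.718 | 2.714 |
| 1B | 2.473 | 2.468 | 2.464 | -- | -- | 2.467 | 2.461 |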

In normalization-free settings (removing learnable RMSNorm at 330M, LR=MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}4), Muon diverges, while MACRO remains stable with validation losses near 2.76 (fro) and 2.74 (spec).

Further, gradient norms under MACRO decay smoothly (approximately 30× during training) and there is no late-stage blowup seen in AdamW+weight-decay. Zero-shot μP transfer shows optimal learning rates remain stable under MACRO when scaling width. Tangent-space projection residuals remain low (MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}5–MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}6), and performance is comparable to double-loop methods despite its single-loop efficiency (An et al., 6 May 2026).

7. Implementation Insights and Recommendations

MACRO is best suited to scenarios requiring rigorous, geometry-aware constraints for stability in deep or large LLM pre-training, especially when minimal tuning of RMSNorm and weight decay is desired. Implementation guidelines include:

  • Choose MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}7 via activation-bound theory: MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}8, MS(R)={W:∥W∥2=R}\mathcal{M}_S(R)=\{W : \|W\|_2 = R\}9 with â„“2\ell_20.
  • Set alignment â„“2\ell_21 (explore â„“2\ell_22).
  • Use â„“2\ell_23 or 0.9 for momentum.
  • Approximate spectral retraction by normalizing with â„“2\ell_24.
  • Monitor gradient norms and NaN incidence.

Possible pitfalls include setting $R$ too small (leading to under-capacity) or too large (leading to loss of scale control). Approximate spectral retractions can also violate strict compactness in rare cases of nearly repeated singular values. Omitting the tangent-space projection step negates the Riemannian guarantees, and the step is inexpensive to include.
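One common way to approximate the spectral norm without a full SVD is power iteration. The sketch below pairs it with a simple finiteness and gradient-norm check. Power iteration is an assumption here, since the source does not state which estimator it recommends, and `spectral_norm_power_iter`, `is_healthy`, and the threshold `max_norm` are hypothetical. Convergence slows when the top two singular values are close, which is the near-repeated-singular-value case flagged above.

```python
import numpy as np

def spectral_norm_power_iter(W, iters=50, seed=0):
    """Estimate the largest singular value of W by alternating power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def is_healthy(G, max_norm=1e4):
    """True if G has no NaN/Inf entries and its Frobenius norm is below max_norm."""
    return bool(np.all(np.isfinite(G)) and np.linalg.norm(G) < max_norm)

# Test matrix with known singular values 3, 1, 0.5 (well-separated top value).
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((6, 3)))
V, _ = np.linalg.qr(rng.standard_normal((4, 3)))
W = U @ np.diag([3.0, 1.0, 0.5]) @ V.T

sigma_est = spectral_norm_power_iter(W)
R = 2.0
W_retracted = R * W / sigma_est               # approximate spectral retraction
retracted_norm = float(np.linalg.norm(W_retracted, ord=2))
```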

In summary, MACRO delivers a unified, geometry-centric approach to LLM pre-training optimization, obviating the need for extensive heuristic tuning of normalization and weight regularization, and achieves strong stability and competitive perplexity in practice (An et al., 6 May 2026).
