
Column-Normalized Adam (Conda)

Updated 30 September 2025
  • Column-Normalized Adam (Conda) is a stochastic optimization algorithm designed for large language model pre-training that combines Adam’s per-coordinate adaptivity with spectral projection for improved conditioning.
  • The method applies column-wise second moment normalization to balance per-coordinate adaptivity with efficient spectral conditioning, leading to faster convergence and lower validation perplexity.
  • Empirical results demonstrate that Conda achieves up to 2.5x faster convergence, offering robust training stability and enhanced generalization for transformer-based architectures.

Column-Normalized Adam (Conda) is a stochastic optimization algorithm specifically designed for efficient pre-training of LLMs. It integrates the per-coordinate adaptivity of Adam with spectral conditioning techniques introduced in Muon, yielding a column-wise normalization strategy that accelerates convergence, improves update conditioning, and retains fine-grained scaling crucial for transformer architectures. Conda has demonstrated notably faster convergence and superior generalization compared to AdamW and global spectral normalization methods, achieving a 2–2.5× speedup in training steps and wall-clock time for the LLaMA and GPT-2 model series (Wang et al., 29 Sep 2025). The following sections provide a technical exposition of the methodology, theoretical background, empirical results, relevant connections, and anticipated research directions.

1. Motivation and Theoretical Underpinnings

The traditional Adam optimizer adapts the learning rate for each parameter element using running averages of first and second moments of the gradients. This mechanism is highly effective in diverse settings due to its robustness and coordinate-wise adaptivity. However, analyses of Adam’s dynamics in transformer-style architectures surfaced two detrimental phenomena:

  • Poor Spectral Conditioning: Adam’s update matrices are often highly anisotropic and exhibit low-rank structures. This leads to slow updating of certain spectral subspaces, impeding optimization efficiency, particularly in LLMs.
  • Imbalanced Update Distribution: Adam's element-wise normalization sometimes produces aggressive scaling in directions of small gradient variance, which can destabilize training and exacerbate loss surface ill-conditioning.

Muon addresses these by performing a global spectral normalization through orthogonal projections, flattening singular values of the update matrix. While this ensures well-conditioned updates, it forfeits Adam's granular adaptation. Conda bridges these techniques by (1) projecting updates into an orthogonal subspace to control spectral conditioning and (2) applying column-wise second moment normalization to preserve adaptivity for each weight matrix column. This design targets the efficient exploration and optimization of the full parameter space of large-scale transformers.
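
To make this contrast concrete, the following toy numpy comparison (purely illustrative and independent of the Conda implementation; a single-step second moment stands in for Adam's running average) shows how element-wise normalization collapses toward a sign-like update, whereas a shared per-column normalizer rescales each column adaptively while preserving the relative structure within it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy gradient whose columns have very different scales.
G = rng.normal(size=(4, 3)) * np.array([1.0, 0.1, 0.01])
eps = 1e-8

# Element-wise (Adam-like, single-step proxy for the running second moment):
# each entry is divided by its own magnitude, so the update degenerates to
# sign(G) and low-variance directions are scaled up aggressively.
update_elementwise = G / (np.sqrt(G**2) + eps)

# Column-wise: one normalizer per column (root-mean-square of that column),
# so within-column structure is preserved while each column is still rescaled.
col_rms = np.sqrt((G**2).mean(axis=0, keepdims=True))
update_columnwise = G / (col_rms + eps)

print(np.round(update_elementwise, 3))
print(np.round(update_columnwise, 3))
```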

2. Algorithmic Formulation

The Conda update can be summarized as follows for a parameter matrix W with gradient G:

  • First Moment Update (Adam-style):

M_t = \beta_1 M_{t-1} + (1 - \beta_1) G_t

  • Spectral Projection:

Singular Value Decomposition (SVD) is performed on M_t:

U_t, \Sigma_t, V_t^\top = \text{SVD}(M_t)

where U_t spans the leading singular vectors.

  • Column-Wise Second Moment Update:

The gradient is projected into the U_t subspace, and the second moment is accumulated per column:

N_t = \beta_2 N_{t-1} + (1-\beta_2) (U_t^\top G_t)^2

  • Parameter Update:

The effective update in the original space is constructed as

W_t = W_{t-1} - \eta \, U_t \left( \frac{U_t^\top M_t}{\sqrt{N_t}} \right)

The denominator, \sqrt{N_t}, is computed for each projected column, yielding normalization that balances per-direction adaptivity against aggressive global scaling.

Conda’s normalization ensures that updates are (a) spectrally improved via projection, (b) fine-tuned per column to reflect moment dynamics, and (c) robust to singular value collapse or overscaling that can occur in Muon.
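
A minimal single-matrix sketch of one such step is given below. It follows the formulas above literally, adds a small eps for numerical stability, omits bias correction and weight decay, and uses a full-rank SVD basis; it is not the authors' reference implementation, and details such as the exact column-wise aggregation of N_t, the retained rank, and the projection schedule may differ in practice.

```python
import numpy as np

def conda_step(W, G, M, N, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Conda update for a single weight matrix W of shape (m, n).

    M : first-moment estimate, shape (m, n)
    N : second-moment estimate in the projected space, shape (r, n), r = min(m, n)
    """
    # 1. Adam-style first moment.
    M = beta1 * M + (1.0 - beta1) * G

    # 2. Orthogonal projection basis from the SVD of the momentum.
    U, _, _ = np.linalg.svd(M, full_matrices=False)      # U: (m, r)

    # 3. Second moment of the projected gradient.
    PG = U.T @ G                                         # (r, n)
    N = beta2 * N + (1.0 - beta2) * PG**2

    # 4. Normalize the projected momentum column-wise and map back to the original space.
    PM = U.T @ M                                         # (r, n)
    W = W - lr * U @ (PM / (np.sqrt(N) + eps))
    return W, M, N

# Toy usage on a random 8x4 parameter matrix with placeholder gradients.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
M = np.zeros_like(W)
N = np.zeros((4, 4))
for _ in range(3):
    G = rng.normal(size=W.shape)
    W, M, N = conda_step(W, G, M, N)
```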

3. Convergence Properties and Conditioning

Theoretical work on Adam variants (Chen et al., 2021, Guo et al., 2021, Gould et al., 8 Nov 2024, Dereich et al., 28 Apr 2025) establishes that translating adaptivity and normalization from coordinate-wise to column-wise or block-wise structures enhances optimizer robustness and convergence. In particular:

  • Column-wise normalization reduces the effective Hessian condition number in problem subspaces, leading to an accelerated local convergence rate for Adam-like methods:

\rho_{\mathrm{Adam,\,normalized}} = \frac{\sqrt{\kappa_{\mathrm{eff}}} - 1}{\sqrt{\kappa_{\mathrm{eff}}} + 1}

where \kappa_{\mathrm{eff}} denotes the effective, column-wise conditioned ratio of extreme Hessian eigenvalues (Dereich et al., 28 Apr 2025).

  • Using projected second moments, as in Conda, further mitigates excessive adaptivity in ill-posed directions, a phenomenon documented in SVD-based preconditioner diagonalization (Nguyen et al., 11 Feb 2025). This aligns well with the variance bounding conditions proven necessary for scaled Adam convergence (Guo et al., 2021).
  • Spectral projection maintains bounded update norms and avoids the exponential growth witnessed when hyperparameters fall outside the stability region characterized by

C(\beta, \gamma) = \frac{2\beta(1-\gamma) - \gamma(1-\beta)}{\beta\gamma} > 0

(Gould et al., 8 Nov 2024); both expressions are evaluated numerically in the short sketch after this list.
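
The sketch below prints the local rate for a few effective condition numbers and the stability margin for one (β, γ) pair; the values are illustrative only and are not taken from any of the cited papers.

```python
import math

def local_rate(kappa_eff: float) -> float:
    """rho = (sqrt(kappa) - 1) / (sqrt(kappa) + 1); smaller means faster local convergence."""
    s = math.sqrt(kappa_eff)
    return (s - 1.0) / (s + 1.0)

def stability_margin(beta: float, gamma: float) -> float:
    """C(beta, gamma) = (2*beta*(1 - gamma) - gamma*(1 - beta)) / (beta * gamma)."""
    return (2.0 * beta * (1.0 - gamma) - gamma * (1.0 - beta)) / (beta * gamma)

# Better conditioning (smaller kappa_eff) shrinks rho, i.e., faster local convergence.
for kappa in (1e4, 1e2, 1e1):
    print(f"kappa_eff = {kappa:8.0f}  ->  rho = {local_rate(kappa):.4f}")

# Hyperparameters inside the stability region satisfy C > 0 (example values are arbitrary).
print("C(0.9, 0.5) =", round(stability_margin(0.9, 0.5), 3))   # positive -> inside the region
```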

4. Empirical Performance and Applications

Experimental evaluation of Conda across the LLaMA and GPT-2 model series demonstrates superior performance:

  • Convergence Speed: Conda achieves 2–2.5× faster convergence than AdamW on models ranging from 60M to 1B parameters. The improvement is consistent across both training steps and total wall-clock time, verified by validation perplexity curves (Wang et al., 29 Sep 2025).
  • Optimization Quality: Models trained with Conda reach lower validation perplexity and higher downstream accuracy on math reasoning and commonsense tasks compared to AdamW, Muon, Adafactor, and SOAP.
  • Training Stability: Ablation studies reveal robust behavior to hyperparameter scaling, sequence length, memory budgeting, and projection schedule, establishing Conda’s suitability for pre-training and fine-tuning large transformer-based architectures.

A summary comparison table illustrates the key empirical outcomes:

| Optimizer | Convergence Speed (LLMs) | Validation Perplexity | Stability (Diverse Configs) |
|-----------|--------------------------|-----------------------|-----------------------------|
| AdamW     | Baseline                 | Higher                | Moderate                    |
| Muon      | Fast (global)            | Lower                 | Less adaptive               |
| SOAP      | Moderate                 | Competitive           | Variable                    |
| Conda     | 2–2.5× (vs. AdamW)       | Lowest                | High                        |

5. Relation to Preconditioner Diagonalization and Isometric Optimization

Preconditioner diagonalization (Nguyen et al., 11 Feb 2025) and isometric optimizers (Jackson, 2023) incorporate orthogonal transformations and full-matrix normalization to enhance conditioning and decouple update magnitudes from undesirable parameter correlations. Conda builds on these principles by restricting SVD projection to columns, offering a computationally tractable mechanism for column-wise normalization, while retaining the expressivity and adaptivity necessary for transformer models.

Isometric optimizers enforce invariance to linear transformations, ensuring the Frobenius norm of updates is decoupled from input and gradient scaling. Conda leverages this by selecting the orthogonal projection basis via SVD, aligning updates to principal axes, and executing normalization within these axes—yielding balanced exploration and improved sample efficiency in highly overparameterized networks.
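
The invariance appealed to here is a general property of orthogonal maps rather than anything specific to Conda or the cited optimizers; the short check below verifies that expressing a matrix in the orthonormal basis obtained from its own SVD, or applying an orthogonal change of basis, leaves its Frobenius norm unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))

# Orthonormal basis from the SVD of M (as Conda does for its momentum).
U, _, _ = np.linalg.svd(M, full_matrices=False)   # columns of U are orthonormal

P = U.T @ M                                       # coordinates of M in the U basis
print(np.linalg.norm(M), np.linalg.norm(P))       # Frobenius norms agree

# An orthogonal change of basis also leaves the norm unchanged, which is the
# decoupling from input/gradient scaling that isometric optimizers target.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))      # random orthogonal matrix
print(np.linalg.norm(Q @ M))                      # equals ||M||_F
```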

6. Practical Deployment Considerations

For large-scale LLM pre-training, the computational cost of SVD can be substantial. Conda addresses this by

  • Applying SVD projection only periodically, thus amortizing its cost across steps (sketched below)
  • Using efficient subspace estimation techniques, such as truncated SVD or lazy evaluations
  • Ensuring compatibility with distributed and memory-efficient settings (e.g., Adafactor integration by projecting before low-rank factorization)

These strategies maintain a low overall compute overhead (typically under 10%) relative to AdamW, while delivering significant gains in convergence speed.
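
The first strategy can be sketched as follows, assuming a hypothetical update_interval hyperparameter, a full SVD at each refresh (a truncated or randomized SVD could substitute), and a reset of the projected second moment whenever the basis changes; none of these specific choices are taken from the paper.

```python
import numpy as np

def conda_loop_sketch(W, grad_fn, steps, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, update_interval=100):
    """Illustrative training loop: the projection basis U is refreshed only every
    `update_interval` steps and reused in between, amortizing the SVD cost."""
    M = np.zeros_like(W)
    U, N = None, None
    for t in range(steps):
        G = grad_fn(W)
        M = beta1 * M + (1.0 - beta1) * G
        if U is None or t % update_interval == 0:
            # Periodic refresh: one SVD every `update_interval` steps.
            U, _, _ = np.linalg.svd(M, full_matrices=False)
            # Reset the projected second moment (one possible choice; state
            # could instead be carried across refreshes).
            N = np.zeros((U.shape[1], W.shape[1]))
        PG = U.T @ G
        N = beta2 * N + (1.0 - beta2) * PG**2
        W = W - lr * U @ ((U.T @ M) / (np.sqrt(N) + eps))
    return W

# Toy usage: quadratic objective 0.5 * ||W - W_star||^2, so grad_fn(W) = W - W_star.
rng = np.random.default_rng(0)
W_star = rng.normal(size=(8, 4))
W_final = conda_loop_sketch(np.zeros((8, 4)), lambda W: W - W_star, steps=500)
print(np.linalg.norm(W_final - W_star))            # remaining distance (illustrative)
```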

7. Future Directions and Open Challenges

Potential avenues for continued development of Conda include

  • Scaling projection-based normalization to models exceeding 13B parameters and to mixture-of-experts architectures
  • Algorithmic enhancement of second-moment estimation via alternative subspace projections or normalization stacking (cf. k-Adam (Gould et al., 8 Nov 2024))
  • Computational optimizations in SVD computation for high-throughput distributed training
  • Theoretical analysis of the interplay between spectral normalization and generalization, especially in complex transformer topologies

Empirical and theoretical evidence to date supports Conda as a robust, efficient optimizer for large-scale LLM training, with significant implications for the computational cost and quality of future neural architectures.
