Column-Normalized Adam (Conda)
- Column-Normalized Adam (Conda) is a stochastic optimization algorithm designed for large language model pre-training that combines Adam’s per-coordinate adaptivity with spectral projection for improved conditioning.
- The method applies column-wise second-moment normalization within a spectrally projected subspace, balancing fine-grained adaptivity with well-conditioned scaling and leading to faster convergence and lower validation perplexity.
- Empirical results demonstrate that Conda achieves up to 2.5x faster convergence, offering robust training stability and enhanced generalization for transformer-based architectures.
Column-Normalized Adam (Conda) is a stochastic optimization algorithm specifically designed for efficient pre-training of LLMs. It integrates the per-coordinate adaptivity of Adam with the spectral conditioning techniques introduced in Muon, yielding a column-wise normalization strategy that accelerates convergence, improves update conditioning, and retains the fine-grained scaling crucial for transformer architectures. Conda has demonstrated notably faster convergence and superior generalization compared to AdamW and global spectral normalization methods, achieving up to a 2.5x speedup in both training steps and wall-clock time on the LLaMA and GPT-2 model series (Wang et al., 29 Sep 2025). The following sections provide a technical exposition of the methodology, theoretical background, empirical results, relevant connections, and anticipated research directions.
1. Motivation and Theoretical Underpinnings
The traditional Adam optimizer adapts the learning rate for each parameter element using running averages of first and second moments of the gradients. This mechanism is highly effective in diverse settings due to its robustness and coordinate-wise adaptivity. However, analyses of Adam’s dynamics in transformer-style architectures surfaced two detrimental phenomena:
- Poor Spectral Conditioning: Adam’s update matrices are often highly anisotropic and exhibit low-rank structures. This leads to slow updating of certain spectral subspaces, impeding optimization efficiency, particularly in LLMs.
- Imbalanced Update Distribution: Adam's element-wise normalization sometimes produces aggressive scaling in directions of small gradient variance, which can destabilize training and exacerbate loss surface ill-conditioning.
Muon addresses these by performing a global spectral normalization through orthogonal projections, flattening singular values of the update matrix. While this ensures well-conditioned updates, it forfeits Adam's granular adaptation. Conda bridges these techniques by (1) projecting updates into an orthogonal subspace to control spectral conditioning and (2) applying column-wise second moment normalization to preserve adaptivity for each weight matrix column. This design targets the efficient exploration and optimization of the full parameter space of large-scale transformers.
2. Algorithmic Formulation
The Conda update can be summarized as follows for a parameter matrix $W_t \in \mathbb{R}^{m \times n}$ with gradient $G_t = \nabla f(W_t)$:
- First Moment Update (Adam-style):
$$M_t = \beta_1 M_{t-1} + (1 - \beta_1)\, G_t$$
- Spectral Projection:
Singular Value Decomposition (SVD) is performed on the first-moment matrix $M_t$:
$$M_t = U_t \Sigma_t V_t^\top,$$
where $U_t$ spans the leading singular vectors and serves as the orthogonal projection basis.
- Column-Wise Second Moment Update:
The gradient is projected into the subspace, and the second moment is accumulated per column:
$$\tilde{G}_t = U_t^\top G_t, \qquad v_{t,j} = \beta_2\, v_{t-1,j} + (1 - \beta_2)\, \big\|(\tilde{G}_t)_{:,j}\big\|_2^2, \quad j = 1, \dots, n$$
- Parameter Update:
The effective update in the original space is constructed as
$$W_{t+1} = W_t - \eta\, U_t \left(\frac{U_t^\top M_t}{\sqrt{v_t} + \epsilon}\right),$$
with the division applied column-wise. The denominator, $\sqrt{v_{t,j}} + \epsilon$, is computed for each projected column $j$, yielding normalization that balances per-direction adaptivity against aggressive global scaling.
Conda’s normalization ensures that updates are (a) spectrally well-conditioned via projection, (b) adapted per column to reflect second-moment dynamics, and (c) robust to the singular value collapse or overscaling that can occur in Muon.
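A minimal PyTorch sketch of a single Conda-style step for one weight matrix is given below, assuming the formulation above. It is illustrative rather than a reference implementation: the refresh interval, the optional rank truncation, the squared-column-norm second moment, and the omission of bias correction are simplifying assumptions.

```python
import torch

def conda_step(W, grad, M, v, U, step, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, rank=None, refresh_every=10):
    """One illustrative Conda-style update for a 2-D parameter W (m x n).

    M: first-moment matrix (m x n); v: per-column second moments (n,);
    U: current projection basis (left singular vectors of M), or None.
    Bias correction is omitted for brevity.
    """
    # Adam-style first moment in the original parameter space
    M = beta1 * M + (1 - beta1) * grad

    # Periodically refresh the projection basis from the SVD of the momentum
    if U is None or step % refresh_every == 0:
        U_full, _, _ = torch.linalg.svd(M, full_matrices=False)
        U = U_full[:, :rank] if rank is not None else U_full

    # Project the raw gradient and accumulate the second moment per column
    G_proj = U.T @ grad                      # shape (k, n)
    v = beta2 * v + (1 - beta2) * (G_proj ** 2).sum(dim=0)

    # Column-normalized update, mapped back to the original space
    update = U @ ((U.T @ M) / (v.sqrt() + eps))   # division broadcasts per column
    return W - lr * update, M, v, U
```

The per-column broadcast in the denominator corresponds to the $\sqrt{v_{t,j}} + \epsilon$ term above; in a full optimizer this step would be applied independently to each 2-D parameter, with the state held by the optimizer rather than passed explicitly.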
3. Convergence Properties and Conditioning
Theoretical work on Adam variants (Chen et al., 2021, Guo et al., 2021, Gould et al., 8 Nov 2024, Dereich et al., 28 Apr 2025) establishes that translating adaptivity and normalization from coordinate-wise to column-wise or block-wise structures enhances optimizer robustness and convergence. In particular:
- Column-wise normalization reduces the effective Hessian condition number in problem subspaces, leading to an accelerated local convergence rate for Adam-like methods on the order of $\mathcal{O}\big((1 - \kappa_{\mathrm{col}}^{-1})^t\big)$, where $\kappa_{\mathrm{col}}$ denotes the column-wise ratio of the largest to smallest Hessian eigenvalues (Dereich et al., 28 Apr 2025).
- Using projected second moments, as in Conda, further mitigates excessive adaptivity in ill-posed directions, a phenomenon documented in SVD-based preconditioner diagonalization (Nguyen et al., 11 Feb 2025). This aligns well with the variance bounding conditions proven necessary for scaled Adam convergence (Guo et al., 2021).
- Spectral projection maintains bounded update norms and avoids the exponential growth observed when the momentum hyperparameters fall outside the stability region characterized in prior Adam analyses (e.g., $\beta_1^2 < \beta_2$).
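The conditioning claim can be illustrated with a toy example (not an experiment from the cited works): a quadratic whose Hessian is ill-conditioned only because its coordinates live on very different scales becomes well conditioned after the diagonal, column-wise rescaling that an Adam-style second-moment denominator effectively applies.

```python
import torch

torch.manual_seed(0)

# Toy quadratic whose Hessian H = A^T A is ill-conditioned purely because the
# coordinates (columns of A) live on very different scales.
scales = torch.tensor([1e-3, 1e-2, 0.1, 1.0, 1.0, 10.0, 100.0, 1e3], dtype=torch.float64)
A = torch.randn(200, 8, dtype=torch.float64) * scales
H = A.T @ A
print(f"cond(H) before column scaling: {torch.linalg.cond(H).item():.2e}")

# Dividing each coordinate by the square root of its second moment is the
# diagonal (column-wise) preconditioning an Adam-style denominator applies.
d = (A ** 2).mean(dim=0)            # per-column second moments
P = torch.diag(d.rsqrt())           # D^{-1/2}
print(f"cond(H) after column scaling:  {torch.linalg.cond(P @ H @ P).item():.2e}")
```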
4. Empirical Performance and Applications
Experimental evaluation of Conda across the LLaMA and GPT-2 model series demonstrates superior performance:
- Convergence Speed: Conda achieves up to 2.5x faster convergence than AdamW on models ranging from 60M to 1B parameters. The improvement is consistent across both training steps and total wall-clock time, as verified by validation perplexity curves (Wang et al., 29 Sep 2025).
- Optimization Quality: Models trained with Conda reach lower validation perplexity and higher downstream accuracy on math reasoning and commonsense tasks compared to AdamW, Muon, Adafactor, and SOAP.
- Training Stability: Ablation studies show robustness to hyperparameter scaling, sequence length, memory budget, and projection schedule, establishing Conda's suitability for pre-training and fine-tuning large transformer-based architectures.
A summary comparison table illustrates the key empirical outcomes:
| Optimizer | Convergence Speed (LLMs) | Validation Perplexity | Stability (Diverse Configs) |
|---|---|---|---|
| AdamW | Baseline | Higher | Moderate |
| Muon | Fast (global) | Lower | Less adaptive |
| SOAP | Moderate | Competitive | Variable |
| Conda | Fastest | Lowest | High |
5. Relation to Preconditioner Diagonalization and Isometric Optimization
Preconditioner diagonalization (Nguyen et al., 11 Feb 2025) and isometric optimizers (Jackson, 2023) incorporate orthogonal transformations and full-matrix normalization to enhance conditioning and decouple update magnitudes from undesirable parameter correlations. Conda builds on these principles by restricting normalization in the SVD-projected basis to individual columns, offering a computationally tractable alternative to full-matrix preconditioning while retaining the expressivity and adaptivity necessary for transformer models.
Isometric optimizers enforce invariance to linear transformations, ensuring the Frobenius norm of updates is decoupled from input and gradient scaling. Conda leverages this by selecting the orthogonal projection basis via SVD, aligning updates to principal axes, and executing normalization within these axes—yielding balanced exploration and improved sample efficiency in highly overparameterized networks.
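As a small numerical check of this decoupling (using the same simplified single-step formulation as the sketch in Section 2, with zero initial moments and no bias correction), rescaling the gradient by a large constant leaves the column-normalized update direction essentially unchanged:

```python
import torch

torch.manual_seed(0)
G = torch.randn(64, 32, dtype=torch.float64)
eps = 1e-8

def single_step_direction(grad):
    """One simplified Conda-style step from zero initial moments (bias correction omitted)."""
    M = grad                                   # first moment after one step
    U, _, _ = torch.linalg.svd(M, full_matrices=False)
    G_proj = U.T @ grad                        # gradient in the projected basis
    v = (G_proj ** 2).sum(dim=0)               # per-column second moment
    return U @ ((U.T @ M) / (v.sqrt() + eps))  # column-normalized direction

u_small = single_step_direction(G)
u_large = single_step_direction(1e3 * G)       # gradient rescaled by 1000x
print(torch.allclose(u_small, u_large, atol=1e-6))  # True: direction is (almost) unchanged
```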
6. Practical Deployment Considerations
For large-scale LLM pre-training, the computational cost of SVD can be substantial. Conda addresses this by
- Applying SVD projection only periodically, thus amortizing cost across steps
- Using efficient subspace estimation techniques, such as truncated SVD or lazy evaluations
- Ensuring compatibility with distributed and memory-efficient settings (e.g., Adafactor integration by projecting before low-rank factorization)
These strategies maintain a low overall compute overhead (typically under 10%) relative to AdamW, while delivering significant gains in convergence speed.
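A hedged sketch of the amortized basis refresh is shown below; the refresh interval, rank, and use of torch.svd_lowrank are illustrative choices rather than settings prescribed in the paper.

```python
import torch

def maybe_refresh_basis(M, U_prev, step, refresh_every=50, rank=64):
    """Amortized projection-basis refresh for a momentum matrix M (illustrative).

    Recomputing the basis only every `refresh_every` steps, and using a
    randomized truncated SVD instead of a full decomposition, keeps the
    spectral-projection overhead small relative to the matmul cost of the
    update itself."""
    if U_prev is not None and step % refresh_every != 0:
        return U_prev                            # reuse the stale basis between refreshes
    q = min(rank, *M.shape)
    U, _, _ = torch.svd_lowrank(M, q=q, niter=2) # randomized truncated SVD
    return U
```

Between refreshes the (possibly stale) basis is reused while the per-column second-moment statistics continue to be updated every step, which is what amortizes the SVD cost across many optimizer steps.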
7. Future Directions and Open Challenges
Potential avenues for continued development of Conda include
- Scaling projection-based normalization to models exceeding 13B parameters and expert mixture architectures
- Algorithmic enhancement of second-moment estimation via alternative subspace projection or normalization stacking (cf. -Adam (Gould et al., 8 Nov 2024))
- Computational optimizations in SVD computation for high-throughput distributed training
- Theoretical analysis of the interplay between spectral normalization and generalization, especially in complex transformer topologies
Empirical and theoretical evidence to date supports Conda as a robust, efficient optimizer for large-scale LLM training, with significant implications for the computational cost and quality of future neural architectures.