Adafactor Algorithm
- Adafactor is an adaptive stochastic optimization algorithm that uses factored second-moment statistics to reduce memory overhead while delivering benefits similar to Adam and RMSProp.
- It factorizes full gradient second-moment estimators into row and column statistics, lowering storage requirements from O(nm) to O(n+m) for matrix parameters.
- Additional mechanisms like update clipping, dynamic decay schedules, and scale-invariant step sizes promote training stability and efficient large model optimization.
Adafactor is an adaptive stochastic optimization algorithm designed for training large-scale neural networks with significantly reduced auxiliary memory requirements. It achieves the benefits of second-moment adaptive learning rates, similar to Adam and RMSProp, while using a factored approximation for second-moment statistics that reduces optimizer state storage from $O(nm)$ to $O(n+m)$ for an $n \times m$ matrix-shaped parameter tensor. This memory efficiency enables the deployment and training of large models where traditional optimizers may be impractical due to optimizer state memory overhead. Adafactor is further augmented by mechanisms for update clipping, schedule-controlled second-moment decay rates, and scale-invariant step sizes.
1. Factored Second-Moment Estimation
The central innovation of Adafactor is the factorization of second-moment statistics for matrix-shaped parameters. Instead of maintaining a full per-parameter matrix $V_t \in \mathbb{R}^{n \times m}$, the exponential moving average of the squared gradients $G_t^2$, Adafactor only tracks:
- $R_t \in \mathbb{R}^{n \times 1}$: running average of the row sums of the squared gradients.
- $C_t \in \mathbb{R}^{1 \times m}$: running average of the column sums of the squared gradients.
The full second-moment estimator is then approximated per parameter as
$$\hat{V}_t = \frac{R_t C_t}{\mathbf{1}_n^\top R_t}.$$
This estimator is optimal under generalized Kullback-Leibler divergence minimization for the setting in which only row and column sums are available. For general $d$-dimensional tensors, the factorization is performed along multiple axes, maintaining one accumulator per axis.
The updates of $R_t$ and $C_t$ at each step are given by
\begin{align*}
R_t &= \beta_2 R_{t-1} + (1 - \beta_2)\,(G_t^2 \cdot \mathbf{1}_m) \\
C_t &= \beta_2 C_{t-1} + (1 - \beta_2)\,(\mathbf{1}_n^\top \cdot G_t^2)
\end{align*}
where $G_t$ is the gradient matrix at step $t$, $G_t^2$ denotes its elementwise square, $\beta_2$ is the second-moment decay rate, and $\mathbf{1}_n$, $\mathbf{1}_m$ are all-ones vectors.
For vector (or scalar) parameters, Adafactor falls back to standard per-parameter second-moment tracking.
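To make the factored bookkeeping concrete, the following NumPy sketch performs one update of $R_t$ and $C_t$ and reconstructs $\hat{V}_t$ from them. Variable names, the zero initialization, and the placement of the stabilizing constant are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def factored_second_moment_step(G, R, C, beta2, eps=1e-30):
    """One update of Adafactor-style factored second-moment statistics (sketch)."""
    G2 = G ** 2 + eps                                 # elementwise squared gradient
    R = beta2 * R + (1.0 - beta2) * G2.sum(axis=1)    # row sums    -> shape (n,)
    C = beta2 * C + (1.0 - beta2) * G2.sum(axis=0)    # column sums -> shape (m,)
    # Rank-1 reconstruction of the full n x m second-moment matrix:
    V_hat = np.outer(R, C) / R.sum()
    return R, C, V_hat

# The optimizer only persists R (n values) and C (m values); V_hat is formed
# transiently when scaling the update.
n, m = 4, 3
G = np.random.default_rng(0).normal(size=(n, m))
R, C, V_hat = factored_second_moment_step(G, np.zeros(n), np.zeros(m), beta2=0.999)
```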
2. Memory Efficiency Considerations
Adaptive optimizers such as Adam require maintaining auxiliary tensors of the same shape as each parameter tensor (typically, running averages of first and second moment statistics), multiplying memory usage for optimizer state. In models with large embedding or dense layers, these auxiliary storage requirements can be prohibitive.
Adafactor’s factored representation for second-moment statistics reduces auxiliary storage for each parameter matrix from $O(nm)$ to $O(n+m)$. This property is critical in scaling to very large Transformer-based networks and other architectures with massive matrix parameters. When storing only factored second moments and omitting momentum (i.e., first-moment accumulation), Adafactor may require orders of magnitude less memory for optimizer state compared to Adam.
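A back-of-the-envelope comparison illustrates the scaling. The layer shape below (a 32,000 × 1,024 embedding matrix in float32) is an assumed example, not taken from the references above.

```python
# Optimizer-state memory for a single 32,000 x 1,024 float32 weight matrix.
n, m = 32_000, 1_024
bytes_per_float = 4

adam_state = 2 * n * m * bytes_per_float        # first + second moments, full tensors
adafactor_state = (n + m) * bytes_per_float     # factored second moment only, no momentum

print(f"Adam optimizer state:      {adam_state / 1e6:.1f} MB")       # ~262 MB
print(f"Adafactor optimizer state: {adafactor_state / 1e6:.3f} MB")  # ~0.13 MB
```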
3. Update Mechanism, Step-Size Schedules, and Clipping
Parameter updates in Adafactor use the factored second-moment approximation for adaptive scaling of gradients. The core update for a parameter matrix $X_t$ is
$$X_t = X_{t-1} - \alpha_t \frac{G_t}{\sqrt{\hat{V}_t} + \epsilon},$$
where $\alpha_t$ is the step size and $\epsilon$ is a small constant for numerical stability.
Adafactor introduces additional key mechanisms:
- Relative step size scaling: The step size is set proportional to the RMS norm of the parameter $X_{t-1}$:
$$\alpha_t = \max\left(\epsilon_2, \mathrm{RMS}(X_{t-1})\right)\rho_t,$$
where $\rho_t$ is the relative step schedule and $\epsilon_2$ is a small positive constant.
- Update clipping: The raw update $U_t = G_t / \sqrt{\hat{V}_t}$ is clipped based on its RMS norm and a threshold $d$:
$$\hat{U}_t = \frac{U_t}{\max\left(1, \mathrm{RMS}(U_t)/d\right)},$$
to prevent transiently large parameter updates which can destabilize training, especially when second-moment statistics lag behind the instantaneous gradient magnitudes.
- Gradual second-moment decay schedule: Instead of a fixed $\beta_2$, Adafactor supports time-varying decay rates:
$$\hat{\beta}_{2t} = 1 - t^{-c}$$
for $0 < c < 1$, which improves adaptivity and stability at early and late training steps, and eliminates the need for bias correction.
By default, Adafactor may omit momentum (setting $\beta_1 = 0$), further reducing optimizer state, or optionally include first-moment tracking, yielding updates nearly identical to Adam modulo the factored second-moment approximation.
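The pieces above can be assembled into a single per-step routine. The NumPy sketch below combines the factored second-moment update, relative step sizing, update clipping, and the time-varying decay schedule for a momentum-free matrix parameter; the constants and the schedule $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ are assumed defaults for illustration, not a drop-in replacement for any library implementation.

```python
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, *, c=0.8, d=1.0, eps1=1e-30, eps2=1e-3):
    """One momentum-free Adafactor step for a matrix parameter X (sketch)."""
    rms = lambda M: np.sqrt(np.mean(M ** 2))

    beta2_t = 1.0 - t ** (-c)                   # gradual second-moment decay schedule
    rho_t = min(1e-2, 1.0 / np.sqrt(t))         # relative step schedule (assumed)
    alpha_t = max(eps2, rms(X)) * rho_t         # scale-invariant step size

    G2 = G ** 2 + eps1
    R = beta2_t * R + (1.0 - beta2_t) * G2.sum(axis=1)
    C = beta2_t * C + (1.0 - beta2_t) * G2.sum(axis=0)
    V_hat = np.outer(R, C) / R.sum()            # factored second-moment estimate

    U = G / np.sqrt(V_hat)                      # adaptively scaled raw update
    U = U / max(1.0, rms(U) / d)                # update clipping at RMS threshold d
    return X - alpha_t * U, R, C
```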
4. Stability and Empirical Performance
Extensive empirical evaluation on the Transformer model applied to the WMT 2014 English-German machine translation benchmark demonstrated:
- With standard learning rate schedules (e.g., linear warmup + inverse square-root decay), Adafactor matched Adam in BLEU score and convergence properties.
- Without a warmup, Adam without momentum was highly unstable and diverged, while Adafactor (with update clipping and factored second-moment estimation) maintained stability.
- The choice of the second-moment decay rate $\beta_2$ is critical: slow decay can cause out-of-date statistics and oversized updates; fast decay over-suppresses updates. Update clipping and dynamic decay schedules mitigate these effects.
Subsequent work (Zhao et al., 10 Jul 2024) confirms that in large-scale language model training, Adafactor and Adam perform comparably in terms of validation loss and hyperparameter robustness. The most pronounced difference is Adafactor’s memory footprint, which makes it preferable in resource-constrained settings.
5. Comparison to Adam and Related Methods
| Optimizer | Second-Moment Estimator Type | Memory Overhead | Empirical Performance |
|---|---|---|---|
| Adam | Per-parameter (full tensor) | 2× parameter size | Baseline |
| Adafactor | Factored (row/column for matrices) | $O(n+m)$ per matrix | Comparable to Adam |
| Adapprox (Zhao et al., 22 Mar 2024) | Adaptive randomized low-rank | Rank-dependent low-rank factors | Potentially higher accuracy with high-rank approximation |
Adafactor distinguishes itself from Adam by its factored second-moment storage. While this introduces a coarser approximation in the adaptive scaling directions, empirical studies indicate no substantial loss in convergence profile or final model quality for large NLP and translation tasks.
Recent optimizers such as Adapprox (Zhao et al., 22 Mar 2024) expand upon memory-efficient second-moment approximations by introducing randomized low-rank matrices and dynamic rank selection, achieving even greater memory savings when omitting first moment statistics. However, Adafactor’s rank-1 structure remains canonical for minimalistic optimizer state.
Hybrid approaches such as AdamW–Adafactor (Wortsman et al., 2023) incorporate Adafactor’s update clipping into the AdamW framework, adaptively dampening parameter updates during transient mismatches in gradient statistics, and outperforming traditional gradient norm clipping.
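As a rough illustration of how such a hybrid can be wired together, the sketch below grafts Adafactor-style RMS update clipping onto a standard AdamW step. It is a simplified stand-in; the clipping statistic used in the published AdamW–Adafactor formulation may differ in detail.

```python
import numpy as np

def adamw_step_with_update_clipping(X, G, m, v, t, *, lr=1e-3, beta1=0.9,
                                    beta2=0.999, eps=1e-8, weight_decay=1e-2, d=1.0):
    """AdamW-style step with Adafactor-inspired RMS update clipping (sketch)."""
    rms = lambda M: np.sqrt(np.mean(M ** 2))

    m = beta1 * m + (1 - beta1) * G
    v = beta2 * v + (1 - beta2) * G ** 2
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment

    U = m_hat / (np.sqrt(v_hat) + eps)          # raw Adam update direction
    U = U / max(1.0, rms(U) / d)                # Adafactor-style clipping by RMS
    X = X - lr * (U + weight_decay * X)         # decoupled weight decay
    return X, m, v
```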
6. Extensions, Connections, and Applications
- Preconditioner Diagonalization (Nguyen et al., 11 Feb 2025): Integrating Adafactor with periodic SVD-based rotation of the gradient space (AdafacDiag) yields nearly diagonal preconditioners, further improving adaptive moment estimation and convergence, at the cost of modest computational overhead.
- SOAP Optimizer (Vyas et al., 17 Sep 2024): Formal equivalence between Shampoo (higher-order preconditioning) and Adafactor in the rotated eigenbasis, leading to efficient hybrid optimizers that benefit from Adafactor’s diagonal preconditioning in the Shampoo eigenbasis.
- Layerwise Adaptivity (Zhao et al., 10 Jul 2024): Adaptivity in last-layer and LayerNorm parameters is necessary for maintaining performance and stability, suggesting hybrid application of Adafactor or similar methods to targeted network regions.
- Large-Scale Vision-Language Training (Wortsman et al., 2023): Adaptive update clipping from Adafactor stabilizes large-scale training under dynamic gradient statistics and quantization noise, particularly in architectures like CLIP ViT-Huge.
7. Practical Recommendations and Limitations
Adafactor is recommended in scenarios where memory consumption for optimizer state is a bottleneck, especially in architectures featuring large weight matrices (e.g., Transformer-based LLMs, vision architectures). Default hyperparameter choices (e.g., the relative step schedule $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ and the second-moment decay schedule $\hat{\beta}_{2t} = 1 - t^{-0.8}$) provide robust behavior and require minimal fine-tuning; a usage sketch follows the list below.
- When deployed without momentum, Adafactor achieves its minimal memory footprint, at the cost of forgoing first-moment (momentum) smoothing of updates.
- In hybrid approaches, update clipping and dynamic decay rates provide robust stability even in non-ideal learning rate schedules.
- Separation between matrix- and vector-shaped parameters is required for correct per-layer factored statistics.
- For extremely large models or heterogeneous computations, further reduction in memory or improvement in approximation may be obtained via randomized low-rank methods (Adapprox) or targeted adaptivity (e.g., SOAP, AdafacDiag).
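For readers who want to try these defaults directly, a minimal usage sketch with the Hugging Face transformers implementation is shown below; the keyword arguments reflect that library's interface and may vary across versions.

```python
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(1024, 1024)   # stand-in for a large Transformer layer

optimizer = Adafactor(
    model.parameters(),
    lr=None,               # let the relative-step schedule determine the step size
    relative_step=True,    # rho_t-style schedule instead of a fixed learning rate
    scale_parameter=True,  # scale steps by RMS of the parameter (eps_2 floor)
    warmup_init=True,      # smaller relative steps early in training
    beta1=None,            # omit momentum for the minimal memory footprint
    clip_threshold=1.0,    # Adafactor update-clipping threshold d
)

loss = model(torch.randn(8, 1024)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```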
Overall, Adafactor provides adaptive optimization performance similar to Adam, robust stability across a range of tasks and hyperparameter settings, and highly favorable memory scaling, especially in high-dimensional neural network training environments.