AdaMax: Robust Adaptive Optimization

Updated 6 April 2026
  • AdaMax is an adaptive optimization algorithm that leverages the L∞ norm for gradient normalization, ensuring robustness against bursty or sparse gradients.
  • It maintains per-parameter first moment estimates and uses an exponential moving maximum instead of the L2-norm, enhancing numerical stability.
  • AdaMax is particularly useful in situations with large gradient spikes, though its conservative update steps may lead to slower convergence compared to alternatives.

AdaMax is an adaptive stochastic optimization algorithm and a member of the Adam family of optimizers, characterized by the use of the exponentially weighted infinity norm (L∞-norm) to estimate gradient magnitude. It was introduced by Kingma and Ba in "Adam: A Method for Stochastic Optimization" (Kingma et al., 2014) and formalized as a robust alternative to Adam for scenarios with large, sparse, or “spiky” gradients. AdaMax maintains per-parameter first moment estimates and normalizes updates by the maximum absolute gradient in each coordinate, instead of the root-mean-square used by Adam.

1. Theoretical Motivation and Design

AdaMax replaces the second-moment (L₂-norm-based) scaling in Adam with an L∞-norm-style bound on the gradient. For a parameter vector $\theta$ and stochastic gradient $g_t$ at iteration $t$, Adam computes an exponentially decaying average of the squared gradients (the second moment) and normalizes updates by its square root. In contexts with heavy-tailed, sparse, or highly variable gradients, this L₂-based estimate is sensitive to outliers and large excursions, which can distort the update scaling so that progress collapses or stalls.

AdaMax addresses this by maintaining, for each parameter $i$, the exponential moving maximum:

$$u_{t,i} = \max(\beta_2 u_{t-1,i}, |g_{t,i}|)$$

where $\beta_2$ is a decay parameter and $u_{0,i} = 0$. This estimate is more robust to occasional bursts in gradient magnitude: when an outlier appears, the per-parameter normalization “remembers” the large value without skewing the overall scale for other updates (Kingma et al., 2014, Shulman, 2023, Ruder, 2016). This substitution results in greater numerical stability and reduced sensitivity to large, rare gradients.
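As a rough illustration of the recurrence above, the following sketch tracks the exponential moving maximum for a single coordinate. The gradient stream and the choice of `beta2=0.9` are hypothetical, picked only so that the decay after the spike is visible within a few steps.

```python
def moving_max(grads, beta2=0.999):
    """Exponential moving maximum: u_t = max(beta2 * u_{t-1}, |g_t|), with u_0 = 0."""
    u = 0.0
    history = []
    for g in grads:
        u = max(beta2 * u, abs(g))
        history.append(u)
    return history

# Hypothetical gradient stream for one coordinate: small values with a single spike.
grads = [0.1, -0.1, 10.0, 0.1, -0.1, 0.1]

# beta2=0.9 is an illustrative value, not a recommended setting.
u_hist = moving_max(grads, beta2=0.9)

for t, (g, u) in enumerate(zip(grads, u_hist), start=1):
    print(f"t={t}  g={g:+.2f}  u={u:.4f}")
```

The spike sets the estimate to 10 at the third step, after which it shrinks by a factor of `beta2` per step until a larger gradient arrives.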

2. Mathematical Formulation and Pseudocode

AdaMax operates as follows:

  • Let $\theta_t$ denote the parameter vector at step $t$.
  • $g_t = \nabla_{\theta} L(\theta_{t-1})$ is the gradient of the objective.
  • $m_t$ tracks the exponentially weighted first moment of the gradient:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

where $\beta_1$ is typically 0.9.

  • $u_t$ is the exponentially weighted infinity norm (L∞ estimate):

$$u_t = \max(\beta_2 u_{t-1}, |g_t|)$$

where $\beta_2$ is typically 0.999.

Only the first moment estimator is subject to bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

The AdaMax parameter update for each component $i$ is:

$$\theta_{t,i} = \theta_{t-1,i} - \alpha \frac{\hat{m}_{t,i}}{u_{t,i} + \epsilon}$$

where $\alpha$ is the learning rate (default $0.002$) and $\epsilon$ is a small constant for numerical stability (commonly $10^{-8}$) (Kingma et al., 2014, Shulman, 2023, Ruder, 2016).
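A short worked example may help show what the update does at the very first step. It assumes the standard AdaMax form with zero initialization ($m_0 = 0$, $u_0 = 0$), learning rate $\alpha$, first-moment decay $\beta_1$, and stability constant $\epsilon$ in the denominator; the symbols follow Kingma & Ba's notation.

```latex
\begin{align*}
m_1 &= \beta_1 \cdot 0 + (1 - \beta_1)\, g_1 = (1 - \beta_1)\, g_1 \\
\hat{m}_1 &= \frac{m_1}{1 - \beta_1^{1}} = g_1 \\
u_1 &= \max(\beta_2 \cdot 0,\ |g_1|) = |g_1| \\
\theta_{1,i} &= \theta_{0,i} - \alpha\, \frac{g_{1,i}}{|g_{1,i}| + \epsilon}
  \;\approx\; \theta_{0,i} - \alpha\, \operatorname{sign}(g_{1,i})
\end{align*}
```

Under these assumptions, every coordinate with a nonzero gradient moves by approximately $\alpha$ on the first step, regardless of the gradient's magnitude.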

Pseudocode:

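    Input: α, β₁, β₂, ε, initial parameters θ₀
    m₀ ← 0; u₀ ← 0; t ← 0
    while θ_t not converged:
        t ← t + 1
        g_t ← ∇_θ L(θ_{t−1})
        m_t ← β₁ · m_{t−1} + (1 − β₁) · g_t
        u_t ← max(β₂ · u_{t−1}, |g_t|)
        θ_t ← θ_{t−1} − (α / (1 − β₁^t)) · m_t / (u_t + ε)
    return θ_t

(Kingma et al., 2014, Shulman, 2023, Ruder, 2016)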
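The algorithm can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a library implementation: the function name `adamax`, the quadratic objective, and the settings `alpha=0.05` and `steps=2000` are hypothetical choices made so the example converges quickly on a toy problem.

```python
def adamax(grad_fn, theta, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimal AdaMax loop over a list of floats (illustrative sketch)."""
    theta = list(theta)
    m = [0.0] * len(theta)   # first-moment estimate
    u = [0.0] * len(theta)   # exponentially weighted infinity norm
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            u[i] = max(beta2 * u[i], abs(g_i))
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias correction applies to m only
            theta[i] -= alpha * m_hat / (u[i] + eps)
    return theta

# Hypothetical objective: f(x, y) = (x - 3)^2 + 10 * (y + 1)^2, minimized at (3, -1).
def grad_f(theta):
    x, y = theta
    return [2.0 * (x - 3.0), 20.0 * (y + 1.0)]

theta = adamax(grad_f, [0.0, 0.0], alpha=0.05, steps=2000)
print(theta)
```

The two coordinates have gradients of very different scale, yet both take steps of comparable size early on because each is normalized by its own running maximum.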

3. Hyperparameter Choices and Implementation Practices

The canonical parameter values recommended by Kingma & Ba and in subsequent surveys are:

  • Learning rate $\alpha = 0.002$
  • First moment decay $\beta_1 = 0.9$
  • Infinity-norm decay $\beta_2 = 0.999$
  • Numerical stability constant $\epsilon = 10^{-8}$

Practical tuning guidance:

  • If training is slow, increase $\alpha$; if it diverges, reduce $\alpha$
  • Higher $\beta_1$ smooths momentum but slows adaptation
  • Lower $\beta_2$ allows faster adaptation of $u_t$ to gradient bursts
  • $\epsilon$ is typically left at its default unless sparsity causes instability (Shulman, 2023, Ruder, 2016)
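To make the effect of the infinity-norm decay concrete: when no larger gradient arrives, a past peak in the running maximum shrinks by a factor of $\beta_2$ per step, so the number of steps needed to halve it is $\ln 0.5 / \ln \beta_2$. The sketch below evaluates this for a few values; the helper name `half_life` and the non-default values 0.99 and 0.9 are illustrative assumptions.

```python
import math

def half_life(beta):
    """Number of steps n such that beta**n == 0.5."""
    return math.log(0.5) / math.log(beta)

for beta2 in (0.999, 0.99, 0.9):
    print(f"beta2={beta2}: a past peak halves after about {half_life(beta2):.0f} steps")
```

With the default of 0.999, a spike influences the normalization for several hundred steps, which is one way to reason about whether a lower value is worth trying on bursty problems.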

4. Empirical Performance and Energy Efficiency

A broad empirical study on Apple M1 Pro hardware investigated AdaMax alongside Adam, SGD, AdamW, NAdam, and others using MNIST, CIFAR-10, and CIFAR-100 with modern CNNs and 15 random seeds per setting (Almog, 16 Sep 2025):

| Dataset | Final Accuracy | Training Duration (s) | CO₂ Emissions (kg) |
|---|---|---|---|
| MNIST | 98.00% ± 0.33% | 18.1 ± 4.6 | 1.2×10⁻⁶ ± 5.3×10⁻⁷ |
| CIFAR-10 | 66.53% ± 4.14% (top) | 114.7 ± 22.9 | 1.89×10⁻⁵ ± 3.7×10⁻⁶ |
| CIFAR-100 | 9.89% ± 5.11% | 82.6 ± 43.8 | 8.1×10⁻⁷ ± 4.5×10⁻⁷ |

Empirical findings include:

  • AdaMax generally produces top or near-top accuracy on moderate-complexity tasks (best on CIFAR-10).
  • Training time is among the longest, due to the more conservative steps imposed by the ∞-norm scaling.
  • CO₂ emissions are highest on tasks where long convergence times are required (notably in CIFAR-10); emission levels are middling for simpler tasks.
  • Efficiency (accuracy per time or kg of emissions) is lower than AdamW, NAdam, or even SGD in most regimes.

The results indicate that although AdaMax is robust and often reaches high accuracy, its environmental footprint and runtime should be weighed in practice (Almog, 16 Sep 2025).
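One simple way to read the table is to compute accuracy per unit of time and per unit of emissions from the reported means. The ratios below are an assumed, illustrative definition of efficiency (the study's exact metric is not reproduced here), and they ignore the reported standard deviations.

```python
# AdaMax means copied from the table above: (accuracy %, duration s, CO2 kg).
results = {
    "MNIST": (98.00, 18.1, 1.2e-6),
    "CIFAR-10": (66.53, 114.7, 1.89e-5),
    "CIFAR-100": (9.89, 82.6, 8.1e-7),
}

def efficiency(acc, seconds, co2_kg):
    """Return (accuracy points per second, accuracy points per milligram of CO2)."""
    co2_mg = co2_kg * 1e6  # 1 kg = 1e6 mg
    return acc / seconds, acc / co2_mg

for name, row in results.items():
    per_s, per_mg = efficiency(*row)
    print(f"{name:10s} {per_s:7.3f} acc-pts/s   {per_mg:8.2f} acc-pts/mg CO2")
```

Comparing these ratios against the same quantities for other optimizers would be needed to support any ranking; this sketch covers AdaMax only.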

5. Comparison with Adam and RMSProp

AdaMax’s use of L∞-norm normalization differentiates it operationally from Adam and RMSProp, both of which rely on exponentially weighted L2-norm statistics:

  • Resilience to Outliers: AdaMax’s per-coordinate max operation prevents the denominator from collapsing after spiky gradients, preserving the optimizer’s ability to make meaningful updates without indefinite attenuation (Kingma et al., 2014, Shulman, 2023).
  • Simplicity: No bias correction is required for $u_t$ (unlike an exponential moving average initialized at zero, the max-based recursion is not biased toward zero in early steps); no square roots are required, in contrast to Adam and RMSProp.
  • Potential Drawbacks: A single large gradient can dominate $u_{t,i}$ for many steps, since it decays only by a factor of $\beta_2$ per step. This keeps step sizes for that coordinate small long after the spike (“over-penalization”) and can make adaptation too conservative for some tasks. Adam’s RMS approach can adapt more finely in these cases (Ruder, 2016, Shulman, 2023).

Table: Adam vs. AdaMax Update Differences

| Feature | Adam | AdaMax |
|---|---|---|
| 2nd moment estimate | $v_t$ (L2) | $u_t$ (L∞) |
| Denominator bias correction | Yes | No |
| Step size after outlier | Can become very small | Decays but cannot collapse |
| Convergence bounds | $O(\sqrt{T})$ | $O(\sqrt{T})$ |
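The difference in the denominators can be seen side by side in a small sketch. It computes, for one coordinate, Adam's raw and bias-corrected root second moment and AdaMax's moving maximum. The gradient stream (a constant 0.1 with one spike of 10) and the function name `denominators` are hypothetical.

```python
import math

def denominators(grads, beta2=0.999):
    """Per step, return (sqrt(v), sqrt(v_hat), u) for a single coordinate."""
    v, u, rows = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g   # Adam: moving average of squared gradients
        v_hat = v / (1.0 - beta2 ** t)          # Adam: bias-corrected second moment
        u = max(beta2 * u, abs(g))              # AdaMax: moving maximum, used as-is
        rows.append((math.sqrt(v), math.sqrt(v_hat), u))
    return rows

# Hypothetical stream: small constant gradients with a single spike at t=6.
grads = [0.1] * 5 + [10.0] + [0.1] * 5
rows = denominators(grads)

for t, (raw, corrected, u) in enumerate(rows, start=1):
    print(f"t={t:2d}  sqrt(v)={raw:.4f}  sqrt(v_hat)={corrected:.4f}  u={u:.4f}")
```

At the first step the uncorrected Adam value is far below the true gradient magnitude of 0.1, while both the corrected value and the AdaMax estimate match it, which illustrates why only Adam's estimate needs the correction.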

6. Convergence Properties and Theoretical Guarantees

Kingma & Ba established that AdaMax maintains the $O(\sqrt{T})$ regret bound typical of adaptive first-order optimizers in the online convex optimization framework under standard assumptions (bounded gradients and appropriately decaying moment parameters) (Kingma et al., 2014). The max operation in $u_t$ satisfies the non-increasing weighting structure that the convergence analysis requires. Omitting the second-moment bias correction in AdaMax does not affect the convergence guarantees, and no degradation has been reported either empirically or theoretically (Kingma et al., 2014, Shulman, 2023, Ruder, 2016).
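For readers unfamiliar with the online convex optimization setting, the regret in question can be written out as below. This assumes the standard definition used by Kingma & Ba, with convex losses $f_t$, iterates $\theta_t$, and a fixed comparator $\theta^*$, and assumes the bound referred to above is $O(\sqrt{T})$.

```latex
\begin{align*}
R(T) &= \sum_{t=1}^{T} \bigl[ f_t(\theta_t) - f_t(\theta^*) \bigr],
\qquad \theta^* = \arg\min_{\theta} \sum_{t=1}^{T} f_t(\theta) \\
R(T) &= O(\sqrt{T})
\quad\Longrightarrow\quad
\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right) \to 0
\ \text{ as } T \to \infty
\end{align*}
```

In words, a sublinear regret bound means the average per-step loss approaches that of the best fixed parameter vector chosen in hindsight.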

7. Best Practices and Use Cases

Practitioners are advised to consider AdaMax primarily when:

  • Training with extremely large, sparse, or bursty gradients (notably in deep RNNs or other settings with markedly unstable training dynamics).
  • Experiencing stalling or non-recovery in Adam after rare gradient spikes (Shulman, 2023).
  • Seeking maximal numerical stability at the expense of slower convergence and increased energy use (Almog, 16 Sep 2025).

For routine or resource-constrained tasks, other optimizers (notably AdamW or NAdam) usually offer better speed and efficiency with similar or better generalization (Almog, 16 Sep 2025). AdaMax serves as a specialist optimizer for stability-critical settings rather than a default option for energy-efficient, large-scale workloads.


Key references:

  • D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization" (Kingma et al., 2014)
  • "Optimization Methods in Deep Learning: A Comprehensive Overview" (Shulman, 2023)
  • "An overview of gradient descent optimization algorithms" (Ruder, 2016)
  • "An Analysis of Optimizer Choice on Energy Efficiency and Performance in Neural Network Training" (Almog, 16 Sep 2025)
