AdaMax: Robust Adaptive Optimization

Updated 6 April 2026

AdaMax is an adaptive optimization algorithm that leverages the L∞ norm for gradient normalization, ensuring robustness against bursty or sparse gradients.
It maintains per-parameter first moment estimates and uses an exponential moving maximum instead of the L2-norm, enhancing numerical stability.
AdaMax is particularly useful in situations with large gradient spikes, though its conservative update steps may lead to slower convergence compared to alternatives.

AdaMax is an adaptive stochastic optimization algorithm and a member of the Adam family of optimizers, characterized by the use of the exponentially weighted infinity norm (L∞-norm) to estimate gradient magnitude. It was introduced by Kingma and Ba in "Adam: A Method for Stochastic Optimization" (Kingma et al., 2014) and formalized as a robust alternative to Adam for scenarios with large, sparse, or “spiky” gradients. AdaMax maintains per-parameter first moment estimates and normalizes updates by the maximum absolute gradient in each coordinate, instead of the root-mean-square used by Adam.

1. Theoretical Motivation and Design

AdaMax replaces the second-moment (L₂-norm–based) scaling in Adam with an L∞-norm–style bound on the gradient. For a parameter vector $\theta$ and stochastic gradient $g_t$ at iteration $t$, Adam computes an exponentially decaying average of the squared gradients (second moment), leading to normalization by the root-mean-square. In contexts with heavy-tailed, sparse, or highly variable gradients, the L₂-based moment is sensitive to outliers and large excursions, sometimes causing update scaling that either collapses or stalls.

AdaMax addresses this by maintaining, for each parameter $i$, the exponential moving maximum:

$u_{t,i} = \max(\beta_2 u_{t-1,i}, |g_{t,i}|)$

where $\beta_2$ is a decay parameter and $u_{0,i} = 0$. This estimate is more robust to occasional bursts in gradient magnitude: when an outlier appears, the normalization for that parameter “remembers” the large value without skewing the scale applied to other parameters (Kingma et al., 2014, Shulman, 2023, Ruder, 2016). This substitution results in greater numerical stability and insensitivity to large, rare gradients.
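
For context, the recurrence can be read as the $p \to \infty$ limit of an L^p generalization of Adam's second-moment estimate, following the derivation given by Kingma and Ba. The intermediate quantity $v_t$ and the summation index $k$ are notation chosen for this sketch, and the expressions apply to each coordinate separately:

$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p = (1 - \beta_2^p) \sum_{k=1}^{t} \beta_2^{p(t-k)} |g_k|^p$

$u_t = \lim_{p \to \infty} (v_t)^{1/p} = \max(\beta_2^{t-1} |g_1|, \beta_2^{t-2} |g_2|, \ldots, \beta_2 |g_{t-1}|, |g_t|)$

Unrolling the maximum one step at a time gives the recursive form $u_t = \max(\beta_2 u_{t-1}, |g_t|)$ with $u_0 = 0$.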

2. Mathematical Formulation and Pseudocode

AdaMax operates as follows:

Let $\theta_t$ denote the parameter vector at step $t$.
$g_t = \nabla_{\theta} L(\theta_{t-1})$ is the gradient of the objective.
$m_t$ tracks the exponentially weighted first moment of the gradient:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

where $\beta_1$ is typically 0.9.

$u_t$ is the exponentially weighted infinity norm (L∞ estimation):

$u_t = \max(\beta_2 u_{t-1}, |g_t|)$

where $\beta_2$ is typically 0.999.

Only the first moment estimator is subject to bias correction:

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$

The AdaMax parameter update for each component $i$ is:

$\theta_{t,i} = \theta_{t-1,i} - \alpha \frac{\hat{m}_{t,i}}{u_{t,i} + \epsilon}$

where $\alpha$ is the learning rate (default $0.002$) and $\epsilon$ a small constant for numerical stability (default $10^{-8}$) (Kingma et al., 2014, Shulman, 2023, Ruder, 2016).

Pseudocode:

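Require: step size $\alpha$, decay rates $\beta_1, \beta_2 \in [0, 1)$, stability constant $\epsilon$, initial parameters $\theta_0$
Initialize $m_0 = 0$, $u_0 = 0$, $t = 0$
While $\theta_t$ has not converged:
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_{\theta} L(\theta_{t-1})$
    $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$
    $u_t \leftarrow \max(\beta_2 u_{t-1}, |g_t|)$
    $\theta_t \leftarrow \theta_{t-1} - \frac{\alpha}{1 - \beta_1^t} \cdot \frac{m_t}{u_t + \epsilon}$
Return $\theta_t$
(Kingma et al., 2014, Shulman, 2023, Ruder, 2016)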
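
The update rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `adamax_step` and `minimize_quadratic`, the list-of-floats representation, and the toy objective are assumptions made here, and adding $\epsilon$ to the denominator follows the formulation in this section.

```python
def adamax_step(theta, grad, m, u, t,
                alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdaMax update to lists of floats; returns (theta, m, u)."""
    bias_correction = 1.0 - beta1 ** t  # only the first moment is corrected
    new_theta, new_m, new_u = [], [], []
    for th, g, m_i, u_i in zip(theta, grad, m, u):
        m_i = beta1 * m_i + (1.0 - beta1) * g   # first-moment moving average
        u_i = max(beta2 * u_i, abs(g))          # exponential moving maximum
        m_hat = m_i / bias_correction
        new_theta.append(th - alpha * m_hat / (u_i + eps))
        new_m.append(m_i)
        new_u.append(u_i)
    return new_theta, new_m, new_u


def minimize_quadratic(steps=2000):
    """Run AdaMax on f(x, y) = x^2 + 10 y^2, a toy objective for illustration."""
    theta = [1.0, -1.0]
    m = [0.0, 0.0]
    u = [0.0, 0.0]
    for t in range(1, steps + 1):  # t starts at 1 so the bias correction is nonzero
        grad = [2.0 * theta[0], 20.0 * theta[1]]
        theta, m, u = adamax_step(theta, grad, m, u, t)
    return theta


if __name__ == "__main__":
    print(minimize_quadratic())
```

On this toy problem both coordinates follow almost the same trajectory even though their gradients differ by a factor of ten, because each coordinate's update is divided by its own $u_{t,i}$.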

3. Hyperparameter Choices and Implementation Practices

The canonical parameter values recommended by Kingma & Ba and in subsequent surveys are:

Learning rate $\alpha = 0.002$
First moment decay $\beta_1 = 0.9$
Infinity-norm decay $\beta_2 = 0.999$
Numerical stability constant $\epsilon = 10^{-8}$

Practical tuning guidance:

If training is slow, increase $\alpha$; if diverging, reduce $\alpha$
Higher $\beta_1$ smooths momentum but slows adaptation
Lower $\beta_2$ allows faster adaptation of $u_t$ to gradient bursts
$\epsilon$ is typically left at $10^{-8}$ unless sparsity causes instability (Shulman, 2023, Ruder, 2016)
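
To make the effect of the infinity-norm decay concrete, the sketch below computes how many steps it takes for $u_t$ to halve after a spike, assuming no later gradient exceeds the decayed value (so that $u_t$ simply shrinks by a factor of $\beta_2$ per step). The helper name `decay_half_life` is chosen here for illustration.

```python
import math


def decay_half_life(beta2):
    """Number of steps k with beta2**k == 0.5, i.e. until u_t has halved."""
    return math.log(0.5) / math.log(beta2)


if __name__ == "__main__":
    for beta2 in (0.9, 0.99, 0.999):
        steps = decay_half_life(beta2)
        print(f"beta2={beta2}: u_t halves after about {steps:.1f} steps")
```

With the default of 0.999 the half-life is roughly 693 steps, against roughly 69 steps at 0.99, which is why lowering $\beta_2$ lets the normalizer recover from a burst more quickly.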

4. Empirical Performance and Energy Efficiency

A broad empirical study on Apple M1 Pro hardware compared AdaMax with Adam, SGD, AdamW, NAdam, and others on MNIST, CIFAR-10, and CIFAR-100, using modern CNNs and 15 random seeds per setting (Almog, 16 Sep 2025). The reported AdaMax results were:

Dataset	Final Accuracy	Training Duration (s)	CO₂ Emissions (kg)
MNIST	98.00% ± 0.33%	18.1 ± 4.6	1.2×10⁻⁶ ± 5.3×10⁻⁷
CIFAR-10	66.53% ± 4.14% (top)	114.7 ± 22.9	1.89×10⁻⁵ ± 3.7×10⁻⁶
CIFAR-100	9.89% ± 5.11%	82.6 ± 43.8	8.1×10⁻⁷ ± 4.5×10⁻⁷

Empirical findings include:

AdaMax generally produces top or near-top accuracy on moderate-complexity tasks (best on CIFAR-10).
Training duration is among the slowest due to more conservative steps imposed by the ∞-norm scaling.
CO₂ emissions are highest on tasks where long convergence times are required (notably in CIFAR-10); emission levels are middling for simpler tasks.
Efficiency (accuracy per time or kg of emissions) is lower than AdamW, NAdam, or even SGD in most regimes.

The results indicate that although AdaMax is robust and often reaches high accuracy, its environmental footprint and runtime should be weighed in practice (Almog, 16 Sep 2025).
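
As a rough illustration of the efficiency ratios mentioned in the findings, the sketch below divides the mean AdaMax accuracy by the mean training duration and by the mean emissions from the table. The exact efficiency metric used in the study is not given here, so these simple ratios, the function names, and the decision to ignore the reported uncertainties are all assumptions.

```python
# Mean AdaMax values copied from the table; the ± uncertainties are ignored.
RESULTS = {
    "MNIST": {"accuracy_pct": 98.00, "duration_s": 18.1, "co2_kg": 1.2e-6},
    "CIFAR-10": {"accuracy_pct": 66.53, "duration_s": 114.7, "co2_kg": 1.89e-5},
    "CIFAR-100": {"accuracy_pct": 9.89, "duration_s": 82.6, "co2_kg": 8.1e-7},
}


def accuracy_per_second(dataset):
    """Accuracy points gained per second of training (illustrative ratio)."""
    row = RESULTS[dataset]
    return row["accuracy_pct"] / row["duration_s"]


def accuracy_per_kg_co2(dataset):
    """Accuracy points per kilogram of CO2 emitted (illustrative ratio)."""
    row = RESULTS[dataset]
    return row["accuracy_pct"] / row["co2_kg"]


if __name__ == "__main__":
    for name in RESULTS:
        print(name, accuracy_per_second(name), accuracy_per_kg_co2(name))
```

Comparing optimizers this way would require the corresponding rows for AdamW, NAdam, and SGD, which are not reproduced in this article.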

5. Comparison with Adam and RMSProp

AdaMax’s L∞-norm normalization differentiates it operationally from Adam and RMSProp, both of which rely on exponentially weighted L2-norm statistics:

Resilience to Outliers: AdaMax’s per-coordinate max operation prevents the denominator from collapsing after spiky gradients, preserving the optimizer’s ability to make meaningful updates without indefinite attenuation (Kingma et al., 2014, Shulman, 2023).
Simplicity: No bias correction is required for $u_t$ (unlike an exponential moving average, the max operation is not pulled toward zero by the zero initialization); no square roots are required, in contrast to Adam and RMSProp.
Potential Drawbacks: A single large gradient sets $u_{t,i}$ for that coordinate to the spike's magnitude, and because the estimate then decays only by a factor of $\beta_2$ per step, step sizes there can stay reduced for many iterations (“over-penalization”), potentially making adaptation too conservative for some tasks. Adam’s RMS approach can adapt more finely in these cases (Ruder, 2016, Shulman, 2023).

Table: Adam vs. AdaMax Update Differences

Feature	Adam	AdaMax
2nd moment estimate	$v_t$ (L2)	$u_t$ (L∞)
Denominator bias corr.	Yes	No
Step size after outlier	Can become very small	Decays but cannot collapse
Convergence bounds	$O(\sqrt{T})$	$O(\sqrt{T})$
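
The behaviour of the two denominators around an outlier can be inspected with a small simulation. The sketch below tracks one coordinate whose gradient magnitude is small except for a single spike; the function name, the spike and baseline magnitudes, and the step counts are illustrative assumptions, and Adam's denominator is taken as the square root of the bias-corrected second moment plus $\epsilon$.

```python
import math


def denominators_after_spike(spike=100.0, quiet=0.1, warmup=50, tail=200,
                             beta2=0.999, eps=1e-8):
    """Return (t, adam_denominator, adamax_denominator) for each step.

    The gradient magnitude is `quiet` on every step except step warmup + 1,
    where it equals `spike`.
    """
    v, u = 0.0, 0.0
    history = []
    for t in range(1, warmup + tail + 2):
        g = spike if t == warmup + 1 else quiet
        v = beta2 * v + (1.0 - beta2) * g * g   # Adam: moving average of g^2
        u = max(beta2 * u, abs(g))              # AdaMax: moving maximum of |g|
        v_hat = v / (1.0 - beta2 ** t)          # Adam bias correction
        history.append((t, math.sqrt(v_hat) + eps, u + eps))
    return history


if __name__ == "__main__":
    for t, adam_den, adamax_den in denominators_after_spike()[48:56]:
        print(t, round(adam_den, 4), round(adamax_den, 4))
```

In this setup the AdaMax denominator equals the spike magnitude on the spike step and then shrinks by a factor of $\beta_2$ per step, while the Adam denominator jumps by an amount that depends on $1 - \beta_2$ and on the bias correction at that step.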

6. Convergence Properties and Theoretical Guarantees

Kingma & Ba established that AdaMax maintains the $O(\sqrt{T})$ regret bound typical of adaptive first-order optimizers in the online convex optimization framework under standard assumptions (bounded, appropriately decaying moments) (Kingma et al., 2014). The max operation in $u_t$ satisfies the non-increasing weighting structure that the convergence analysis requires. Omission of the second-moment bias correction in AdaMax does not impact convergence guarantees, and no degradation was observed empirically or theoretically (Kingma et al., 2014, Shulman, 2023, Ruder, 2016).
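
For reference, regret in the online convex optimization setting is usually defined as follows. Here $f_t$ denotes the convex loss revealed at step $t$ and $\theta^*$ the best fixed parameter in hindsight; both symbols are notation assumed for this sketch rather than taken from the text above.

$R(T) = \sum_{t=1}^{T} [f_t(\theta_t) - f_t(\theta^*)]$, with $\theta^* = \arg\min_{\theta} \sum_{t=1}^{T} f_t(\theta)$

$R(T) = O(\sqrt{T}) \implies \frac{R(T)}{T} = O(1/\sqrt{T}) \to 0$ as $T \to \infty$

A bound of order $\sqrt{T}$ therefore means the average per-step loss approaches that of the best fixed parameter as the number of steps grows.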

7. Best Practices and Use Cases

Practitioners are advised to consider AdaMax primarily when:

Training with extremely large, sparse, or bursty gradients (notably in deep RNNs or scenarios with significant unstable dynamics).
Experiencing stalling or non-recovery in Adam after rare gradient spikes (Shulman, 2023).
Seeking maximal numerical stability at the expense of slower convergence and increased energy use (Almog, 16 Sep 2025).

For routine or resource-constrained tasks, other optimizers (notably AdamW or NAdam) usually offer better speed and efficiency with similar or better generalization (Almog, 16 Sep 2025). AdaMax serves as a specialist optimizer for stability-critical settings rather than a default option for energy-efficient, large-scale workloads.

Key references:

D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization" (Kingma et al., 2014)
"Optimization Methods in Deep Learning: A Comprehensive Overview" (Shulman, 2023)
"An overview of gradient descent optimization algorithms" (Ruder, 2016)
"An Analysis of Optimizer Choice on Energy Efficiency and Performance in Neural Network Training" (Almog, 16 Sep 2025)