Nadam: Nesterov-accelerated Adam

Updated 6 April 2026

Nadam is an adaptive optimization algorithm that integrates Adam's moment estimation with Nesterov's look-ahead updates for improved convergence.
Its update rule applies bias correction to both moment estimates and adapts the effective learning rate per parameter, which suits it to training deep networks.
Empirical results show Nadam frequently outperforms Adam and RMSProp, especially on non-convex, noisy, and sparse gradient problems.

Nadam (Nesterov-accelerated Adaptive Moment Estimation) is an adaptive optimization algorithm that combines the adaptive moment techniques of Adam with the anticipatory "look-ahead" property of Nesterov-accelerated gradient (NAG) methods. By integrating these two strategies, Nadam often converges faster and more stably than Adam, especially in deep and complex models characterized by highly non-convex loss surfaces. Nadam is widely used in modern deep learning libraries and architectures, often with minimal hyperparameter tuning, to accelerate training across a variety of applications (Ruder, 2016, Shulman, 2023, Ifeanyi et al., 2024, Jiang et al., 2023).

1. Algorithmic Foundation and Derivation

Nadam extends the Adam optimizer by modifying the momentum term with a Nesterov-style "look-ahead" update. In Adam, two exponential moving averages are maintained:

The first moment estimate (mean of gradients): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
The second moment estimate (uncentered variance): $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$

Because $m_0 = 0$ and $v_0 = 0$, both estimates are biased toward zero during the first steps, so bias correction is applied:

$\hat m_t = m_t / (1 - \beta_1^t)$
$\hat v_t = v_t / (1 - \beta_2^t)$
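To illustrate why the correction matters, consider a constant gradient: the raw average $m_t$ starts near zero and approaches the true value only gradually, while $\hat m_t$ recovers it at every step. The following is a minimal Python sketch of that behavior; the helper name `first_moment_trace` and the gradient value are illustrative assumptions, not part of the original formulation.

```python
def first_moment_trace(grad, beta1=0.9, steps=5):
    """Return (raw, corrected) first-moment estimates for a constant gradient."""
    m = 0.0  # m_0 = 0 is the source of the initialization bias
    raw, corrected = [], []
    for t in range(1, steps + 1):
        m = beta1 * m + (1.0 - beta1) * grad   # exponential moving average
        m_hat = m / (1.0 - beta1 ** t)         # bias-corrected estimate
        raw.append(m)
        corrected.append(m_hat)
    return raw, corrected


raw, corrected = first_moment_trace(grad=2.0)
# raw[0] is only about (1 - beta1) * grad = 0.2,
# while every entry of `corrected` stays at about 2.0
```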

Adam’s parameter update is:

$\theta_{t+1} = \theta_t - \eta \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$

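Nadam changes how the first moment enters the update. Expanding $\hat m_t$ in Adam's update, and approximating $m_{t-1} / (1 - \beta_1^t)$ by $\hat m_{t-1}$, expresses Adam's numerator as a combination of the previous momentum estimate and the current gradient:

$\hat m_t \approx \beta_1 \hat m_{t-1} + \frac{1-\beta_1}{1-\beta_1^t} g_t$

Nadam applies the Nesterov "look-ahead" by replacing the previous momentum estimate $\hat m_{t-1}$ in this expression with the current estimate $\hat m_t$:

$\tilde m_t = \beta_1 \hat m_t + \frac{1-\beta_1}{1-\beta_1^t} g_t$

The full Nadam parameter update becomes:

$\theta_{t+1} = \theta_t - \eta \frac{\tilde m_t}{\sqrt{\hat v_t} + \epsilon}$

The look-ahead therefore reuses the current gradient $g_t$ together with the already-updated momentum; no additional gradient evaluation is needed. Presentations differ in how they index and bias-correct the momentum term, but the update rule remains structurally similar (Ruder, 2016, Shulman, 2023, Jiang et al., 2023).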
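As a minimal sketch, the following Python function computes one Adam step and one Nadam step for a single scalar parameter, using the $\beta_1 \hat m_t$ form of the look-ahead term. The function name `adam_and_nadam_steps` and the default values are assumptions made for illustration.

```python
import math


def adam_and_nadam_steps(g, m_prev, v_prev, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (adam_step, nadam_step, m, v) for one scalar parameter at step t >= 1."""
    m = beta1 * m_prev + (1.0 - beta1) * g        # first moment
    v = beta2 * v_prev + (1.0 - beta2) * g * g    # second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    denom = math.sqrt(v_hat) + eps
    adam_step = lr * m_hat / denom
    # Look-ahead numerator: current momentum estimate plus the
    # bias-corrected current gradient.
    m_tilde = beta1 * m_hat + (1.0 - beta1) / (1.0 - beta1 ** t) * g
    nadam_step = lr * m_tilde / denom
    return adam_step, nadam_step, m, v


adam_step, nadam_step, m, v = adam_and_nadam_steps(g=0.5, m_prev=0.0, v_prev=0.0, t=1)
```

On the very first step, where $m_0 = 0$ gives $\hat m_1 = g_1$, this form makes the Nadam step $(1 + \beta_1)$ times the Adam step; the two steps move closer together when successive gradients stay similar.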

2. Mathematical Formulation and Pseudocode

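The update rules of Section 1 combine into a single procedure, with $g_t$ denoting the gradient of the loss with respect to the parameters at step $t$ and the look-ahead term written in its $\beta_1 \hat m_t$ form. Canonical pseudocode for Nadam follows:

Initialize $\theta_1$, $m_0 = 0$, $v_0 = 0$.
For $t = 1, 2, \dots$ until convergence:
Compute the gradient $g_t$ at $\theta_t$.
$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
$\hat m_t = m_t / (1 - \beta_1^t)$
$\hat v_t = v_t / (1 - \beta_2^t)$
$\tilde m_t = \beta_1 \hat m_t + \frac{1-\beta_1}{1-\beta_1^t} g_t$
$\theta_{t+1} = \theta_t - \eta \frac{\tilde m_t}{\sqrt{\hat v_t} + \epsilon}$

(Ruder, 2016, Shulman, 2023).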
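The update rules of Section 1 can be assembled into a runnable loop. The sketch below is one possible pure-Python rendering, assuming the $\beta_1 \hat m_t$ form of the look-ahead term; the function name `nadam_minimize`, the toy quadratic objective, and the chosen learning rate are illustrative assumptions rather than a reference implementation.

```python
import math


def nadam_minimize(grad_fn, theta0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run Nadam for a fixed number of steps and return the final parameters."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first-moment estimates, m_0 = 0
    v = [0.0] * len(theta)  # second-moment estimates, v_0 = 0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_i * g_i
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Nesterov-style look-ahead on the bias-corrected momentum
            m_tilde = beta1 * m_hat + (1.0 - beta1) / (1.0 - beta1 ** t) * g_i
            theta[i] -= lr * m_tilde / (math.sqrt(v_hat) + eps)
    return theta


def quadratic_grad(theta):
    """Gradient of the toy objective (x - 3)^2 + 10 * (y + 1)^2."""
    x, y = theta
    return [2.0 * (x - 3.0), 20.0 * (y + 1.0)]


theta = nadam_minimize(quadratic_grad, theta0=[0.0, 0.0])
```

The toy objective has its minimum at $(3, -1)$ and deliberately uses different curvatures per coordinate, so the per-parameter rescaling by $\sqrt{\hat v_t}$ is exercised.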

3. Hyperparameters and Convergence Guarantees

Nadam uses the same four principal hyperparameters as Adam:

Learning rate $\eta$: Typically $10^{-3}$ (legacy Keras releases defaulted to $2 \times 10^{-3}$); robust to moderate scaling.
First moment decay $\beta_1$: Default $0.9$; higher values increase momentum memory, lower values reduce inertia.
Second moment decay $\beta_2$: Default $0.999$; controls smoothness of adaptive rescaling.
Numerical stability constant $\epsilon$: Default $10^{-8}$ (Keras uses $10^{-7}$); prevents division by zero.

Convergence analysis (under the UAdam framework) establishes that Nadam, with bias corrections and appropriate learning rates, exhibits an $\mathcal{O}(1/T)$ convergence rate to a neighborhood of stationary points, where $T$ is the number of iterations. Crucially, only the first-moment parameter $\beta_1$ needs to be close to 1 for convergence; $\beta_2$ can be fixed (e.g., at typical practical values) without harming asymptotic behavior. The residual neighborhood's size contracts as $\beta_1 \to 1$, providing a direct theoretical explanation for practical stability and efficiency with default settings (Jiang et al., 2023).

If the problem is characterized by very noisy gradients, lowering $\beta_1$ reduces reliance on stale momentum. Highly sparse or abruptly changing losses may benefit from a slightly lower $\beta_2$, which improves responsiveness but increases the variance of the adaptive scaling (Ruder, 2016, Shulman, 2023).
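A common rule of thumb, offered here as an approximation rather than a result from the cited sources, is that an exponential moving average with decay $\beta$ averages over roughly $1/(1-\beta)$ recent steps. The short sketch below makes the trade-off concrete; the helper name `effective_horizon` and the specific lowered values are illustrative assumptions.

```python
def effective_horizon(beta):
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return 1.0 / (1.0 - beta)


# Defaults alongside illustrative lowered values (not taken from the sources)
for name, beta in [("beta1=0.9", 0.9), ("beta1=0.8", 0.8),
                   ("beta2=0.999", 0.999), ("beta2=0.99", 0.99)]:
    print(f"{name}: ~{effective_horizon(beta):.0f} steps")
```

Under this approximation, lowering either decay rate shortens the memory of the corresponding estimate, which is why it trades staleness for variance.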

4. Empirical Performance and Comparative Analysis

Empirical studies, including standard benchmarks and fault isolation/classification in industrial anomaly detection, consistently show that Nadam matches or outperforms Adam and RMSProp in convergence rate, especially during early training or on highly non-convex objectives.

For example, in control rod drive diagnostics using deep 1D CNN autoencoders and classifiers:

Nadam achieved the lowest median validation loss across 30 runs for isolation (reconstruction error).
Nadam converged fastest, often reaching its minimum in fewer epochs than the alternatives, and yielded a lower final loss than Adam, RMSProp, and SGD.
For classification, Adam and Nadam both achieved 100% test accuracy, with Nadam exhibiting tighter distributions of loss across runs.
RMSProp sometimes matched Nadam’s speed but settled at higher loss; SGD was slower and less reliable unless extensively tuned (Ifeanyi et al., 2024).

On standard benchmarks (e.g., CIFAR-10), Nadam achieves 80–90% of peak accuracy in approximately 10% fewer epochs than Adam. However, improvements in final model quality are often moderate, and in recurrent language modeling, learning rate scheduling can diminish the observed differences (Shulman, 2023).

5. Strengths, Limitations, and Best Practices

Strengths

Faster and more stable convergence, particularly in deep feed-forward, convolutional, and recurrent network architectures.
Retains Adam's robustness to learning rate scheduling and effectiveness with sparse, noisy, or nonstationary gradients.
Minimal hyperparameter tuning is required for strong baseline performance in diverse practical settings (Ruder, 2016, Ifeanyi et al., 2024).

Limitations

Marginal additional computational overhead per iteration due to the extra "look-ahead" vector addition.
Empirical gains over Adam may be modest on certain tasks, and in some cases, Adam may suffice.
Excessively large or small $\eta$ or inappropriate decay rates can degrade stability (Ruder, 2016, Shulman, 2023).

Best Practices

Nadam is a suitable "first try" optimizer for deep networks; start with default parameters, then perform minimal tuning if needed.
Use decoupled weight decay rather than classical $L_2$ penalties to avoid distorting moment estimates.
Warm restarts with cosine annealing or cyclical learning rates can further accelerate convergence.
For highly sparse or variable data, consider short "warmup" periods or slightly elevated $\epsilon$ values (Shulman, 2023).
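Two of the practices above can be sketched in a few lines of Python. The first function is a simplified cosine schedule with warm restarts that assumes a fixed cycle length (published variants also allow cycles to grow); the second applies weight decay directly to the parameters so that it never enters the gradient and therefore never affects $m_t$ or $v_t$. Both function names and all default values are assumptions for illustration.

```python
import math


def cosine_warm_restart_lr(step, period, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate that restarts every `period` steps."""
    t_cur = step % period  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / period))


def apply_decoupled_weight_decay(theta, lr, weight_decay):
    """Shrink parameters directly, leaving the gradient and moment estimates untouched."""
    return [p - lr * weight_decay * p for p in theta]


# Learning rate at the start, middle, and restart of a 100-step cycle
schedule = [cosine_warm_restart_lr(s, period=100) for s in (0, 50, 100)]
decayed = apply_decoupled_weight_decay([1.0, -2.0], lr=0.1, weight_decay=0.01)
```

In a training loop, the decay would be applied alongside the adaptive step for each parameter, using the learning rate returned by the schedule for the current step.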

6. Application Domains and Practical Recommendations

Nadam is deployed across a spectrum of deep learning applications, including:

Deep convolutional neural networks for image classification
Recurrent neural networks for sequence modeling and natural language processing
Encoder–decoder and autoencoder architectures for anomaly detection in time-series and control systems

In applied research on control rod drive fault detection, Nadam was preferred for its consistently fast and low-loss convergence in high-dimensional, nonstationary, and noisy environments. Recommendations from empirical benchmarking include:

Use Nadam with Keras default settings for efficient training of autoencoders in detection tasks.
For classification, Nadam competes with Adam, RMSProp, and—after extensive tuning—SGD.
Always average results across multiple independent initializations (at least 20–30) to fairly compare optimizer performance (Ifeanyi et al., 2024).
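The multi-run comparison recommended above might be organized as in the following sketch. Here `toy_run` is a hypothetical stand-in for a real training run that returns a final validation loss for a given seed, and `summarize_runs` is an illustrative helper, not part of any cited study; the median and interquartile range are reported because they are less sensitive to occasional diverged runs than the mean.

```python
import random
import statistics


def summarize_runs(run_fn, n_runs=30):
    """Run `run_fn(seed)` for seeds 0..n_runs-1 and summarize the final losses."""
    losses = [run_fn(seed) for seed in range(n_runs)]
    q1, median, q3 = statistics.quantiles(losses, n=4)
    return {"median": median, "iqr": q3 - q1, "min": min(losses), "max": max(losses)}


def toy_run(seed):
    """Hypothetical stand-in for one training run; returns a noisy final loss."""
    rng = random.Random(seed)
    return 0.1 + abs(rng.gauss(0.0, 0.02))


summary = summarize_runs(toy_run)
```

In practice, one such summary would be computed per optimizer with the same set of seeds, so that differences reflect the optimizer rather than the initialization.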

7. Theoretical Guarantees and Unified Optimizer View

From the perspective of algorithmic unification, Nadam is a special case of the UAdam algorithmic framework, corresponding to a particular setting of the framework's interpolation parameter combined with the standard Adam-style variance estimate. The UAdam analysis provides a non-convex convergence guarantee for Nadam under weak assumptions, showing that only the first-moment coefficient must be near unity for convergence, with no restrictive requirements on the second-moment factor. This insight not only explains Nadam's practical reliability but also guides the tuning of related algorithms (Jiang et al., 2023).

In summary, Nadam represents a theoretically sound and empirically validated extension of Adam, efficiently blending adaptive step-size selection and Nesterov’s look-ahead momentum, and offers robust performance across a wide range of non-convex learning scenarios (Ruder, 2016, Shulman, 2023, Ifeanyi et al., 2024, Jiang et al., 2023).