ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning (2006.00719v3)

Published 1 Jun 2020 in cs.LG, cs.NA, math.NA, and stat.ML

Abstract: We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.

Citations (251)

Summary

  • The paper introduces AdaHessian, which leverages adaptive Hessian approximations to combine first- and second-order optimization benefits.
  • The methodology combines Hutchinson-based estimation of the Hessian diagonal, root-mean-square exponential moving average smoothing across iterations, and block diagonal averaging to reduce the variance of the curvature estimates.
  • Empirical evaluations demonstrate that AdaHessian improves accuracy by up to 5.55% over optimizers like Adam on various machine learning tasks.

AdaHessian: An Adaptive Second Order Optimizer for Machine Learning

The paper, "AdaHessian: An Adaptive Second Order Optimizer for Machine Learning," introduces a nuanced optimizer known as AdaHessian. This optimizer emerges from the intersection of first-order and second-order optimization techniques, aiming to leverage the advantages of Hessian information while mitigating its common computational overhead challenges.

AdaHessian is built on adaptive estimates of the Hessian diagonal. Traditional second-order methods offer stronger convergence properties than first-order methods such as SGD and Adam, but at a much higher cost per iteration. To circumvent this limitation, the authors propose several strategies.
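
To make "adaptive estimates of the Hessian" concrete, the basic primitive is a Hessian-vector product, which modern automatic differentiation frameworks can compute at roughly the cost of one extra backward pass, without ever materializing the full Hessian. The following minimal PyTorch sketch is ours, written for illustration rather than taken from the paper's code:

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Return H @ vec, where H is the Hessian of `loss` w.r.t. `params`.
    Uses double backpropagation: one extra backward pass, and the full
    Hessian is never formed."""
    # First backward pass, keeping the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Inner product <grad, vec>; differentiating it gives H @ vec.
    grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(grad_dot_v, params)
```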

Key Contributions

  1. Curvature Approximation: AdaHessian uses a fast Hutchinson-based method to estimate the diagonal of the Hessian with low computational overhead, retaining second-order information while keeping the per-iteration cost close to that of first-order methods.
  2. Hessian Smoothing: A root-mean-square exponential moving average smooths the Hessian-diagonal estimates across iterations, which matters because local curvature is notoriously noisy in large-scale stochastic optimization.
  3. Block Diagonal Averaging: Averaging Hessian-diagonal elements within blocks further reduces their variance, making the optimizer less susceptible to stochastic noise on rugged loss landscapes (a simplified sketch combining all three ingredients follows this list).
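
The sketch below shows, under our own simplifying assumptions, how these three ingredients might fit together in a single optimizer step. It is an illustrative reconstruction based on the description above, not the authors' reference implementation; names, block sizes, and hyperparameter defaults are chosen for readability.

```python
import torch

def hutchinson_diag(loss, params, n_samples=1):
    """Hutchinson estimator: E[z * (H z)] equals diag(H) when z has i.i.d.
    Rademacher (+/-1) entries; each sample costs one Hessian-vector product
    (the double-backprop trick from the earlier snippet)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # Rademacher vectors
        h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hz in zip(diag, zs, h_zs):
            d.add_(z * hz / n_samples)
    return diag, [g.detach() for g in grads]

def block_average(d, block=32):
    """Average Hessian-diagonal entries within fixed-size blocks to cut variance.
    (The paper averages spatially, e.g. over a convolutional filter; a flat
    block is used here purely for illustration.)"""
    flat = d.flatten()
    n_full = (flat.numel() // block) * block
    head = flat[:n_full].view(-1, block)
    head = head.mean(dim=1, keepdim=True).expand_as(head).reshape(-1)
    return torch.cat([head, flat[n_full:]]).view_as(d)

def adahessian_step(params, grads, diag_h, state, lr=0.15, betas=(0.9, 0.999),
                    eps=1e-8, hessian_power=1.0):
    """One Adam-style step preconditioned by the smoothed Hessian diagonal."""
    beta1, beta2 = betas
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    for i, (p, g, d) in enumerate(zip(params, grads, diag_h)):
        d = block_average(d.abs())                      # (3) block diagonal averaging
        m = state.setdefault(("m", i), torch.zeros_like(p))
        v = state.setdefault(("v", i), torch.zeros_like(p))
        m.mul_(beta1).add_(g, alpha=1 - beta1)          # EMA of the gradient, as in Adam
        v.mul_(beta2).add_(d * d, alpha=1 - beta2)      # (2) RMS-style EMA of diag(H)
        m_hat = m / (1 - beta1 ** t)                    # bias correction
        v_hat = v / (1 - beta2 ** t)
        denom = v_hat.sqrt().pow(hessian_power) + eps   # Hessian power tempers the preconditioner
        with torch.no_grad():
            p.add_(m_hat / denom, alpha=-lr)
```

Structurally this mirrors Adam, with the gradient second moment replaced by the smoothed Hessian diagonal; the only extra per-iteration work is the Hessian-vector product inside hutchinson_diag.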

Performance Analysis

The paper presents empirical evaluations across multiple domains including computer vision (CV), NLP, and recommendation systems. AdaHessian demonstrates significant performance improvements over established optimizers such as Adam and AdamW. Notably:

  • On CV tasks, it achieves up to 5.55% higher accuracy on ImageNet compared to Adam.
  • For transformer-based NLP tasks, it outperforms AdamW by 0.13 and 0.33 BLEU on IWSLT14 and WMT14, respectively.
  • For recommendation systems, it scores 0.032% better than Adagrad with DLRM on the Criteo Ad Kaggle dataset.

Moreover, the optimizer's cost per iteration remains comparable to that of first-order methods, and it exhibits robustness to its hyperparameters, a critical property for practical deployment across diverse applications.
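
As a usage illustration, and reusing the hypothetical helpers sketched above (hutchinson_diag and adahessian_step are ours, not a published API), a training loop differs from a standard first-order loop only in that the Hessian-diagonal estimate is refreshed at each step:

```python
import torch

model = torch.nn.Linear(10, 1)                 # stand-in for any differentiable model
params = [p for p in model.parameters() if p.requires_grad]
state = {}
# Synthetic mini-batches, purely so the example runs end to end.
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(100)]

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Extra work per step relative to Adam: one Hutchinson estimate of diag(H),
    # i.e. roughly one additional backward-like pass.
    diag_h, grads = hutchinson_diag(loss, params, n_samples=1)
    adahessian_step(params, grads, diag_h, state)
```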

Implications and Future Prospects

The implications of AdaHessian are both theoretical and practical. Theoretically, it points toward more efficient second-order optimizers that balance computational cost against convergence speed. Practically, its robust performance and lower sensitivity to hyperparameter tuning suggest potential for broad adoption, from image classification to large-scale language models.

Future developments could explore approximations that extend beyond the diagonal of the Hessian or leverage advances in matrix-free methods for large-scale applications. Another direction is integrating AdaHessian with architecture-specific optimizations, extending its advantages to an even broader range of neural network architectures.

Overall, AdaHessian signifies a notable step forward in the application of second-order optimization techniques in machine learning, promising enhancements in both efficiency and accuracy across a wide array of tasks.