ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

Published 1 Jun 2020 in cs.LG, cs.NA, math.NA, and stat.ML | (2006.00719v3)

Abstract: We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.

Citations (251)

Summary

  • The paper presents AdaHessian, a novel optimizer that uses adaptive estimates of the Hessian diagonal to overcome limitations of first-order methods.
  • It employs Hutchinson's method, RMS exponential moving averages, and block diagonal averaging to stabilize curvature estimates with minimal overhead.
  • Experimental results show improved performance in image classification, NLP BLEU scores, and recommendation systems compared to Adam and related optimizers.

AdaHessian: An Adaptive Second-Order Optimizer for Machine Learning

Introduction

The paper introduces AdaHessian, a novel second-order optimization algorithm designed specifically for machine learning tasks. AdaHessian aims to overcome the limitations typically associated with first-order methods such as SGD and Adam, which are commonly used to train neural network (NN) models: sensitivity to hyperparameters and inefficiency from ignoring curvature information in the loss landscape. To address these issues, AdaHessian uses adaptive Hessian-based preconditioning to rescale each gradient step according to the local curvature.
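
Concretely, AdaHessian keeps an Adam-like update structure but builds the preconditioner from estimates of the Hessian diagonal rather than from squared gradients. As a simplified sketch (notation condensed from the paper; $D_t$ denotes the smoothed, block-averaged Hessian diagonal estimate, $k$ the Hessian power, and hats the usual bias corrections):

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, D_t \odot D_t, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\left(\sqrt{\hat v_t}\,\right)^{k} + \epsilon}.
$$

With $k = 1$ and $D_t$ replaced by the gradient, this reduces to Adam; the exact bias corrections and placement of $\epsilon$ in the paper may differ slightly from this sketch.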

Approach and Methodology

The core innovation of AdaHessian lies in three computational strategies:

  1. Hessian Diagonal Approximation:

    AdaHessian employs a Hutchinson-based method to approximate the Hessian diagonal, capturing essential second-order information without significant computational overhead. Treating the Hessian as a diagonal operator keeps both storage and computation cheap (see the code sketch after this list).

    Figure 1: Illustration of the diagonal Hessian estimation with Hutchinson's method.

  2. Root-Mean-Square Exponential Moving Average:

    To smooth out variations of the Hessian diagonal across iterations, AdaHessian incorporates an RMS-based exponential moving average. This mitigates the effect of noisy local curvature estimates, as illustrated by the local-versus-global curvature analysis in Figure 2.

    Figure 2: Local versus global curvature. Demonstrates local curvature's noise impact and the stabilizing effect of exponential moving averages.

  3. Block Diagonal Averaging:

    Variance in the Hessian diagonal is further reduced through block diagonal averaging, which averages the diagonal over blocks of parameters (for example, the spatial dimensions of convolutional kernels) to produce more stable curvature estimates (Figure 3).

    Figure 3: Block size averaging in AdaHessian reduces spatial variance in the Hessian diagonal.
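
The three components combine into a single update rule. Below is a minimal PyTorch sketch of an AdaHessian-style step, written from the description above rather than from the authors' released implementation; the helper names (`hutchinson_diag_estimate`, `block_average`, `adahessian_step`) and the default hyperparameter values are illustrative assumptions.

```python
import torch

def hutchinson_diag_estimate(params, grads, n_samples=1):
    """(i) Hutchinson estimate of the Hessian diagonal: E[z * (Hz)] with
    Rademacher z, computed as Hessian-vector products through autograd.
    `grads` must come from autograd.grad(loss, params, create_graph=True)."""
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # +/-1 entries
        hzs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hz in zip(diag, zs, hzs):
            d += z * hz / n_samples
    return diag

def block_average(d):
    """(iii) Block averaging: for conv kernels of shape (out, in, kH, kW), replace
    each kernel's diagonal entries with their spatial mean to reduce variance."""
    if d.dim() == 4:
        return d.abs().mean(dim=[2, 3], keepdim=True).expand_as(d)
    return d.abs()

def adahessian_step(params, grads, state, lr=0.15, betas=(0.9, 0.999),
                    eps=1e-4, hessian_power=1.0):
    """One AdaHessian-style parameter update (a sketch, not the official optimizer)."""
    diag = hutchinson_diag_estimate(params, grads)
    state['step'] += 1
    b1, b2 = betas
    bc1 = 1 - b1 ** state['step']   # Adam-style bias corrections
    bc2 = 1 - b2 ** state['step']
    for p, g, d, m, v in zip(params, grads, diag, state['m'], state['v']):
        d = block_average(d)                       # (iii) spatial block averaging
        m.mul_(b1).add_(g.detach(), alpha=1 - b1)  # gradient EMA (first moment)
        v.mul_(b2).addcmul_(d, d, value=1 - b2)    # (ii) RMS EMA of the Hessian diagonal
        denom = (v / bc2).sqrt().pow(hessian_power).add_(eps)
        with torch.no_grad():
            p.addcdiv_(m / bc1, denom, value=-lr)  # preconditioned step
```

A hypothetical training step would compute `grads = torch.autograd.grad(loss, params, create_graph=True)` (so a second backward pass can form the Hessian-vector products), initialize `state = {'step': 0, 'm': [torch.zeros_like(p) for p in params], 'v': [torch.zeros_like(p) for p in params]}`, and then call `adahessian_step(params, grads, state)` once per iteration.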

Experimental Results

The efficacy of AdaHessian is demonstrated across a range of tasks including CV, NLP, and recommendation systems, outperforming existing optimization methods like Adam and AdamW:

  • Image Classification:
    • On Cifar10, AdaHessian achieves 1.80%/1.45% higher accuracy than Adam on ResNet20/32, and on ImageNet it achieves 5.55% higher accuracy.

Figure 4: Gradient descent vs. AdaHessian on a 2D function.
Figure 5: Accuracy curves on the Criteo dataset, indicating AdaHessian's performance advantage.

  • Natural Language Processing:
    • AdaHessian improves BLEU scores over AdamW by 0.13/0.33 on the IWSLT14/WMT14 translation tasks, and perplexity by 2.7/1.0 on PTB/Wikitext-103.
  • Recommendation Systems:
    • Used to train DLRM on the Criteo Ad Kaggle dataset, AdaHessian achieves a 0.032% better score than Adagrad, a notable margin for this task.

Implications and Future Work

AdaHessian represents a promising advance in the efficiency and accuracy of NN training. Incorporating second-order information offers a robust alternative to prevalent first-order methods, at a per-iteration cost the authors show to be comparable to theirs.

The results suggest broad applicability and motivate further exploration of adaptive methods that leverage second-order information without substantial computational trade-offs. Future work could refine the approach and further reduce its overhead, increasing its feasibility for industrial-scale applications.

Conclusion

AdaHessian represents a significant contribution to the optimization landscape, effectively bridging gaps left by first-order methods through adaptive second-order techniques. Its robustness and efficiency across varied tasks underline the potential for broader adoption in ML, encouraging continued research into similar methodologies. The public availability of the AdaHessian code supports its integration within existing ML workflows, fostering innovation and exploration in complex optimization contexts.
