Practical Gauss-Newton Optimisation for Deep Learning

Published 12 Jun 2017 in stat.ML | (1706.03662v2)

Abstract: We present an efficient block-diagonal approximation to the Gauss-Newton matrix for feedforward neural networks. Our resulting algorithm is competitive against state-of-the-art first order optimisation methods, with sometimes significant improvement in optimisation performance. Unlike first-order methods, for which hyperparameter tuning of the optimisation parameters is often a laborious process, our approach can provide good performance even when used with default settings. A side result of our work is that for piecewise linear transfer functions, the network objective function can have no differentiable local maxima, which may partially explain why such transfer functions facilitate effective optimisation.

Summary

  • The paper introduces a recursive block-diagonal approximation to the Gauss-Newton matrix for efficient second-order optimization of feedforward neural networks.
  • A significant finding reveals that neural networks with piecewise linear activations lack differentiable local maxima, potentially simplifying optimization.
  • Empirical validation shows the method can perform competitively or better than well-tuned state-of-the-art first-order methods on standard benchmarks.

Practical Gauss-Newton Optimisation for Deep Learning

The paper "Practical Gauss-Newton Optimisation for Deep Learning" presents a novel approach to optimization of feedforward neural networks, leveraging a block-diagonal approximation to the Gauss-Newton (GN) matrix. The authors aim to improve upon traditional first-order methods by utilizing an efficient second-order optimization technique, which inherently considers the curvature information of the error surface.

Key Contributions and Findings

  1. Recursive Block-Diagonal Approximation: The paper derives a recursive computation of a block-diagonal approximation to the Gauss-Newton matrix, where each block corresponds to an individual layer of the feedforward network. The recursion allows the blocks to be computed efficiently within a single backward pass and then inverted cheaply, so the approach remains computationally tractable while maintaining competitive optimization performance compared to first-order methods (a minimal sketch of such a per-layer update follows this list).
  2. Piecewise Linear Neural Networks: A significant insight, derived as a corollary of the recursive Hessian computation, is that networks employing piecewise linear activation functions have error surfaces with no differentiable local maxima. Because the second derivative of a piecewise linear activation is zero wherever it is defined, the corresponding term in the recursion vanishes and the per-layer pre-activation Hessian blocks are positive semi-definite (PSD). This property may partially explain why such activation functions facilitate effective optimization.
  3. Relation to KFAC: The authors draw parallels between their method and the Kronecker-Factored Approximate Curvature (KFAC) method, which uses a block-diagonal approximation of the Fisher matrix. They highlight the distinction that KFAC requires the network to define a probabilistic model over its output, whereas their method directly approximates the GN matrix and therefore applies more broadly. Furthermore, the KFAC and GN approaches diverge for models outside the exponential family.
  4. Empirical Validation: Through experiments on standard benchmarks, the paper demonstrates that the approach, without extensive hyperparameter tuning, performs well and sometimes outperforms well-tuned state-of-the-art first-order methods such as Adam. A particular practical advantage is that good performance is obtained even with default settings, avoiding the laborious hyperparameter search that first-order methods often require.
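
The sketch below illustrates the flavour of such a per-layer, block-diagonal GN update on a two-layer ReLU network with a squared-error loss. It is a minimal NumPy sketch of the general idea (Kronecker-factored per-layer curvature blocks with a damped inverse applied to the gradient), not the paper's exact algorithm; the layer sizes, damping and learning rate are illustrative assumptions.

```python
import numpy as np

# Toy setup: batch of random data and a two-layer ReLU network (illustrative sizes).
rng = np.random.default_rng(0)
N, d_in, d_hid, d_out = 32, 10, 20, 5
X = rng.normal(size=(d_in, N))
Y = rng.normal(size=(d_out, N))
W1 = rng.normal(scale=0.1, size=(d_hid, d_in))
W2 = rng.normal(scale=0.1, size=(d_out, d_hid))

# Forward pass (linear output layer) and gradients of the batch-averaged squared error.
h1 = W1 @ X
a1 = np.maximum(h1, 0.0)
h2 = W2 @ a1
d2 = (h2 - Y) / N                      # dE/dh2
d1 = (W2.T @ d2) * (h1 > 0)            # dE/dh1 (ReLU derivative is 0 or 1)
gW1, gW2 = d1 @ X.T, d2 @ a1.T         # dE/dW1, dE/dW2

# Pre-activation GN factors, propagated backwards through the network.
# For squared error, the output-layer factor is the identity.
G2 = np.eye(d_out)
# Batch-averaged recursion for the hidden layer:
#   G1 = E_n[ diag(relu'(h1_n)) W2^T G2 W2 diag(relu'(h1_n)) ]
G1 = np.zeros((d_hid, d_hid))
for n in range(N):
    B = np.diag((h1[:, n] > 0).astype(float))
    G1 += B @ W2.T @ G2 @ W2 @ B
G1 /= N

# Kronecker factors built from the inputs to each layer.
Q1 = X @ X.T / N                       # E[a0 a0^T]
Q2 = a1 @ a1.T / N                     # E[a1 a1^T]

# Damped block-diagonal GN update: dW = (G + lam I)^-1 grad (Q + lam I)^-1.
lam, lr = 1e-2, 1.0                    # illustrative damping and step size
dW1 = np.linalg.solve(G1 + lam * np.eye(d_hid), gW1) @ np.linalg.inv(Q1 + lam * np.eye(d_in))
dW2 = np.linalg.solve(G2 + lam * np.eye(d_out), gW2) @ np.linalg.inv(Q2 + lam * np.eye(d_hid))
W1 -= lr * dW1
W2 -= lr * dW2
```

Because each layer's curvature block factors into two small matrices (one over the layer's inputs, one over its pre-activations), the cost of inverting the preconditioner scales with the layer dimensions rather than with the total number of parameters, which is what makes the block-diagonal approach practical.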

Implications and Future Directions

The findings point to a promising direction for second-order optimization in deep learning, specifically toward more efficient approximations of the GN matrix. Such methods can accelerate training and potentially achieve better convergence properties by exploiting richer curvature information.

From a theoretical perspective, the insight regarding the non-existence of differentiable local maxima in networks with piecewise linear activations could motivate further exploration into neural architectures and their optimization landscapes. This could influence future design choices for activation functions, promoting those with desired curvature properties to enhance model training efficiency.
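
To make the argument concrete, the toy check below (an illustrative sketch with random weights and random ReLU on/off patterns standing in for actual data) verifies numerically that, once the activation's second-derivative term vanishes, each per-layer pre-activation Hessian produced by the backward recursion is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)
d_hid, d_out = 20, 5
W2 = rng.normal(size=(d_out, d_hid))
H2 = np.eye(d_out)                     # Hessian of a convex (squared-error) loss w.r.t. the outputs: PSD

for _ in range(100):
    # Random ReLU activation pattern: relu'(h) is 0 or 1, and relu''(h) = 0 where defined,
    # so the additive second-derivative term in the Hessian recursion disappears.
    B = np.diag(rng.integers(0, 2, d_hid).astype(float))
    H1 = B @ W2.T @ H2 @ W2 @ B        # congruence transform of a PSD matrix, hence PSD
    assert np.linalg.eigvalsh(H1).min() > -1e-8   # non-negative spectrum up to round-off
```

Since the spectrum never drops below zero (up to numerical round-off), negative curvature in every direction is impossible at such points, consistent with the paper's corollary that these objectives admit no differentiable local maxima.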

Future developments could focus on extending these second-order approximation techniques to more diverse neural network architectures beyond feedforward networks, such as convolutional or recurrent neural networks. Additionally, investigating the integration of such methods into large-scale, parallel computation frameworks could offer considerable gains in practical deployment scenarios.

Conclusion

The paper presents a compelling case for revisiting second-order optimization techniques within deep learning through the lens of block-diagonal approximations of the Gauss-Newton matrix. By demonstrating practical efficiency alongside theoretical insight, it lays the groundwork for more refined optimization strategies that exploit the structure of neural network error surfaces to improve both convergence speed and final accuracy.
