A Kronecker-factored approximate Fisher matrix for convolution layers (1602.01407v2)

Published 3 Feb 2016 in stat.ML and cs.LG

Abstract: Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large models, and most approximations either require an expensive iterative procedure or make crude approximations to the curvature. We present Kronecker Factors for Convolution (KFC), a tractable approximation to the Fisher matrix for convolutional networks based on a structured probabilistic model for the distribution over backpropagated derivatives. Similarly to the recently proposed Kronecker-Factored Approximate Curvature (K-FAC), each block of the approximate Fisher matrix decomposes as the Kronecker product of small matrices, allowing for efficient inversion. KFC captures important curvature information while still yielding comparably efficient updates to stochastic gradient descent (SGD). We show that the updates are invariant to commonly used reparameterizations, such as centering of the activations. In our experiments, approximate natural gradient descent with KFC was able to train convolutional networks several times faster than carefully tuned SGD. Furthermore, it was able to train the networks in 10-20 times fewer iterations than SGD, suggesting its potential applicability in a distributed setting.

Citations (248)

Summary

  • The paper introduces KFC, which approximates each block of the Fisher matrix for convolution layers as a Kronecker product of small matrices, enabling efficient natural gradient descent.
  • It demonstrates that KFC trains CNNs in 10–20 times fewer iterations than carefully tuned SGD with momentum, yielding several-fold wall-clock speedups.
  • The approach preserves key invariance properties under reparameterizations, underpinning both its theoretical justification and practical robustness.

Analysis of "A Kronecker-factored approximate Fisher matrix for convolution layers"

The paper presents Kronecker Factors for Convolution (KFC), a novel method for efficiently approximating the Fisher matrix in convolutional neural networks (CNNs). Building on the existing Kronecker-Factored Approximate Curvature (K-FAC) method, KFC addresses the challenges of applying second-order optimization to convolutional layers, which are prevalent in modern neural network architectures but not handled by earlier methods focusing on fully connected layers.

Key Contributions

KFC introduces a tractable approximation to the Fisher matrix specific to convolutional layers. The underlying probabilistic model assumes independence between activations and derivatives, spatial homogeneity, and derivatives that are uncorrelated across spatial locations. Under these assumptions, each layer's block of the Fisher matrix decomposes as the Kronecker product of two small matrices: one built from second moments of the layer's input patches and one from second moments of the backpropagated derivatives. Because only these small factors need to be stored and inverted, rather than the full block, the approximate natural gradient can be computed efficiently.
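
To make this factorization concrete, the NumPy sketch below estimates the two Kronecker factors for a single convolution layer and applies the corresponding approximate natural-gradient update. The function names, array shapes, and damping constant are illustrative assumptions rather than the paper's reference implementation.

    import numpy as np

    def kfc_factors(patches, grad_out, damping=1e-3):
        """Estimate the two Kronecker factors of a conv layer's Fisher block.

        patches:  (N*T, J) im2col-style input patches, one row per (example,
                  spatial location); J = in_channels * kernel_h * kernel_w.
        grad_out: (N*T, I) derivatives of the loss w.r.t. the layer's
                  pre-activations at the matching locations; I = out_channels.
        """
        n = patches.shape[0]
        # Second moment of input patches; spatial homogeneity lets all
        # locations be pooled into a single J x J factor.
        omega = patches.T @ patches / n
        # Second moment of backpropagated derivatives, treated as
        # uncorrelated across spatial locations, giving an I x I factor.
        gamma = grad_out.T @ grad_out / n
        # Tikhonov damping so both factors are safely invertible.
        omega += damping * np.eye(omega.shape[0])
        gamma += damping * np.eye(gamma.shape[0])
        return omega, gamma

    def kfc_natural_gradient(grad_w, omega, gamma):
        """Apply (omega kron gamma)^-1 to the gradient of the (I x J) weight
        matrix via vec(gamma^-1 G omega^-1) = (omega^-1 kron gamma^-1) vec(G)."""
        return np.linalg.solve(gamma, grad_w) @ np.linalg.inv(omega)

    # Toy usage: 32 examples, 8x8 spatial locations, 3x3 kernels over 3 channels.
    rng = np.random.default_rng(0)
    patches = rng.standard_normal((32 * 64, 3 * 3 * 3))   # J = 27
    grad_out = rng.standard_normal((32 * 64, 16))         # I = 16
    omega, gamma = kfc_factors(patches, grad_out)
    update = kfc_natural_gradient(rng.standard_normal((16, 27)), omega, gamma)

Working with the two small factors (27 x 27 and 16 x 16 here) replaces inverting a 432 x 432 Fisher block, which is where the efficiency gain comes from.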

The paper demonstrates that KFC retains invariance properties of natural gradient descent under certain reparameterizations, such as affine transformations of the activations applied before or after the nonlinearities. Because the updates are unaffected by these transformations, KFC implicitly confers the benefits of popular preprocessing tricks such as whitening and centering the activations without applying them explicitly.
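
The flavor of this invariance can be sketched as follows; the notation here is ours and a schematic paraphrase rather than the paper's formal theorem statement. Suppose a layer's activations are reparameterized by an invertible affine map applied identically at every spatial location t:

    \[ \tilde{a}^{(\ell)}_{t} = U^{(\ell)} a^{(\ell)}_{t} + c^{(\ell)}, \qquad U^{(\ell)} \text{ invertible}. \]

If the transformed network's weights are initialized so that it computes the same function as the original, then training both with KFC updates keeps them functionally identical at every iteration; centering (choosing c^{(ℓ)} = −E[a^{(ℓ)}_t]) and whitening (choosing U^{(ℓ)} to be a whitening matrix) are special cases of such reparameterizations.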

Numerical Results and Implications

Experimental validation shows that using KFC for approximate natural gradient descent can train CNNs several times faster than carefully tuned stochastic gradient descent (SGD) with momentum, requiring 10-20 times fewer training iterations. These results suggest that KFC can be particularly advantageous for distributed training environments, where reducing the number of iterations corresponds to less communication overhead and faster convergence.

KFC's superior iteration efficiency opens new avenues for distributed deep learning, aligning with the trend of leveraging large datasets and extensive computational resources. The method's compatibility with large mini-batch sizes further suggests practical benefits for scaling deep learning frameworks.

Theoretical Evaluation

The probabilistic approximations behind KFC are both motivated theoretically and evaluated empirically. The assumption of spatially uncorrelated derivatives holds well for the networks examined, in part because max-pooling tends to sparsify the backpropagated gradients. The paper does not explore correlations between activations as extensively, and it acknowledges that the approximation could degrade when activations exhibit strong spatial correlations.
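
As a rough way to probe the uncorrelated-derivatives assumption on one's own network, one could measure the correlation between backpropagated derivatives at neighboring spatial locations. The sketch below is a simple diagnostic of that kind; the array layout and the pooling over examples and channels are our assumptions, not the paper's evaluation protocol.

    import numpy as np

    def spatial_derivative_correlation(grad_maps, offset=(0, 1)):
        """Correlation between derivatives at spatial locations separated by
        `offset`, pooled over examples and channels.

        grad_maps: (N, C, H, W) derivatives of the loss w.r.t. a layer's
                   pre-activation feature maps.
        """
        dy, dx = offset
        h, w = grad_maps.shape[2], grad_maps.shape[3]
        # Pair each location with the one shifted by (dy, dx).
        a = grad_maps[:, :, : h - dy, : w - dx]
        b = grad_maps[:, :, dy:, dx:]
        a = a.reshape(-1) - a.mean()
        b = b.reshape(-1) - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

Values near zero at nonzero offsets are consistent with the assumption; large values would indicate that the approximation is being stressed.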

Impact and Future Directions

By extending K-FAC to convolution layers, KFC enhances the applicability of second-order optimization methods across a broader range of architectures and tasks. This work could catalyze further research on efficient approximations for other neural network components, such as recurrent layers or attention mechanisms, extending second-order methods' benefits throughout various neural architectures.

Future research might focus on integrating KFC with other advanced optimization techniques or exploring its synergy with architectural modifications like batch normalization. Such combinations could further stabilize training and enhance generalization, especially in modern, highly intricate neural network designs.

In conclusion, KFC is a promising advancement for optimizing convolutional neural networks, providing a foundation for more efficient and scalable training practices. As computational resources become increasingly leveraged in distributed settings, methods like KFC will be integral to capitalizing on these opportunities.