M-FAC: Efficient Matrix-Free Approximations of Second-Order Information (2107.03356v5)

Published 7 Jul 2021 in cs.LG

Abstract: Efficiently approximating local curvature information of the loss function is a key tool for optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [9] and [17].

Citations (53)

Summary

Efficient Matrix-Free Approximations of Second-Order Information

The paper presents two algorithms for efficiently computing Inverse-Hessian Vector Products (IHVPs) in the setting where the Hessian of a deep neural network's loss function can be approximated as a sum of rank-one matrices. This structure, known as the empirical Fisher approximation, provides a feasible alternative to working with the Hessian directly, which is impractical for large-scale networks because storing a full $d \times d$ matrix is prohibitive.
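Concretely, the rank-one structure in question can be written as follows, in the abstract's notation, with $d$ the model dimension and $m$ the number of gradient samples; the damping term $\lambda I_d$ is a common practical addition assumed here rather than something stated in this summary:

$$\widehat{F} \;=\; \lambda I_d \;+\; \frac{1}{m}\sum_{i=1}^{m} \nabla \ell_i(w)\,\nabla \ell_i(w)^{\top},$$

where each $\nabla \ell_i(w) \in \mathbb{R}^d$ is a per-sample (or per-batch) gradient.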

Key Contributions

The first algorithm, designed for static settings, primarily addresses the computational challenges associated with neural network pruning. By leveraging the empirical Fisher matrix, the authors propose a method that performs an $O(dm^2)$ precomputation and subsequently computes IHVPs in $O(dm)$ time, where $d$ denotes the model's dimensionality and $m$ represents the number of gradient samples used for the approximation. The static algorithm also allows efficient querying of individual elements of the inverse Hessian, with a query cost of $O(m)$.
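To make the stated costs concrete, here is a minimal NumPy sketch of the static case, based on the standard Sherman-Morrison recursion for a damped empirical Fisher matrix. It matches the $O(dm^2)$ / $O(dm)$ / $O(m)$ complexities above but is not the paper's optimized, block-wise implementation; the damping constant and function names are illustrative assumptions.

```python
import numpy as np

def precompute(grads, damping=1e-4):
    """O(d m^2) precomputation for the damped empirical Fisher
    F = damping * I + (1/m) * sum_k g_k g_k^T, with grads of shape (m, d).
    Returns the Sherman-Morrison vectors u_k = F_{k-1}^{-1} g_k and the
    scalar denominators m + g_k^T u_k."""
    m, d = grads.shape
    us = np.empty((m, d))
    denoms = np.empty(m)
    for k in range(m):
        u = grads[k] / damping                      # (damping * I)^{-1} g_k
        for j in range(k):                          # fold in earlier rank-one terms
            u -= us[j] * (grads[j] @ u) / denoms[j]
        us[k] = u
        denoms[k] = m + grads[k] @ u
    return us, denoms

def ihvp(x, grads, us, denoms, damping=1e-4):
    """Compute F^{-1} x in O(d m) by replaying the m rank-one corrections."""
    v = x / damping
    for k in range(len(us)):
        v -= us[k] * (grads[k] @ v) / denoms[k]
    return v

def inv_hessian_entry(i, j, us, denoms, damping=1e-4):
    """Query a single entry [F^{-1}]_{ij} in O(m), using
    F^{-1} = (1/damping) * I - sum_k u_k u_k^T / denoms[k]."""
    entry = (1.0 / damping) if i == j else 0.0
    return entry - np.sum(us[:, i] * us[:, j] / denoms)
```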

The second algorithm adapts this framework for dynamic settings, specifically in the context of optimization tasks requiring preconditioned stochastic gradient descent (SGD). By continuously updating the inverse Hessian estimate, the method computes IHVPs over a sliding window of recent optimization steps. The complexity for computing IHVPs dynamically is $O(dm + m^2)$, with an additional cost of $O(dm + m^3)$ for refreshing gradient information through the sliding window mechanism.
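The sliding-window variant can be sketched in a similar spirit. The class below is a hedged illustration that maintains the last $m$ gradients and applies the Woodbury identity to the damped empirical Fisher; it reproduces the stated asymptotics (re-factorizing an $m \times m$ system at $O(m^3)$ per window update, $O(dm + m^2)$ per IHVP) but is not the paper's exact recursion, and the class and parameter names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

class SlidingWindowPreconditioner:
    """Sliding-window empirical-Fisher preconditioner
    F = damping * I_d + (1/m) * G^T G, where the rows of G are the m most
    recent gradients. Uses the Woodbury identity:
      F^{-1} x = x/damping - G^T (m*I_m + G G^T/damping)^{-1} (G x) / damping^2.
    """

    def __init__(self, m, d, damping=1e-4):
        self.m, self.damping = m, damping
        self.G = np.zeros((m, d))   # gradient window, one gradient per row
        self.S = np.zeros((m, m))   # S = G @ G.T, maintained incrementally
        self.slot = 0               # index of the oldest gradient

    def update(self, grad):
        """Replace the oldest gradient: O(dm) to refresh one row/column of S,
        plus O(m^3) to re-factorize the m x m inner system."""
        k = self.slot
        self.G[k] = grad
        col = self.G @ grad
        self.S[k, :] = col
        self.S[:, k] = col
        inner = self.m * np.eye(self.m) + self.S / self.damping
        self.chol = cho_factor(inner)       # reused by every ihvp() call
        self.slot = (k + 1) % self.m

    def ihvp(self, x):
        """Return F^{-1} x: O(dm) for the two products with G,
        O(m^2) for the Cholesky solves."""
        Gx = self.G @ x
        y = cho_solve(self.chol, Gx)
        return x / self.damping - (self.G.T @ y) / self.damping ** 2
```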

Numerical Results and Practical Implications

Empirical evaluations demonstrate the efficacy of these matrix-free algorithms in both network pruning and optimization tasks. The pruning experiments produce state-of-the-art results on well-established benchmarks such as ImageNet, using popular models like ResNet50 and MobileNet. Notably, because the per-gradient overhead is low, the method can afford higher settings of the approximation parameters, in particular larger numbers of gradient samples $m$, than existing second-order approaches, enabling fine-grained pruning of sparse models at substantially lower computational cost. This efficiency allows scanning more configurations within feasible memory and time budgets, highlighting practical advantages in real-world applications.

In optimization, the dynamic IHVP computation complements SGD-based training, offering accuracy competitive with leading optimizers while imposing only marginal computational overhead. On both convolutional networks and Transformer models, the reported results establish these algorithms as viable alternatives to more computationally intensive second-order methods such as K-FAC.
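As a usage illustration, preconditioned SGD with the sliding-window sketch above would look roughly as follows after a warm-up of $m$ steps fills the window; `compute_gradient`, `w`, `lr`, `num_steps`, and the window size are hypothetical placeholders rather than the paper's training setup.

```python
# Hypothetical training loop using the SlidingWindowPreconditioner sketched above.
precond = SlidingWindowPreconditioner(m=32, d=w.size, damping=1e-4)
for step in range(num_steps):
    g = compute_gradient(w)           # flattened gradient, shape (d,)
    precond.update(g)                 # slide the window: O(dm + m^3)
    w = w - lr * precond.ihvp(g)      # preconditioned SGD step: O(dm + m^2)
```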

Theoretical Impact and Future Directions

The matrix-free nature of the proposed algorithms underscores a significant advancement in computational optimization within deep learning frameworks. They address pressing scalability issues in utilizing second-order information, which is pivotal for various applications, including compression and adaptive optimization. Through a rigorous breakdown of algorithmic complexity and insightful evaluations, the paper provides a comprehensive view of leveraging Hessian approximations under realistic constraints.

Looking forward, potential research avenues could expand this methodology across broader model architectures, further refining approximation techniques. The exploration of gradient compression and parallel gradient storage indicates fruitful paths for improving algorithm scalability in distributed systems, reducing latency and memory demands. Additionally, formalizing a hybrid model combining dynamic and static elements could optimize performance on rapidly evolving tasks.

In conclusion, this paper contributes valuable theoretical insights and practical algorithms to efficiently compute IHVPs, setting a benchmark in the field of matrix-free methods for deep learning optimization and compression.
