- The paper introduces Variance of Knowledge of Deep Network Weights (VKDNW), a novel training-free proxy for estimating network accuracy based on Fisher Information to overcome the computational cost of traditional Neural Architecture Search.
- VKDNW measures the diversity of Fisher Information Matrix eigenvalues, which indicates how easily optimal network weights can be estimated, providing information orthogonal to network size.
- Evaluations show VKDNW achieves state-of-the-art results among training-free methods, particularly when assessed with Normalized Discounted Cumulative Gain (nDCG), and performs effectively even with randomly generated input data.
This paper introduces a novel training-free Neural Architecture Search (NAS) method called Variance of Knowledge of Deep Network Weights (VKDNW). The primary goal is to overcome the significant computational cost associated with traditional NAS, which requires training numerous candidate network architectures from scratch.
Here's a breakdown of the key aspects of the paper:
Problem: Traditional NAS is computationally expensive because it needs to train each candidate architecture to evaluate its performance.
Proposed Solution: The authors propose a training-free proxy for image classification accuracy based on Fisher Information. This proxy, VKDNW, allows estimating the expected image classification accuracy of a deep network without actually training it.
Methodology:
- Fisher Information Theory: The method builds on Fisher Information theory, analyzing the Fisher Information Matrix (FIM) of the network weights. The FIM provides information about the difficulty of estimating the network's optimal weights. The authors frame network architecture search in terms of how easily the optimal network weights can be estimated.
- VKDNW Proxy: VKDNW measures the diversity of the FIM eigenvalues. A high VKDNW indicates that the uncertainty in estimating the weights is similar across directions in weight space, i.e., no particular weight directions influence the network's predictions significantly more than others. This is interpreted as a more "balanced" and potentially better-performing network.
- Efficient FIM Estimation: The authors address the computational challenges of calculating the FIM for large networks. This includes overcoming numerical instability and making the eigenvalue computation tractable.
- Ranking: VKDNW is combined with a proxy for network size (number of layers with weights) to rank networks for NAS.
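The pipeline above can be illustrated with a minimal numpy sketch. This is not the authors' algorithm: the per-sample gradients are simulated rather than backpropagated through a real architecture, the empirical FIM is formed as a plain average of gradient outer products, and the entropy of the normalized spectrum is used as one plausible "eigenvalue diversity" measure; the paper's efficient, numerically stable estimator is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: per-sample gradients of the log-likelihood w.r.t. the
# network weights (n_samples x n_params). In practice these would come
# from backpropagation through the candidate architecture.
n_samples, n_params = 256, 32
grads = rng.normal(size=(n_samples, n_params))

# Empirical Fisher Information Matrix: average outer product of
# per-sample gradients.
fim = grads.T @ grads / n_samples

# Eigenvalue spectrum of the FIM (symmetric PSD, so eigvalsh applies).
eigvals = np.linalg.eigvalsh(fim)
eigvals = np.clip(eigvals, 1e-12, None)  # guard against numerical noise

# One illustrative "diversity" score: entropy of the normalized spectrum.
# A flat spectrum (all weight directions equally easy to estimate)
# maximizes it; a few dominant directions shrink it.
p = eigvals / eigvals.sum()
diversity = -(p * np.log(p)).sum()

print(round(float(diversity), 4))
```

In a NAS setting, such a score would be computed per candidate architecture at initialization and combined with the size proxy to produce the final ranking.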
Evaluation Metric: The authors propose using Normalized Discounted Cumulative Gain (nDCG) in addition to standard metrics like Kendall's τ and Spearman's ρ. They argue that nDCG is more relevant for NAS because it focuses on the ability of the proxy to identify good networks, rather than penalizing inaccuracies for less interesting (low-performing) networks.
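The intuition behind nDCG can be made concrete with a small sketch. This is a minimal linear-gain variant (relevance = true accuracy, ranked by the proxy score); the paper may use a different gain or discount function, and the five networks and two proxies below are hypothetical.

```python
import numpy as np

def ndcg(true_acc, proxy_score, k=None):
    """nDCG of a proxy ranking: relevance is the true accuracy,
    items are ordered by the proxy's score."""
    true_acc = np.asarray(true_acc, dtype=float)
    order = np.argsort(-np.asarray(proxy_score))  # proxy's ranking
    ideal = np.sort(true_acc)[::-1]               # perfect ranking
    k = len(true_acc) if k is None else k
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (true_acc[order][:k] * discounts).sum()
    idcg = (ideal[:k] * discounts).sum()
    return dcg / idcg

# Five hypothetical networks with known accuracies, scored by two proxies:
acc = [0.93, 0.90, 0.70, 0.60, 0.55]
good_proxy = [5, 4, 3, 2, 1]   # ranks the best networks first
bad_proxy = [1, 2, 3, 4, 5]    # inverted ranking

print(round(ndcg(acc, good_proxy), 4))  # 1.0: perfect ordering
print(round(ndcg(acc, bad_proxy), 4))   # < 1.0
```

Because the discount decays logarithmically with rank position, mistakes among the top-ranked networks cost far more than mistakes in the tail, which matches the NAS goal of identifying good architectures rather than ordering poor ones.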
Contributions:
- A novel algorithm for estimating the Fisher Information Matrix spectrum for large deep networks, addressing numerical stability issues.
- A new principled VKDNW proxy for image classification accuracy based on Fisher Information, capturing uncertainty in weight estimation. The proxy is designed to be independent of network size, providing information orthogonal to model capacity.
- A proposal to use Normalized Discounted Cumulative Gain (nDCG) as a more relevant evaluation metric for TF-NAS proxies, focusing on identifying good networks.
Experiments and Results:
- The method is evaluated on three public datasets (CIFAR-10, CIFAR-100, and ImageNet16-120) and in two search spaces (NAS-Bench-201 and MobileNetV2).
- VKDNW achieves state-of-the-art results compared to other training-free NAS methods, particularly when evaluated using nDCG.
- Ablation studies demonstrate the orthogonality of VKDNW to network size and the importance of its components.
- Experiments show that the method performs well using randomly generated input data, making it applicable even when real datasets are unavailable.
Related Work: The paper discusses various existing zero-shot NAS methods, highlighting the advantages of VKDNW over them.
Conclusion: The paper concludes that VKDNW provides a promising approach to training-free NAS, offering strong theoretical foundations and state-of-the-art performance. The method provides information orthogonal to network size, leading to a zero-cost ranking in which the contributions of network size and architecture feasibility are separated. It also advocates for the use of nDCG as a more appropriate metric for evaluating NAS proxies.
The supplementary material elaborates on the theoretical arguments, including the Cramer-Rao bound and Natural Gradient Descent, and provides additional experimental results and ablations.