- The paper establishes a structured benchmark by evaluating eight NAS methods across five vision datasets using a random architecture baseline.
- The study finds that many NAS methods achieve minimal gains over random baselines, indicating that training protocols often drive performance more than architecture innovations.
- The analysis reveals that a network's hand-designed macro-structure influences outcomes more than its micro-structure does, underscoring the need for more reproducible and expressive NAS evaluations.
Insights into Neural Architecture Search Evaluation Challenges
The paper "NAS Evaluation is Frustratingly Hard" critically analyzes the evaluation process of Neural Architecture Search (NAS) strategies, exposing several challenges that hamper fair comparison and assessment of different methods. Authored by Antoine Yang, Pedro M. Esperança, and Fabio Maria Carlucci, the paper provides a comprehensive benchmark of eight NAS methods over five datasets, revealing underlying complexities that complicate the evaluation landscape.
Overview of Key Contributions
The paper's primary contribution lies in establishing a structured benchmark for NAS evaluation. The authors scrutinize eight NAS strategies (DARTS, StacNAS, PDARTS, MANAS, CNAS, NSGANET, ENAS, and NAO) across five distinct computer vision datasets: CIFAR10, CIFAR100, SPORT8, MIT67, and FLOWERS102. Their key methodological proposal is to measure a method's relative improvement over the average accuracy of architectures sampled at random from the same search space, which strips away biases stemming from manually crafted search spaces and training protocols (see the sketch below).
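The relative-improvement metric itself is simple: the paper reports the percentage gain of the searched architecture's accuracy over the mean accuracy of randomly sampled architectures from the same space. A minimal sketch follows; the function name is mine, not from the authors' code:

```python
def relative_improvement(acc_method: float, acc_random: float) -> float:
    """Relative improvement (in %) of a searched architecture over the
    mean accuracy of architectures sampled at random from the same space."""
    return 100.0 * (acc_method - acc_random) / acc_random

# Example: a searched model at 97.1% top-1 vs. a random-sampling mean of 96.7%
print(round(relative_improvement(97.1, 96.7), 2))  # 0.41 -- a small gain
```

A value near zero means the search strategy added little beyond what the search space and training protocol already provide, which is exactly the failure mode the benchmark is designed to expose.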
Significant Findings
Several intriguing findings emerge from this paper:
- Minimal Gains Over Baselines: A surprising result is that many NAS methods struggle to outperform the average architecture baseline. This raises questions about the real efficacy of the search strategies employed and suggests that architectural advancements might not always translate to significant performance improvements.
- Impact of Training Protocols: The paper highlights the substantial influence of training protocols on reported performance. Tricks such as Cutout, DropPath, and extended training schedules often contribute more to the final accuracy than the architectural innovations themselves (a minimal Cutout sketch follows this list).
- Macro-Structure vs. Micro-Structure: The analysis indicates that the hand-designed macro-structure of a network (how cells are stacked and connected) affects performance more than the micro-structure (the searched operations within each cell), challenging the operation-level focus of many NAS approaches (see the macro/micro sketch after this list).
- Depth-Gap Phenomenon: Architecture rankings change between shallow and deep instantiations of the same cells (e.g., 8 vs. 20 cells), so a cell that wins at search depth need not win at evaluation depth. This underscores that performance is sensitive to network depth, not only to search efficacy.
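To make the protocol effect concrete, here is a minimal Cutout-style transform in PyTorch, in the spirit of DeVries & Taylor's original technique; the class and its defaults are illustrative, not the authors' code:

```python
import torch

class Cutout:
    """Zero out a random square patch of an image tensor (C, H, W).

    A pure training-protocol trick: on CIFAR10, a single 16x16 patch
    is a typical setting.
    """
    def __init__(self, length: int = 16):
        self.length = length

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        _, h, w = img.shape
        # Pick a random patch center, then clamp the patch to image bounds.
        y = torch.randint(h, (1,)).item()
        x = torch.randint(w, (1,)).item()
        y1, y2 = max(0, y - self.length // 2), min(h, y + self.length // 2)
        x1, x2 = max(0, x - self.length // 2), min(w, x + self.length // 2)
        img = img.clone()
        img[:, y1:y2, x1:x2] = 0.0
        return img
```

Because a transform like this is independent of the searched architecture, the paper argues that results should be reported with and without it to isolate the search strategy's contribution.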
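The macro/micro distinction is also easy to see in code. In cell-based spaces, the macro-structure (how many cells are stacked and where reduction happens) is fixed by hand, while the search only touches the inside of each cell. Below is a schematic sketch under that assumption, with a hypothetical `cell_fn` standing in for whatever cell the search produced:

```python
import torch.nn as nn

def build_macro(cell_fn, num_cells: int = 20, init_channels: int = 36) -> nn.Sequential:
    """Stack searched cells into a hand-designed macro-structure:
    reduction cells (doubled channels) at 1/3 and 2/3 of the depth,
    normal cells elsewhere -- a pattern NAS papers typically fix by
    hand rather than search over."""
    layers, c = [], init_channels
    for i in range(num_cells):
        if i in (num_cells // 3, 2 * num_cells // 3):
            c *= 2
            layers.append(cell_fn(c, reduction=True))   # micro-structure: searched
        else:
            layers.append(cell_fn(c, reduction=False))  # micro-structure: searched
    return nn.Sequential(*layers)
```

Everything outside `cell_fn` here is expert knowledge baked into the pipeline, which is why the paper finds it can dominate the searched micro-structure.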
Implications for NAS Research
The findings pose considerable implications for the future development of NAS methodologies. Primarily, there is a call for more expressive search spaces that aren't unduly constrained by existing expert knowledge, potentially uncovering more innovative architectural solutions. Furthermore, reproducibility in NAS is underscored as a critical factor—authors should provide comprehensive details, including seeds and training protocols, to facilitate objective comparison and validation.
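As a concrete starting point for reproducibility, a release could pin every source of randomness in the pipeline. This is a generic sketch, not code from the paper:

```python
import random
import numpy as np
import torch

def set_reproducible(seed: int = 0) -> None:
    """Fix the RNG state everywhere a NAS pipeline draws randomness."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism in cuDNN convolutions.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```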
Recommendations for Best Practices
Toward mitigating identified pitfalls, the paper suggests the following best practices:
- Balanced Reporting: Researchers should report results both with and without augmentation tricks, so the search strategy's own contribution can be assessed fairly.
- Diverse Datasets for Evaluation: To prevent overfitting to specific datasets like CIFAR10, NAS methods should be evaluated across diverse tasks with varying complexities.
- Attention to Hyperparameter Tuning: The computational cost of hyperparameter tuning should be counted as part of the reported search cost, so that efficiency claims reflect the full budget (a simple accounting sketch follows this list).
- Ablation Studies: Comprehensive ablation studies can elucidate individual elements' contributions within the NAS pipeline, fostering a better understanding of core performance drivers.
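To illustrate the accounting point, one could tally every GPU-hour behind a headline result rather than only the final search run. The helper and the numbers below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class SearchCost:
    """Tally all GPU-hours behind a reported result, not just the search."""
    gpu_hours: dict = field(default_factory=dict)

    def log(self, phase: str, hours: float) -> None:
        self.gpu_hours[phase] = self.gpu_hours.get(phase, 0.0) + hours

    def total(self) -> float:
        return sum(self.gpu_hours.values())

cost = SearchCost()
cost.log("hyperparameter_tuning", 48.0)  # often omitted from reported cost
cost.log("architecture_search", 24.0)    # the number papers usually quote
cost.log("final_training", 36.0)
print(f"true search cost: {cost.total()} GPU-hours")  # 108.0
```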
Conclusion and Future Directions
In conclusion, the paper offers valuable insight into the complex landscape of NAS evaluation, highlighting pitfalls that can lead to misleading assessments of NAS progress. Future work might pursue more robust, task-agnostic NAS methods and holistic approaches that jointly consider search spaces, architectures, and training protocols. The research underscores the continuing need for methodical, transparent evaluation in the dynamic field of neural architecture search.