Hierarchical Neural Architecture Search for Deep Stereo Matching (2010.13501v1)

Published 26 Oct 2020 in cs.CV

Abstract: To reduce the human efforts in neural network design, Neural Architecture Search (NAS) has been applied with remarkable success to various high-level vision tasks such as classification and semantic segmentation. The underlying idea for the NAS algorithm is straightforward, namely, to enable the network the ability to choose among a set of operations (e.g., convolution with different filter sizes), one is able to find an optimal architecture that is better adapted to the problem at hand. However, so far the success of NAS has not been enjoyed by low-level geometric vision tasks such as stereo matching. This is partly due to the fact that state-of-the-art deep stereo matching networks, designed by humans, are already sheer in size. Directly applying the NAS to such massive structures is computationally prohibitive based on the currently available mainstream computing resources. In this paper, we propose the first end-to-end hierarchical NAS framework for deep stereo matching by incorporating task-specific human knowledge into the neural architecture search framework. Specifically, following the gold standard pipeline for deep stereo matching (i.e., feature extraction -- feature volume construction and dense matching), we optimize the architectures of the entire pipeline jointly. Extensive experiments show that our searched network outperforms all state-of-the-art deep stereo matching architectures and is ranked at the top 1 accuracy on KITTI stereo 2012, 2015 and Middlebury benchmarks, as well as the top 1 on SceneFlow dataset with a substantial improvement on the size of the network and the speed of inference. The code is available at https://github.com/XuelianCheng/LEAStereo.

Summary

The paper presents a task-specific hierarchical NAS framework that integrates human insights to efficiently design deep stereo matching architectures.
It jointly optimizes feature extraction and cost volume matching through a three-step pipeline, refining architecture search at both network and cell levels.
LEAStereo achieves state-of-the-art results on benchmarks like KITTI and Middlebury while significantly reducing computational load and parameter size.

Overview of "Hierarchical Neural Architecture Search for Deep Stereo Matching"

The paper "Hierarchical Neural Architecture Search for Deep Stereo Matching" by Xuelian Cheng et al. presents an approach to Neural Architecture Search (NAS) tailored to deep stereo matching. It bridges the gap between NAS and low-level geometric vision tasks, a domain where NAS had seen little use because searching over already large stereo networks is computationally prohibitive. The work introduces an end-to-end hierarchical NAS framework that builds task-specific human knowledge into the search for stereo matching architectures.

Stereo matching, a classical problem in computer vision, involves finding dense pixel correspondences between a rectified pair of stereo images in order to estimate a disparity map. Deep learning has improved on traditional methods, but designing a good deep network by hand remains difficult. NAS aims to reduce this manual effort by letting a search procedure choose among candidate operations and connections. The paper's framework is structured hierarchically and follows the standard three-step pipeline for deep stereo matching (feature extraction, feature volume construction, and dense matching), optimizing these components jointly.
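
To make the three steps concrete, here is a minimal sketch on a single scanline, written in plain Python. It is a toy under stated assumptions, not the paper's network: the hand-written window descriptor, the absolute-difference cost, and the function names (`extract_features`, `build_cost_volume`, `soft_argmin`) are all illustrative stand-ins for the learned feature and matching sub-networks.

```python
import math

def extract_features(row, radius=1):
    """Step 1 (toy): describe each pixel by the intensities in a small window."""
    n = len(row)
    feats = []
    for x in range(n):
        # Clamp indices at the borders so every pixel gets a full-length descriptor.
        feats.append([row[min(max(x + dx, 0), n - 1)] for dx in range(-radius, radius + 1)])
    return feats

def build_cost_volume(left_feats, right_feats, max_disp):
    """Step 2: cost[x][d] compares left pixel x with right pixel x - d."""
    big = 1e3  # assumed penalty for candidates that fall outside the right image
    volume = []
    for x, fl in enumerate(left_feats):
        costs = []
        for d in range(max_disp + 1):
            if x - d < 0:
                costs.append(big)
            else:
                fr = right_feats[x - d]
                costs.append(sum(abs(a - b) for a, b in zip(fl, fr)))
        volume.append(costs)
    return volume

def soft_argmin(costs, temperature=1.0):
    """Step 3: softmax over negative costs, then the expected disparity."""
    m = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - m) / temperature) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total

def match_scanline(left, right, max_disp=4):
    vol = build_cost_volume(extract_features(left), extract_features(right), max_disp)
    return [soft_argmin(c, temperature=0.1) for c in vol]

# Synthetic rectified pair: the left row is the right row shifted by 2 pixels.
right = [0, 0, 10, 50, 90, 30, 0, 0, 0, 0]
left = [0, 0, 0, 0, 10, 50, 90, 30, 0, 0]
disparities = match_scanline(left, right)
```

On this synthetic pair the textured pixels (indices 4 to 7) recover a disparity of 2, while the flat regions stay ambiguous. That ambiguity is one reason real systems learn both the features and the matching stage rather than fixing them by hand.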

Core Contributions and Results

Task-specific Hierarchical NAS Framework: The authors present the first known integration of a hierarchical NAS with a stereo matching pipeline. They leverage existing human knowledge of stereo matching in the NAS framework to avoid computationally prohibitive searches in large architecture spaces. This integration facilitates a more targeted optimization of the architecture to the nuances of stereo matching tasks.
Effective Architecture via Joint Optimization: The NAS framework jointly optimizes the feature extraction net and the cost volume matching net, so the whole pipeline is tuned together rather than as isolated components. The hierarchical structure searches at two levels: the cell level, which decides the operations inside a repeated building block, and the network level, which decides how those cells are arranged across the network.
SOTA Performance Across Datasets: The resulting architecture, termed LEAStereo, outperforms prior state-of-the-art (SOTA) methods on key benchmarks, including KITTI stereo 2012, KITTI stereo 2015, and Middlebury. It achieves these results with a fraction of the parameters and computational load demanded by previous methods, and with faster inference.
Evaluation Metrics: According to the abstract, LEAStereo ranks first in accuracy on the KITTI stereo 2012, KITTI stereo 2015, and Middlebury benchmarks, and first on the SceneFlow dataset, while also substantially improving on network size and inference speed.
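
The cell-level choice among operations can be illustrated with a small sketch. It assumes a differentiable, softmax-weighted relaxation of the choice, which is a common NAS technique but is not spelled out in this summary. The candidate operations (`skip`, `box3`, `box5`) and the `alphas` values are hypothetical stand-ins for convolution variants and learned architecture weights.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def skip(x):
    """Identity candidate: pass the signal through unchanged."""
    return list(x)

def box_filter(x, k):
    """Mean filter of odd width k with edge clamping (stand-in for a k-wide conv)."""
    r, n = k // 2, len(x)
    return [sum(x[min(max(i + j, 0), n - 1)] for j in range(-r, r + 1)) / k
            for i in range(n)]

# Hypothetical candidate set; a real search space would hold learned layers.
CANDIDATES = {
    "skip": skip,
    "box3": lambda x: box_filter(x, 3),
    "box5": lambda x: box_filter(x, 5),
}

def mixed_op(x, alphas):
    """During search: softmax(alpha)-weighted sum of every candidate's output."""
    weights = softmax(alphas)
    outputs = [op(x) for op in CANDIDATES.values()]
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(len(x))]

def derive_op(alphas):
    """After search: keep only the candidate with the largest architecture weight."""
    names = list(CANDIDATES)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]

signal = [0.0, 0.0, 9.0, 0.0, 0.0]
alphas = [0.1, 2.0, -1.0]  # made-up values; in a real search these are learned
blended = mixed_op(signal, alphas)
chosen = derive_op(alphas)
```

Because the blend is a smooth function of `alphas`, the operation choice can in principle be trained by gradient descent alongside the ordinary weights, and a discrete architecture is read off at the end with `derive_op`.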

Implications and Future Directions

The implications of this research are substantial for deep learning and computer vision communities, particularly in the automation of architecture design for low-level vision tasks. The incorporation of human knowledge effectively enhances the search process, making NAS applications feasible even in domains with traditionally high computational demands. The NAS framework also significantly decreases the exploration overhead by focusing on cell-level and network-level searches specific to stereo matching.
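
A rough way to see why the network-level search space needs structure is to count candidate resolution paths. The sketch below assumes a trellis in which each layer either keeps its resolution level or moves one level up or down; that rule, the function name `count_paths`, and the layer and level counts in the usage lines are illustrative assumptions, not figures from the paper.

```python
def count_paths(num_layers, num_levels, start_level=0):
    """Count resolution paths where each layer keeps its level or moves by one."""
    # ways[l] = number of distinct paths that currently sit at resolution level l
    ways = [0] * num_levels
    ways[start_level] = 1
    for _ in range(num_layers):
        nxt = [0] * num_levels
        for level, n in enumerate(ways):
            for step in (-1, 0, 1):
                target = level + step
                if 0 <= target < num_levels:
                    nxt[target] += n
        ways = nxt
    return sum(ways)

# Hypothetical sizes for the two sub-networks of the pipeline.
feature_paths = count_paths(num_layers=6, num_levels=4)
matching_paths = count_paths(num_layers=12, num_levels=4)
joint_paths = feature_paths * matching_paths  # searching both jointly multiplies the options
```

The count grows quickly with depth, and it is multiplied again by the cell-level choices. Fixing the pipeline's overall shape from prior knowledge leaves only these constrained choices to search.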

From a theoretical perspective, the work demonstrates the viability of combining NAS techniques with specialized human insights, potentially applicable to other dense vision tasks like optical flow estimation and multi-view stereo. Practically, this could lead to more adaptive and efficient AI models capable of solving a broader range of vision problems. The integration of task-specific heuristics into NAS architectures might inspire similar methodologies in other areas of AI, fostering advances in autonomous design systems.

Future research might extend this framework to further reduce search times and explore its applicability to computer vision problems beyond stereo matching. The demonstrated reduction in computational demands without loss of accuracy is promising for adopting this framework in real-world systems that need efficient and adaptive vision solutions.

GitHub

GitHub - XuelianCheng/LEAStereo: Hierarchical Neural Architecture Search for Deep Stereo Matching (NeurIPS 2020) (258 stars)