- The paper presents a novel deep learning approach that uses an MLP with Context Normalization to classify putative correspondences as inliers or outliers in wide-baseline stereo matching.
- The methodology leverages pixel coordinate inputs and a weighted eight-point algorithm for refined essential matrix estimation in an end-to-end training framework.
- Experimental results show roughly double the performance of state-of-the-art methods on challenging datasets, highlighting the method's robustness and its potential for robotics and AR applications.
Learning to Find Good Correspondences
This paper presents a novel approach to improving the accuracy of wide-baseline stereo correspondence matching using deep learning. The proposed algorithm is designed to distinguish inliers from outliers among correspondences in order to enhance relative pose estimation, which is crucial in applications like Structure from Motion (SfM).
Methodology
The authors introduce a deep learning architecture based on a multi-layer perceptron (MLP) that operates solely on the pixel coordinates of putative correspondences, eschewing direct image processing. A notable innovation in the network structure is the use of Context Normalization. This technique allows each correspondence to be processed independently while still incorporating global context, and it renders the network invariant to the order of the correspondences. This permutation invariance is critical for handling unordered data, such as a set of point correspondences in stereo matching.
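The core idea can be sketched in a few lines of PyTorch: each correspondence is processed by layers with weights shared across points, and Context Normalization standardizes every feature channel using the mean and variance computed over all correspondences of the same image pair. The block structure, channel width, and epsilon below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ContextNorm(nn.Module):
    """Normalize each feature channel using statistics taken across the
    set of correspondences of one image pair (the "context")."""
    def forward(self, x):
        # x: (batch, num_correspondences, channels)
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True)
        return (x - mean) / (std + 1e-8)

class CNBlock(nn.Module):
    """One residual MLP block: per-correspondence layers with shared
    weights, plus Context Normalization for global context."""
    def __init__(self, channels=128):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)
        self.fc2 = nn.Linear(channels, channels)
        self.cn = ContextNorm()

    def forward(self, x):
        out = torch.relu(self.cn(self.fc1(x)))
        out = torch.relu(self.cn(self.fc2(out)))
        return x + out  # residual connection keeps per-point identity
```

Because the normalization statistics are computed over the whole set, permuting the correspondences permutes the outputs in the same way, which is exactly the order invariance described above. A final shared layer can then map each feature vector to a scalar weight for the pose solver.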
The learning process is conducted in an end-to-end fashion: the network is trained to classify correspondences as inliers or outliers while concurrently optimizing the accuracy of the essential matrix used for pose recovery. Pose recovery relies on a modified eight-point algorithm adapted to operate on weighted correspondences. The network assigns each correspondence a weight reflecting the likelihood that the match is correct, and because the weighted solver is differentiable, the classification and the pose estimate improve jointly during training.
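The weighted solve itself is a small linear-algebra routine. The NumPy sketch below shows one plausible implementation under the standard epipolar convention p2ᵀ E p1 = 0 with normalized image coordinates; the function name, coordinate convention, and final rank projection are illustrative assumptions rather than the authors' exact code.

```python
import numpy as np

def weighted_eight_point(pts1, pts2, weights):
    """Estimate an essential matrix from weighted correspondences.

    pts1, pts2: (N, 2) normalized image coordinates; weights: (N,)
    non-negative per-correspondence weights (a minimal sketch).
    """
    x1, y1 = pts1[:, 0], pts1[:, 1]
    x2, y2 = pts2[:, 0], pts2[:, 1]
    ones = np.ones_like(x1)
    # Each row encodes the epipolar constraint p2^T E p1 = 0 for one match.
    A = np.stack([x2 * x1, x2 * y1, x2,
                  y2 * x1, y2 * y1, y2,
                  x1, y1, ones], axis=1)            # (N, 9)
    # Weighted least squares: smallest eigenvector of A^T diag(w) A.
    M = A.T @ (weights[:, None] * A)                 # (9, 9)
    _, eigvecs = np.linalg.eigh(M)                   # ascending eigenvalues
    E = eigvecs[:, 0].reshape(3, 3)
    # Project onto the essential-matrix manifold (two equal singular values).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```

Correspondences the network deems likely outliers receive weights near zero and thus contribute almost nothing to the normal matrix, which is what lets the solver tolerate heavily contaminated putative matches.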
Experimental Results
The experimental validation demonstrates significant improvements over state-of-the-art techniques on multiple datasets, particularly for challenging scenes with wide baselines and varying imaging conditions. The network achieves these gains with limited training data, indicating its robustness and generalization capability. On average, the method roughly doubles the performance of existing methods, underscoring the effectiveness of integrating deep learning with classical computer vision machinery such as the eight-point algorithm.
Implications and Future Directions
The findings of this paper have both practical and theoretical implications. Practically, the algorithm offers a means to more reliably recover relative camera motion in challenging visual environments, potentially benefiting applications in robotics and augmented reality, where precise motion estimation is pivotal. Theoretically, the success of Context Normalization hints at new directions for developing deep networks capable of processing unordered data.
For future work, the authors consider extending their methodology to scenarios involving unknown camera intrinsics. Adapting their approach to estimate the fundamental matrix could widen the applicability of their framework, albeit posing challenges related to stability in parameter estimation. Investigating these open issues could significantly expand the utility and robustness of deep learning in geometric computer vision tasks.