- The paper introduces SuperGlue, a method that leverages attention-based GNNs for dynamic and robust local feature matching in diverse visual scenarios.
- It employs a differentiable optimal transport framework via the Sinkhorn algorithm to address unmatched keypoints and optimize pairwise correspondences.
- Extensive experiments show significant improvements in homography and pose estimation, making SuperGlue a versatile tool in SLAM and SfM pipelines.
SuperGlue: Learning Feature Matching with Graph Neural Networks
The paper proposes SuperGlue, a novel approach to feature matching in computer vision built on graph neural networks (GNNs) and attention mechanisms. The method targets the crucial task of establishing correspondences between two sets of local image features, addressing challenges such as occlusion, large viewpoint changes, and lighting variations.
Overview of SuperGlue Architecture
SuperGlue operates as a middle-end module situated between feature detection and pose estimation in visual Simultaneous Localization and Mapping (SLAM) or Structure-from-Motion (SfM) pipelines. It accepts local features, whether learned or handcrafted, and aggregates context flexibly through attention-based GNNs. The core architecture comprises two principal components:
- Attentional Graph Neural Network: This element aggregates information within and across images through self- and cross-attention layers. Keypoints from both images form nodes in a graph, while self-edges connect keypoints within the same image and cross-edges link keypoints between images. The attention mechanism allows dynamic and selective information propagation, crucial for handling complex visual relationships and geometric constraints (a minimal sketch of this message passing appears after this list).
- Optimal Matching Layer: This component solves a differentiable optimal transport problem to produce a partial assignment matrix, handling keypoints that remain unmatched due to occlusion or detection failures. The assignment is computed with the Sinkhorn algorithm, which iteratively normalizes the score matrix so that each keypoint is matched at most once or assigned to a dustbin.
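To make the message-passing idea concrete, here is a minimal sketch, assuming PyTorch, of how descriptors from two images could be refined by alternating self- and cross-attention. The class names, layer sizes, and use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of attentional message passing over two keypoint sets.
# Shapes, layer sizes, and class names are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionalPropagation(nn.Module):
    """One message-passing step: attend from `source` keypoints to update `x`."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, dim))

    def forward(self, x, source):
        # x: (B, N, D) descriptors to update; source: (B, M, D) descriptors attended to.
        message, _ = self.attn(x, source, source)
        return x + self.mlp(torch.cat([x, message], dim=-1))  # residual update


class AttentionalGNN(nn.Module):
    """Alternate self-attention (within an image) and cross-attention (across images)."""

    def __init__(self, dim: int, num_layers: int = 9):
        super().__init__()
        self.layers = nn.ModuleList(AttentionalPropagation(dim) for _ in range(2 * num_layers))

    def forward(self, desc0, desc1):
        for i, layer in enumerate(self.layers):
            if i % 2 == 0:  # self-attention: each image attends to its own keypoints
                desc0, desc1 = layer(desc0, desc0), layer(desc1, desc1)
            else:           # cross-attention: each image attends to the other image
                desc0, desc1 = layer(desc0, desc1), layer(desc1, desc0)
        return desc0, desc1


# Example: 256-D descriptors for 500 and 400 keypoints in the two images.
gnn = AttentionalGNN(dim=256, num_layers=2)
d0, d1 = gnn(torch.randn(1, 500, 256), torch.randn(1, 400, 256))
```

In the self-attention steps each image refines its own descriptors, while the cross-attention steps let every keypoint gather evidence from candidate matches in the other image.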
Key Methodological Contributions
- Contextual Aggregation: By leveraging attention mechanisms, SuperGlue can dynamically aggregate context beyond local information. This contrasts with previous methods that rely on limited heuristics or fixed receptive fields.
- Optimal Transport for Matching: The relaxation of the linear assignment problem into an optimal transport problem allows for efficient and differentiable matching, facilitating end-to-end training (see the Sinkhorn sketch after this list).
- Integration Flexibility: SuperGlue can be combined with various feature detectors and descriptors, making it a versatile component for existing computer vision pipelines.
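As a rough illustration of this optimal-transport relaxation, the sketch below augments an N×M score matrix with a dustbin row and column, so that keypoints can remain unmatched, and runs log-space Sinkhorn iterations. The dustbin score, marginals, and iteration count are assumptions chosen for clarity rather than the paper's exact hyperparameters.

```python
# Illustrative log-space Sinkhorn normalization with a dustbin row/column
# for unmatched keypoints. Constants and names are assumptions.
import torch


def sinkhorn_matching(scores: torch.Tensor, dustbin: float = 1.0, iters: int = 50):
    """scores: (N, M) pairwise matching scores. Returns an (N+1, M+1) soft assignment."""
    n, m = scores.shape
    # Append a dustbin row and column so every keypoint can stay unmatched.
    bin_row = torch.full((1, m), dustbin)
    bin_col = torch.full((n + 1, 1), dustbin)
    log_p = torch.cat([torch.cat([scores, bin_row], dim=0), bin_col], dim=1)

    # Target marginals: each keypoint carries unit mass; dustbins absorb the rest.
    log_mu = torch.cat([torch.zeros(n), torch.tensor([float(m)]).log()])
    log_nu = torch.cat([torch.zeros(m), torch.tensor([float(n)]).log()])

    u = torch.zeros(n + 1)
    v = torch.zeros(m + 1)
    for _ in range(iters):  # alternate row/column normalization in log-space
        u = log_mu - torch.logsumexp(log_p + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_p + u[:, None], dim=0)
    return (log_p + u[:, None] + v[None, :]).exp()


# Example: soft assignment between 5 and 4 keypoints from random scores.
P = sinkhorn_matching(torch.randn(5, 4))
print(P.shape, P[:5, :4].sum())  # (6, 5); total mass placed on "real" matches
```

Because every step here is differentiable, gradients can flow from the resulting assignment back through the GNN, which is what enables end-to-end training.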
Experimental Evaluation
SuperGlue was rigorously tested on tasks spanning synthetic homography estimation, indoor and outdoor pose estimation, and real-world visual localization. Key findings include:
- Homography Estimation: SuperGlue achieved an area under the curve (AUC) of 65.85% when homographies were estimated with the Direct Linear Transform (DLT) from its matches, significantly outperforming both handcrafted and learned baselines. It also exhibited high precision and recall, demonstrating robustness to large viewpoint changes and occlusions.
- Indoor Pose Estimation: On the ScanNet dataset, SuperGlue yielded AUC values of 16.16%, 33.81%, and 51.84% at pose error thresholds of 5°, 10°, and 20°, respectively. This notably surpassed the results of other methods using SuperPoint features, confirming its superior performance in challenging indoor environments.
- Outdoor Pose Estimation: On the PhotoTourism dataset, SuperGlue achieved 34.18%, 50.32%, and 64.16% AUC at the same pose error thresholds, setting new benchmarks for outdoor scenes and highlighting its effectiveness under significant lighting changes and occlusions; a sketch of how matches feed such a pose evaluation follows this list.
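The AUC figures above summarize cumulative pose error over many image pairs. As a hedged sketch of how predicted matches feed such an evaluation, the snippet below recovers a relative pose with OpenCV from pre-computed correspondences and integrates a pose-error curve up to each threshold; the helper names and integration details are assumptions and may differ from the paper's exact protocol.

```python
# Rough sketch of evaluating matches via relative pose error AUC.
# Assumes pre-computed correspondences and camera intrinsics K.
import numpy as np
import cv2


def relative_pose_from_matches(pts0, pts1, K):
    """Estimate relative rotation/translation from matched points (Nx2 arrays)."""
    E, inliers = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)
    return R, t


def rotation_error_deg(R_est, R_gt):
    """Angular difference between two rotation matrices, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1) / 2
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def pose_auc(errors_deg, thresholds=(5, 10, 20)):
    """Area under the cumulative pose-error curve, one value per threshold."""
    errors = np.sort(np.asarray(errors_deg))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    aucs = []
    for t in thresholds:
        mask = errors <= t
        # Trapezoidal integration of recall vs. error, normalized by the threshold.
        x = np.concatenate([[0.0], errors[mask], [t]])
        y = np.concatenate([[0.0], recall[mask], [recall[mask][-1] if mask.any() else 0.0]])
        aucs.append(np.trapz(y, x) / t)
    return aucs
```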
Implications and Future Directions
The introduction of SuperGlue has several theoretical and practical implications:
- Enhanced 3D Reconstruction: Reliable feature matching is pivotal for accurate 3D reconstruction. SuperGlue's ability to handle challenging scenarios improves the robustness and accuracy of 3D models generated in both indoor and outdoor settings.
- Real-time Applications: SuperGlue runs in real time on a modern GPU, making it suitable for tasks such as SLAM in autonomous robotics and augmented reality.
- Potential for End-to-End Systems: With SuperGlue, vision systems can better integrate end-to-end learning approaches, transitioning from modular hand-engineered systems to more flexible and powerful deep learning solutions.
Future developments could further optimize SuperGlue's computational efficiency, expanding its applicability to resource-constrained environments. Another direction is jointly training SuperPoint and SuperGlue in a fully end-to-end pipeline, potentially improving the detection and matching stages together.
Conclusion
SuperGlue represents a significant advance in the domain of local feature matching, leveraging GNNs and attention mechanisms to provide a robust and flexible solution. Through extensive experiments, it has demonstrated superior performance across multiple challenging tasks, paving the way for enhanced visual perception systems in both academic and practical applications. The open-source availability of SuperGlue further promotes its adoption and potential for ongoing improvements in the computer vision community.