- The paper introduces a detector-free approach using Transformers to achieve dense and robust feature matching in challenging visual scenarios.
- It employs a coarse-to-fine matching pipeline that first establishes dense matches at a coarse level and then refines selected matches to sub-pixel precision.
- Experimental results show LoFTR outperforms state-of-the-art methods on benchmarks such as HPatches, ScanNet, and MegaDepth, with promising implications for visual localization and 3D reconstruction.
Detector-Free Local Feature Matching with Transformers
This paper presents Local Feature Transformer (LoFTR), an innovative method that foregoes traditional feature detectors for local image feature matching, leveraging Transformer architectures to achieve robust matching performance in challenging scenarios. By avoiding the conventional three-stage pipeline of feature detection, description, and matching, LoFTR sidesteps the repeatability and reliability problems that detectors exhibit in low-texture and repetitive-pattern regions. The authors instead exploit the capacity of Transformers for both self- and cross-attention, enabling global context to be considered and features to be matched robustly across diverse regions.
Local image matching typically involves selecting salient points, describing their local neighborhoods, and then determining correspondences. Such a detector-based approach, however, often fails in low-texture or repetitive areas. LoFTR confronts this issue by first establishing dense matches at a coarse, downsampled level with Transformers and subsequently refining them to sub-pixel accuracy; a sketch of this two-stage matching follows.
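The snippet below is a minimal PyTorch sketch of this two-stage idea: dual-softmax coarse matching followed by expectation-based sub-pixel refinement, both of which the paper describes. Function names, tensor shapes, the temperature, and the confidence threshold are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of LoFTR-style coarse-to-fine matching (hypothetical shapes).
import torch
import torch.nn.functional as F

def coarse_match(feat_a, feat_b, temperature=0.1, threshold=0.2):
    """feat_a: (N, C), feat_b: (M, C) -- flattened coarse-level features."""
    sim = feat_a @ feat_b.T / temperature              # (N, M) similarity matrix
    # Dual-softmax: product of row-wise and column-wise matching probabilities.
    conf = F.softmax(sim, dim=0) * F.softmax(sim, dim=1)
    # Keep mutual nearest neighbours above a confidence threshold.
    mask = (conf == conf.max(dim=1, keepdim=True).values) \
         & (conf == conf.max(dim=0, keepdim=True).values) \
         & (conf > threshold)
    return mask.nonzero(as_tuple=False), conf          # (K, 2) index pairs

def refine_subpixel(center_feat, window_feat, window_size=5):
    """center_feat: (K, C) query features; window_feat: (K, w*w, C) fine-level
    windows cropped around each coarse match in the other image."""
    corr = (window_feat @ center_feat.unsqueeze(-1)).squeeze(-1)   # (K, w*w)
    prob = F.softmax(corr / center_feat.shape[-1] ** 0.5, dim=1)   # heatmap
    # Expectation over window coordinates yields a sub-pixel offset.
    coords = torch.stack(torch.meshgrid(
        torch.arange(window_size), torch.arange(window_size),
        indexing="ij"), dim=-1).float().view(-1, 2)                # (w*w, 2)
    return prob @ coords - (window_size // 2)                      # (K, 2)
```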
Key Contributions
- Transformer Utilization for Dense Matching: Traditional convolutional neural networks (CNNs) have limited receptive fields, which restrict their ability to distinguish features in low-texture regions. LoFTR employs Transformers to obtain a global receptive field: positionally encoded features are transformed by interleaved self- and cross-attention layers, so indistinctive areas can be matched by drawing on context from the entire image pair (see the attention sketch after this list).
- Detector-Free Approach: LoFTR circumvents the dependency on feature detectors, which often struggle with repeatability. By establishing dense matches directly using a Transformer model, the system provides a more reliable set of matches even in challenging visual conditions like motion blur or significant viewpoint changes.
- Coarse-to-Fine Matching: Matching proceeds in two stages. Dense matches are first established at a downsampled resolution; each selected match is then refined to sub-pixel precision with a correlation-based method, as in the sketch above, where a local heatmap is computed around the coarse match and its expectation gives the final position.
- Optimization and Efficiency: Computational cost is kept manageable through linear Transformers, which reduce the complexity of attention from quadratic to linear in sequence length. This makes LoFTR efficient enough for real-time use, even on higher-resolution images; a sketch of the linear-attention layer follows this list.
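Below is a minimal PyTorch sketch of a linear-attention layer in the style LoFTR uses, with the elu(x) + 1 kernel feature map from Katharopoulos et al. The layer structure, dimensions, and the interleaving schedule in the closing comment are illustrative assumptions rather than the paper's exact configuration.

```python
# Linear attention: O(L) in sequence length instead of O(L^2).
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, L, D). Kernelized attention via an associativity trick."""
    q, k = F.elu(q) + 1, F.elu(k) + 1          # positive kernel feature map
    kv = torch.einsum("bld,ble->bde", k, v)    # (B, D, D) summary of keys/values
    z = 1.0 / (torch.einsum("bld,bd->bl", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bld,bde,bl->ble", q, kv, z)

class LoFTRStyleLayer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.merge = nn.Linear(dim * 2, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, source):
        """Self-attention when source is x; cross-attention when source is
        the other image's features."""
        msg = linear_attention(self.q(x), self.k(source), self.v(source))
        return x + self.norm(self.merge(torch.cat([x, msg], dim=-1)))

# Illustrative interleaving of self/cross updates over both images' features:
# for _ in range(num_layers):
#     fa = layer(fa, fa); fb = layer(fb, fb)    # self-attention
#     fa = layer(fa, fb); fb = layer(fb, fa)    # cross-attention
```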
Experimental Results
Experiments on the HPatches, ScanNet, and MegaDepth benchmarks demonstrate the superior performance of LoFTR over existing methods. For homography estimation on HPatches, LoFTR significantly outperforms the state of the art, especially under stricter error thresholds. In relative pose estimation on ScanNet and MegaDepth, LoFTR consistently leads in AUC across the reported error thresholds.
Notably, in visual localization challenges on the Aachen Day-Night and InLoc datasets, LoFTR surpasses leading methods, particularly excelling in environments characterized by extreme illumination changes or low-texture regions. The results underscore the potential practical impact of LoFTR in diverse real-world applications requiring reliable visual matching under challenging conditions.
Practical and Theoretical Implications
Practically, LoFTR can enhance applications in 3D reconstruction, SLAM, and visual localization, where robustness to missed or unreliable feature detections is critical. Its detector-free design translates to fewer failures in regions that are traditionally problematic for feature-based approaches, delivering more consistent performance unconstrained by detector limitations.
Theoretically, this work pushes the boundary of how Transformer architectures can be applied to vision tasks. It challenges the detector-centric paradigm, potentially steering future research toward detector-free methods that leverage the global context inherent to Transformer models. This could redefine approaches not only to feature matching but to broader image-understanding tasks as well.
Future Developments and Directions
The paper hints at several avenues for future work. Extending LoFTR to handle extreme environmental changes, such as severe seasonal variation, would be a significant direction. Given its computational efficiency, integrating LoFTR into mobile and edge devices could also be explored.
In conclusion, the LoFTR model proposed in this paper sets a promising trajectory for local feature matching by exploiting the potential of Transformer architectures, presenting a robust and computationally efficient alternative to traditional detector-based methods. This advance is poised to benefit a wide array of computer vision applications, from augmented reality to autonomous navigation.