- The paper presents a novel end-to-end trainable framework that learns local features from images without relying on hand-crafted detectors.
- It employs a multi-scale convolutional detector paired with a descriptor network, trained with supervision derived from depth and relative camera pose, and reports higher matching accuracy than SIFT and SuperPoint.
- The approach attains over 60 fps on QVGA images, demonstrating both computational efficiency and practical utility for real-time computer vision applications.
Analysis of LF-Net: Learning Local Features from Images
This paper introduces LF-Net, a novel approach to learning local features from images using a deep learning architecture. Its primary contribution lies in developing an end-to-end trainable framework for sparse feature matching that does not rely on hand-crafted detectors. The authors exploit depth and relative camera pose information to create a training signal without human supervision, which enables the network to learn feature correspondences directly from the data.
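The supervision signal described above rests on a standard geometric fact: given a pixel's depth and the relative pose between two cameras, its ground-truth correspondence in the other image can be computed by back-projecting, transforming, and re-projecting. The sketch below illustrates that warp with a pinhole camera model; the function name and the shared-intrinsics assumption are illustrative, not LF-Net's actual code.

```python
import numpy as np

def warp_keypoint(uv, depth, K, R, t):
    """Project pixel `uv` from image A into image B (illustrative sketch).

    uv    : (2,) pixel coordinates in image A
    depth : scalar depth of that pixel in A's camera frame
    K     : (3,3) camera intrinsics (assumed shared by both views)
    R, t  : rotation (3,3) and translation (3,) taking A's frame to B's
    """
    # Back-project the pixel to a 3D point in A's camera frame.
    uv1 = np.array([uv[0], uv[1], 1.0])
    p_a = depth * (np.linalg.inv(K) @ uv1)
    # Move the point into B's frame, then project back to pixels.
    p_b = R @ p_a + t
    uv_b = K @ p_b
    return uv_b[:2] / uv_b[2]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# With identity rotation and zero translation, a keypoint maps to itself.
print(warp_keypoint(np.array([100.0, 80.0]), 2.0, K, np.eye(3), np.zeros(3)))
```

Applying this warp to every detected keypoint yields dense correspondence targets, which is what lets the network train without human annotation.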
Methodological Insights
The LF-Net architecture comprises two principal components: a detector and a descriptor. The detector identifies keypoints with associated scales and orientations, while the descriptor computes distinctive vectors for these points. The detector employs a multi-scale convolutional network to capture keypoint locations in a scale-invariant manner. The key novelty is the two-branch training scheme: one branch runs the full pipeline, including the non-differentiable keypoint-selection step, to generate targets, while the other branch is kept fully differentiable so that gradients can propagate through it. Trained on image pairs with known depth and relative pose, this scheme optimizes the entire feature pipeline without requiring handcrafted priors.
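The non-differentiable step in the target-generating branch is, at its core, hard selection of the strongest responses across a stack of per-scale score maps. A minimal sketch of that selection, assuming a simple `(scales, H, W)` score tensor (names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def top_k_keypoints(score_maps, k):
    """Select the k strongest responses from multi-scale score maps.

    score_maps : (S, H, W) detector responses over S scales
    returns    : (k, 3) array of (scale, y, x) indices, strongest first

    This hard argmax-style selection is not differentiable, which is why
    it lives in the target-generating branch rather than the gradient path.
    """
    flat = score_maps.ravel()
    idx = np.argpartition(flat, -k)[-k:]        # unordered top-k indices
    idx = idx[np.argsort(flat[idx])[::-1]]      # sort strongest first
    return np.stack(np.unravel_index(idx, score_maps.shape), axis=1)

# Two 4x4 score maps with planted peaks at (scale 1, y=2, x=3) and (0, 1, 1).
sm = np.zeros((2, 4, 4))
sm[1, 2, 3] = 5.0
sm[0, 1, 1] = 3.0
print(top_k_keypoints(sm, 2))
```

In a real pipeline the selected `(scale, y, x)` triples would also be refined to sub-pixel positions and passed to the descriptor network; this sketch shows only the discrete selection that breaks differentiability.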
Performance Evaluation
The experiments test the method on datasets with challenging imaging conditions. The authors report that LF-Net outperforms existing approaches in matching accuracy while sustaining a throughput of over 60 fps on QVGA (320×240) images, highlighting its computational efficiency.
Numerical Results and Claims
A key aspect of the research is the comparison against state-of-the-art methods. On sparse feature matching benchmarks, LF-Net is reported to outperform both hand-crafted baselines such as SIFT and learned pipelines such as SuperPoint.
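Matching accuracy in this setting is typically measured by matching descriptors between two views and counting how many mutual nearest-neighbour matches land within a pixel threshold of the ground-truth warp. A hedged sketch of such a metric follows; the function name, threshold default, and mutual-check convention are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def matching_score(desc_a, desc_b, kp_a_warped, kp_b, px_thresh=5.0):
    """Fraction of mutual nearest-neighbour descriptor matches whose
    keypoints fall within `px_thresh` pixels of the ground-truth warp.

    desc_a, desc_b : (N, D) and (M, D) descriptor sets for the two images
    kp_a_warped    : (N, 2) image-A keypoints warped into image B
    kp_b           : (M, 2) image-B keypoint locations
    """
    # Pairwise Euclidean distances between the two descriptor sets.
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)                     # A -> B nearest neighbours
    nn_ba = d.argmin(axis=0)                     # B -> A nearest neighbours
    mutual = nn_ba[nn_ab] == np.arange(len(desc_a))
    # Geometric error of each putative match against the ground truth.
    err = np.linalg.norm(kp_a_warped - kp_b[nn_ab], axis=1)
    correct = mutual & (err < px_thresh)
    return correct.sum() / max(len(desc_a), 1)
```

For example, three perfectly matching descriptor/keypoint pairs yield a score of 1.0, while permuted descriptors with unmoved keypoints drop it to 0.0.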
Practical and Theoretical Implications
Practically, the LF-Net approach has broad applications in any visual task that requires robust feature matching, such as object recognition, 3D reconstruction, and image retrieval. Theoretically, the paper contributes to the ongoing replacement of classical computer vision techniques with deep learning models, reinforcing the trend towards integrated, end-to-end learned solutions.
Prospects for Future Developments
Looking forward, this research paves the way for further exploration of unsupervised feature learning with deep networks. Future work might improve the architecture's robustness to severe transformations or scale the approach to handle higher-resolution images efficiently. Integrating it with related self-supervised paradigms could further reduce dependence on annotated datasets.
In conclusion, LF-Net represents a significant advancement in the automated extraction of local features directly from image data, circumventing the need for handcrafted interventions. By enabling end-to-end learning with a supervision signal derived from scene geometry rather than human labels, it offers exciting potential for various computer vision applications and motivates further research into fully differentiable keypoint detection and matching systems.