
DeDoDe: Detect, Don't Describe -- Describe, Don't Detect for Local Feature Matching (2308.08479v3)

Published 16 Aug 2023 in cs.CV

Abstract: Keypoint detection is a pivotal step in 3D reconstruction, whereby sets of (up to) K points are detected in each view of a scene. Crucially, the detected points need to be consistent between views, i.e., correspond to the same 3D point in the scene. One of the main challenges with keypoint detection is the formulation of the learning objective. Previous learning-based methods typically jointly learn descriptors with keypoints, and treat the keypoint detection as a binary classification task on mutual nearest neighbours. However, basing keypoint detection on descriptor nearest neighbours is a proxy task, which is not guaranteed to produce 3D-consistent keypoints. Furthermore, this ties the keypoints to a specific descriptor, complicating downstream usage. In this work, we instead learn keypoints directly from 3D consistency. To this end, we train the detector to detect tracks from large-scale SfM. As these points are often overly sparse, we derive a semi-supervised two-view detection objective to expand this set to a desired number of detections. To train a descriptor, we maximize the mutual nearest neighbour objective over the keypoints with a separate network. Results show that our approach, DeDoDe, achieves significant gains on multiple geometry benchmarks. Code is provided at https://github.com/Parskatt/DeDoDe

References (40)
  1. Magsac++, a fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1304–1312, 2020.
  2. Key.Net: Keypoint detection by handcrafted and learned CNN filters. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5836–5844, 2019.
  3. A case for using rotation invariant features in state of the art feature matchers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5110–5119, 2022.
  4. Learning to match features with seeded graph matching network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6301–6310, 2021.
  5. ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer. In Proc. European Conference on Computer Vision (ECCV), 2022.
  6. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018.
  7. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  8. DKM: Dense Kernelized Feature Matching for Geometry Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 17765–17775, 2023a.
  9. RoMa: Revisiting robust losses for dense feature matching. arXiv preprint arXiv:2305.15404, 2023b.
  10. SiLK: Simple Learned Keypoints. In Proceedings of the International Conference on Computer Vision (ICCV), 2023.
  11. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
  12. A combined corner and edge detector. In Proceedings of the Alvey vision conference, page 147–151, 1988.
  13. Predicting matchability. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9–16, 2014.
  14. Image matching challenge 2022, 2022.
  15. Self-supervised learning of image scale and orientation. In 31st British Machine Vision Conference 2021, BMVC 2021, Virtual Event, UK. BMVA Press, 2021.
  16. Decoupling makes weakly supervised local feature better. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15838–15848, 2022.
  17. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
  18. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the International Conference on Computer Vision (ICCV), 2023.
  19. David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  20. Working hard to know your neighbor’s margins: Local descriptor learning loss. Advances in neural information processing systems, 30, 2017.
  21. Repeatability is not enough: Learning affine regions via discriminability. In Proceedings of the European conference on computer vision (ECCV), pages 284–300, 2018.
  22. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193, 2023.
  23. SOLD2: Self-supervised occlusion-aware line description and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11368–11378, 2021.
  24. DeepLSD: Line segment detection and refinement with deep image gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17327–17336, 2023.
  25. R2d2: Reliable and repeatable detector and descriptor. Advances in neural information processing systems, 32:12405–12415, 2019.
  26. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020.
  27. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
  28. Training for task specific keypoint detection. In Pattern Recognition: 31st DAGM Symposium, Jena, Germany, September 9-11, 2009. Proceedings 31, pages 151–160. Springer, 2009.
  29. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.
  30. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7199–7209, 2018.
  31. Quadtree attention for vision transformers. In International Conference on Learning Representations, 2022.
  32. DISK: learning local features with policy gradient. In NeurIPS, 2020.
  33. Tilde: A temporally invariant learned detector. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5279–5288, 2015.
  34. MatchFormer: Interleaving attention in transformers for feature matching. In Asian Conference on Computer Vision, 2022.
  35. Learning soft estimator of keypoint scale and orientation with probabilistic covariant loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19406–19415, 2022.
  36. Lift: Learned invariant feature transform. In European conference on computer vision, pages 467–483. Springer, 2016.
  37. ASTR: Adaptive spot-guided transformer for consistent local feature matching. In The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
  38. Alike: Accurate and lightweight keypoint detection and descriptor extraction. IEEE Transactions on Multimedia, 2022.
  39. Aliked: A lighter keypoint and descriptor extraction network via deformable transformation. IEEE Transactions on Instrumentation & Measurement, 72:1–16, 2023.
  40. PMatch: Paired masked image modeling for dense geometric matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

Summary

  • The paper introduces the DeDoDe framework that decouples keypoint detection from descriptor learning to enhance 3D consistency.
  • It employs a semi-supervised two-view detection strategy and a mutual nearest neighbor objective for robust descriptor training.
  • Experiments on MegaDepth-1500 and the Image Matching Challenge 2022 demonstrate significant gains in AUC and mAA metrics.

Overview of DeDoDe: Keypoint Detection for 3D Reconstruction

This essay examines the paper "DeDoDe: Detect, Don't Describe -- Describe, Don't Detect for Local Feature Matching" by Johan Edstedt et al., which introduces a new approach to keypoint detection for 3D reconstruction tasks. The authors tackle the challenge of detecting 3D-consistent keypoints across different views without relying on descriptor-based proxy tasks, which do not guarantee true 3D consistency.

Methodology

The core contribution of this work is the DeDoDe framework, which decouples the detection and description phases within the keypoint detection process for Structure from Motion (SfM). The authors identify the limitations of traditional joint learning approaches that bind keypoints to specific descriptors and instead propose a two-pronged strategy:

  1. Keypoint Detection (Detect, Don't Describe): DeDoDe learns its detector from ground-truth 3D tracks in SfM reconstructions, avoiding the usual descriptor-driven nearest-neighbor proxy task. Because SfM tracks are often overly sparse, a semi-supervised two-view detection objective expands this sparse supervision to a desired number of detections. Directly optimizing for 3D consistency improves both precision and recall over the initial track-based detections.
  2. Descriptor Learning (Describe, Don't Detect): Separately, descriptors are trained using a mutual nearest neighbor objective. Two descriptor models, DeDoDe-B and DeDoDe-G, are introduced, with the latter incorporating DINOv2 features to accommodate complex structures requiring more extensive contextual understanding.
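To make the descriptor objective concrete, the sketch below illustrates mutual nearest neighbour matching: two keypoints are matched only if their descriptors are each other's nearest neighbour under cosine similarity. This is a simplified NumPy illustration of the general idea, not the authors' training code.

```python
import numpy as np

def mutual_nearest_neighbors(desc_a, desc_b):
    """Return (i, j) index pairs where desc_a[i] and desc_b[j] are
    each other's nearest neighbor under cosine similarity."""
    # L2-normalize so dot products equal cosine similarity.
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                  # (N_a, N_b) similarity matrix
    nn_ab = sim.argmax(axis=1)     # best match in B for each point in A
    nn_ba = sim.argmax(axis=0)     # best match in A for each point in B
    i = np.arange(len(a))
    mutual = nn_ba[nn_ab[i]] == i  # keep only mutually consistent pairs
    return np.stack([i[mutual], nn_ab[mutual]], axis=1)
```

During training, the descriptor network is rewarded for making correct correspondences mutual nearest neighbours over the detected keypoints; at inference, the same rule yields the match set fed to a robust estimator.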

Results and Performance

The DeDoDe method delivers substantial improvements on several geometry benchmarks. Notably, it narrows the performance gap between traditional detector-descriptor pipelines and contemporary end-to-end matching frameworks. Extensive evaluations show significant gains on benchmarks such as MegaDepth-1500 and the Image Matching Challenge 2022.

  • MegaDepth-1500: DeDoDe attains an AUC@5° of 52.8, compared to 41.9 for prior state-of-the-art models such as ALIKED.
  • Image Matching Challenge 2022: The approach demonstrates robustness and generalization, achieving competitive mAA@10 scores alongside advanced graph neural network-based matchers.
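The pose AUC quoted above is the area under the cumulative error curve: the fraction of image pairs whose relative pose error falls below each level, integrated up to the threshold (5° here) and normalised. A minimal NumPy sketch of this standard computation (an illustration; the exact benchmark scripts may differ in details):

```python
import numpy as np

def pose_auc(errors_deg, threshold_deg):
    """Area under the cumulative pose-error curve, clipped at
    threshold_deg and normalised to [0, 1]; higher is better."""
    errors = np.sort(np.asarray(errors_deg, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    # Start the curve at the origin.
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    # Clip the curve at the error threshold.
    last = np.searchsorted(errors, threshold_deg)
    e = np.concatenate((errors[:last], [threshold_deg]))
    r = np.concatenate((recall[:last], [recall[last - 1]]))
    # Trapezoidal integration, normalised by the threshold.
    return float(np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2) / threshold_deg)
```

For example, a method whose every pose error is 2.5° scores an AUC@5° of 0.75, while one whose errors all exceed 5° scores 0.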

Implications and Future Directions

The decoupled training of detectors and descriptors as proposed in DeDoDe suggests a shift in how feature matching pipelines might be constructed, focusing on modularity and adaptability. This work supports the notion that detectors can be trained independently to improve generalizability and robustness across different tasks.

The practical implications of DeDoDe are significant for 3D computer vision applications such as augmented reality, robotics, and autonomous navigation, where robust and efficient keypoint detection is essential.

Future research could explore:

  • Integrating advanced augmentation techniques to improve DeDoDe's generalization in diverse settings.
  • Enhancing the detector's scalability and efficiency to handle more challenging viewpoints and illumination changes.
  • Experimenting with alternative data modalities (e.g., LiDAR) for keypoint detection.

This work presents a coherent argument for revisiting and revising the traditional interdependent learning strategies for detectors and descriptors, thereby paving the way for more versatile systems in 3D reconstruction contexts.
