- The paper demonstrates that SAM-derived features can effectively guide local feature learning, significantly improving visual localization accuracy.
- It integrates SAM with Edge Attention Guidance in a novel pipeline, reaching an MMA@3 of 82.1 on HPatches and outperforming conventional methods.
- Extensive ablation studies confirm SAMFeat's efficiency, achieving high performance with only 6 hours of training on dual RTX 3090 GPUs.
Segment Anything Model as a Teacher for Local Feature Learning
The paper "Supplementary Materials for ECCV 2024 paper: Segment Anything Model is a Good Teacher for Local Feature Learning" analyzes how the Segment Anything Model (SAM) can serve as a teacher for local feature learning. The approach addresses persistent challenges in accurately detecting and describing local features for visual localization by leveraging a segmentation foundation model.
Methodological Framework
The authors propose a method that integrates SAM features into the local feature learning framework, tested primarily on the Aachen Day-Night dataset (v1.1). The visual localization pipeline extracts the learned local features to build a structure-from-motion model, then registers query images against this model via mutual-nearest-neighbor keypoint matching. This design enables precise localization under various error tolerances.
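The mutual-nearest-neighbor matching step used for query registration can be sketched as follows. This is an illustrative implementation of the standard technique, not the paper's code; function and variable names are our own.

```python
# Mutual nearest neighbor matching between two sets of L2-normalized
# descriptors: keep a pair (i, j) only if descriptor i's best match is j
# AND descriptor j's best match is i.
import numpy as np

def mutual_nn_match(desc_a: np.ndarray, desc_b: np.ndarray) -> np.ndarray:
    """Return (K, 2) index pairs that are mutual nearest neighbors."""
    sim = desc_a @ desc_b.T                  # cosine similarity (descriptors pre-normalized)
    nn_ab = sim.argmax(axis=1)               # best match in B for each descriptor in A
    nn_ba = sim.argmax(axis=0)               # best match in A for each descriptor in B
    idx_a = np.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a           # keep only mutually consistent pairs
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

# Toy usage: matching a descriptor set against a copy of itself should
# recover the identity pairing.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))
a /= np.linalg.norm(a, axis=1, keepdims=True)
matches = mutual_nn_match(a, a.copy())
print(matches)  # each row is (i, i)
```

The mutual-consistency check is what makes this filter robust: one-directional nearest-neighbor matching admits many spurious pairs that this test rejects.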
A critical component of this work is SAMFeat, whose Edge Attention Guidance (EAG) learns an edge map distilled from SAM and uses it to focus the network's attention on edge-rich regions. The empirical results show that these refined edge maps improve the accuracy and robustness of the local descriptors, underscoring the practical utility of SAM-derived supervision.
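The idea behind Edge Attention Guidance can be sketched as a learned single-channel edge map that re-weights the dense feature map so edge-rich regions receive more attention. The weighting form and all names below are illustrative assumptions, not the paper's implementation; during training, the edge logits would be supervised by SAM-derived edge maps.

```python
# Hedged sketch of Edge Attention Guidance: a predicted edge map in
# (0, 1) amplifies features near edges while leaving flat regions
# roughly unchanged.
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def edge_attention(feat: np.ndarray, w_edge: np.ndarray):
    """feat: (C, H, W) dense features; w_edge: (C,) weights of a 1x1
    conv predicting a single-channel edge logit map."""
    edge = sigmoid(np.tensordot(w_edge, feat, axes=([0], [0])))  # (H, W) in (0, 1)
    out = feat * (1.0 + edge)[None]  # amplify edge-rich locations
    return out, edge

# Toy usage with random features and weights.
rng = np.random.default_rng(1)
feat = rng.normal(size=(8, 16, 16))
w = rng.normal(size=(8,))
out, edge = edge_attention(feat, w)
print(out.shape, edge.shape)  # (8, 16, 16) (16, 16)
```

Because the gate is `1 + edge` rather than `edge` alone, no location is suppressed to zero; edges are emphasized without discarding texture elsewhere.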
Experimental Verification
Numerous ablation studies demonstrate the effectiveness of integrating SAM with various local feature learning components. Notably, Pixel Semantic Relational Distillation (PSRD) outperforms Direct Semantic Feature Distillation (DSFD), achieving a Mean Matching Accuracy (MMA) @3 of 78.6 versus 76.9. Hyper-parameter tuning further shows that the configuration (M=0.07, T=5) yields the best results, reinforcing the importance of careful parameter selection in optimizing feature learning.
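The contrast between direct feature distillation and relational distillation can be illustrated as follows. This is a generic sketch under assumed loss forms, not the paper's exact PSRD/DSFD formulation: the relational variant matches pairwise pixel-similarity structure rather than raw features, so the student need not share SAM's feature dimensionality.

```python
# DSFD-style vs PSRD-style distillation losses (illustrative only).
import numpy as np

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def direct_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    # Direct distillation: requires matching channel dimensions
    # (or a learned projection between them).
    return float(np.mean((student - teacher) ** 2))

def relational_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    # Relational distillation: compare N x N pixel-similarity matrices;
    # student and teacher dimensions may differ.
    s, t = l2norm(student), l2norm(teacher)
    return float(np.mean((s @ s.T - t @ t.T) ** 2))

# Toy usage: student and SAM-teacher pixel features of different widths.
rng = np.random.default_rng(2)
pix_s = rng.normal(size=(16, 32))    # student pixel features
pix_t = rng.normal(size=(16, 256))   # teacher pixel features
loss = relational_distill_loss(pix_s, pix_t)
print(loss)
```

Transferring relations instead of raw features is a common way to sidestep the dimensionality mismatch between a large teacher and a lightweight student.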
SAMFeat distinguishes itself by producing a high MMA@3 score of 82.1 on HPatches, outperforming competing methods trained on larger datasets. Furthermore, SAMFeat achieves this result with remarkable efficiency, requiring only 6 hours of training on dual Nvidia RTX 3090 GPUs, underscoring its lightweight design and resource efficiency.
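For context, MMA@t on HPatches is conventionally the fraction of matched keypoints whose reprojection error under the ground-truth homography falls below t pixels, averaged over image pairs. A minimal sketch of the per-pair computation (names are our own):

```python
# Matching accuracy at a pixel threshold for one image pair, given
# matched keypoints and the ground-truth homography H mapping A -> B.
import numpy as np

def matching_accuracy(kpts_a: np.ndarray, kpts_b: np.ndarray,
                      H: np.ndarray, thresh: float = 3.0) -> float:
    # Project keypoints from image A into image B with homography H.
    pts = np.concatenate([kpts_a, np.ones((len(kpts_a), 1))], axis=1)
    proj = pts @ H.T
    proj = proj[:, :2] / proj[:, 2:3]          # de-homogenize
    err = np.linalg.norm(proj - kpts_b, axis=1)
    return float((err < thresh).mean())        # fraction of correct matches

# Identity homography with identical keypoints: every match is correct.
pts = np.array([[10.0, 20.0], [30.0, 40.0]])
print(matching_accuracy(pts, pts, np.eye(3), thresh=3.0))  # → 1.0
```

The dataset-level MMA@3 reported above would then be this quantity averaged over all HPatches sequences at t = 3 pixels.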
Implications and Future Directions
The findings from this paper contribute significantly to the understanding and application of visual foundation models. SAMFeat demonstrates that segmentation models can serve as valuable teachers in local feature learning domains, introducing new potential for zero-shot generalization in downstream tasks. These insights open avenues for further exploration into integrating other visual foundation models, such as visual pre-training models (e.g., DINOv2, MAE) and generative models, into local feature learning paradigms.
Future research could explore leveraging multimodal foundation models and improving synthetic dataset generation for robust local feature learning, addressing current constraints on applying multimodal and generative models to this task.
Conclusion
This paper provides a solid framework for using SAM to improve the accuracy and efficiency of local feature learning. By integrating SAM into the training process, the method overcomes limitations of traditional local descriptor pipelines and opens the door to applications across computer vision. The thorough experimental validation and analysis set a precedent for future work, highlighting both the practical feasibility and the promise of segmentation-based teachers for local feature learning.