TopicFM: Robust and Interpretable Topic-Assisted Feature Matching (2207.00328v3)
Abstract: This study addresses image matching in challenging cases, such as large scene variations or textureless scenes. To gain robustness in such situations, most previous studies have attempted to encode the global contexts of a scene via graph neural networks or transformers. However, these contexts do not explicitly represent high-level contextual information, such as structural shapes or semantic instances; therefore, the encoded features are still not sufficiently discriminative in challenging scenes. We propose a novel image-matching method that applies a topic-modeling strategy to encode high-level contexts in images. The proposed method learns latent semantic instances called topics. It explicitly models an image as a multinomial distribution over topics and then performs probabilistic feature matching. This approach improves the robustness of matching by focusing on the same semantic areas in both images. In addition, the inferred topics provide interpretability for the matching results, making our method explainable. Extensive experiments on outdoor and indoor datasets show that our method outperforms other state-of-the-art methods, particularly in challenging cases. The code is available at https://github.com/TruongKhang/TopicFM.
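The idea of modeling an image as a multinomial distribution over topics and then matching features probabilistically can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that each feature's topic distribution is a softmax over dot products with fixed topic embeddings, and that a pair's match score is its descriptor similarity weighted by the probability that both features share a topic. All function names and the toy data are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def topic_distribution(feature, topics):
    # Assumption: per-feature topic probabilities come from a softmax
    # over feature/topic dot products.
    return softmax([dot(feature, t) for t in topics])

def image_topic_distribution(features, topics):
    # Image-level multinomial over topics: mean of per-feature distributions.
    dists = [topic_distribution(f, topics) for f in features]
    return [sum(d[k] for d in dists) / len(dists) for k in range(len(topics))]

def co_topic_probability(theta_a, theta_b):
    # Probability that two features are assigned the same topic.
    return sum(p * q for p, q in zip(theta_a, theta_b))

def match_scores(feats_a, feats_b, topics):
    # Descriptor similarity (softmax over candidates in image B),
    # down-weighted when the two features are unlikely to share a topic.
    thetas_a = [topic_distribution(f, topics) for f in feats_a]
    thetas_b = [topic_distribution(f, topics) for f in feats_b]
    scores = []
    for fa, ta in zip(feats_a, thetas_a):
        sim = softmax([dot(fa, fb) for fb in feats_b])
        scores.append([s * co_topic_probability(ta, tb)
                       for s, tb in zip(sim, thetas_b)])
    return scores

# Hypothetical toy data: two topics and two 2-D features per image.
topics = [[1.0, 0.0], [0.0, 1.0]]
feats_a = [[3.0, 0.0], [0.0, 3.0]]
feats_b = [[0.0, 3.0], [3.0, 0.0]]

scores = match_scores(feats_a, feats_b, topics)
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best)  # prints [1, 0]
```

In this toy setup each feature in the first image is matched to the feature in the second image that lies in the same topic, which mirrors the abstract's point about focusing on the same semantic areas.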