Optimal Transport Aggregation for Visual Place Recognition (2311.15937v2)

Published 27 Nov 2023 in cs.CV

Abstract: The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.


Summary

  • The paper introduces SALAD, a single-stage method that reformulates feature aggregation as an optimal transport problem to achieve state-of-the-art Recall@1 scores (75.0% on MSLS and 76.0% on Nordland).
  • It integrates a dustbin cluster and fine-tunes DINOv2, enabling robust and informative image descriptors from deep features under varying conditions.
  • Experimental results demonstrate significant efficiency gains by eliminating re-ranking steps, promising practical benefits for robotics and augmented reality applications.

Optimal Transport Aggregation for Visual Place Recognition

The paper "Optimal Transport Aggregation for Visual Place Recognition" by Sergio Izquierdo and Javier Civera proposes a novel approach to aggregating visual features in the context of Visual Place Recognition (VPR). It introduces a single-stage methodology named SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which leverages optimal transport theory to enhance feature aggregation and improve place recognition accuracy over existing methods.

Methodology and Contributions

The authors frame the task of VPR as an image retrieval problem, where the goal is to match a query image against a database of geo-localized reference images. The effectiveness of this retrieval process hinges on the quality of image descriptors, which must be both discriminative and robust against challenges such as varying illumination, structural changes, and seasonal effects. Modern VPR systems often employ deep neural networks to extract features, followed by a process of feature aggregation to form global descriptors.
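To make the retrieval framing concrete, the sketch below shows nearest-neighbor search over L2-normalized global descriptors. The `retrieve` helper, its tensor shapes, and the choice of cosine similarity are illustrative assumptions, not code from the SALAD repository.

```python
import torch

def retrieve(query_desc: torch.Tensor, db_descs: torch.Tensor, top_k: int = 5):
    """Rank reference images by similarity to the query.

    query_desc: (D,) global descriptor of the query image.
    db_descs:   (M, D) descriptors of the geo-tagged reference images.
    Both are assumed L2-normalized, so the dot product equals cosine similarity.
    """
    sims = db_descs @ query_desc       # (M,) similarity scores
    return sims.topk(top_k).indices    # indices of the best-matching references
```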

SALAD reformulates the feature aggregation step traditionally handled by methods like NetVLAD, which clusters local features and softly assigns them to cluster centroids. The paper's key modification is to cast this feature-to-cluster assignment as an optimal transport problem, a perspective that distributes feature mass across clusters in a globally balanced way.
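A minimal log-domain Sinkhorn sketch illustrates this reformulation. The uniform marginals, iteration count, and function name are simplifying assumptions for exposition rather than the paper's exact formulation:

```python
import torch

def sinkhorn_assignment(scores: torch.Tensor, num_iters: int = 5) -> torch.Tensor:
    """Turn a feature-to-cluster score matrix into a transport plan.

    scores: (N, K) similarities between N local features and K clusters.
    Alternating row/column normalization in log space drives the plan
    toward uniform marginals: each feature sends out mass 1/N, and each
    cluster receives mass 1/K, so both directions are balanced.
    """
    N, K = scores.shape
    log_r = -torch.log(torch.tensor(float(N)))  # target per-row mass
    log_c = -torch.log(torch.tensor(float(K)))  # target per-column mass
    log_P = scores
    for _ in range(num_iters):
        log_P = log_P - torch.logsumexp(log_P, dim=1, keepdim=True) + log_r
        log_P = log_P - torch.logsumexp(log_P, dim=0, keepdim=True) + log_c
    return log_P.exp()  # (N, K) soft assignment / transport plan
```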

Key innovations in this approach include:

  • Use of Optimal Transport: By computing feature assignments with the Sinkhorn algorithm, SALAD allocates feature mass in both directions, from features to clusters and from clusters to features, leading to more balanced and informative aggregates.
  • Dustbin Cluster: The model introduces a 'dustbin' cluster that selectively discards non-informative features, enhancing the robustness and quality of the resulting descriptors (see the sketch after this list).
  • Use of Foundation Models: The integration of DINOv2, a Vision Transformer (ViT), as the backbone for feature extraction marks another core contribution. The model is not used merely in its pre-trained form, as in previous approaches, but is fine-tuned specifically for the VPR task, yielding improved performance with reduced training time.
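The sketch below ties the first two ideas together, reusing the hypothetical `sinkhorn_assignment` helper from above: a dustbin column is appended to the score matrix before running Sinkhorn, and any mass routed there is dropped during aggregation. The learnable scalar dustbin score and the plain weighted-sum aggregation are simplifying assumptions; the paper's actual aggregation and its handling of DINOv2 tokens involve additional details.

```python
import torch

def aggregate_with_dustbin(features: torch.Tensor, scores: torch.Tensor,
                           dustbin_score: torch.Tensor) -> torch.Tensor:
    """Aggregate local features into a global descriptor, discarding
    whatever mass the transport plan routes to the dustbin.

    features:      (N, D) local features from the backbone.
    scores:        (N, K) feature-to-cluster similarities.
    dustbin_score: scalar (learnable) score for the discard column.
    """
    N, K = scores.shape
    # Append the dustbin as an extra (K+1)-th column, then solve OT.
    dustbin_col = dustbin_score.expand(N, 1)
    plan = sinkhorn_assignment(torch.cat([scores, dustbin_col], dim=1))
    plan = plan[:, :K]                 # drop the dustbin column's mass
    desc = plan.T @ features           # (K, D) per-cluster weighted sums
    return torch.nn.functional.normalize(desc.flatten(), dim=0)
```

In such a setup, `dustbin_score` could be a single learnable parameter, e.g. `torch.nn.Parameter(torch.tensor(0.0))`, trained jointly with the backbone so the network learns how much feature mass to throw away.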

Empirical Results

Experiments on standard benchmarks, including the MSLS Challenge and Nordland, showcase DINOv2 SALAD's strength in the VPR domain. The method achieves state-of-the-art results, with a reported 75.0% Recall@1 on the MSLS Challenge dataset and 76.0% on Nordland. This performance is attained without the added computational burden of two-stage VPR pipelines that rely on re-ranking steps.

The performance gains on particularly challenging datasets (e.g., Nordland, known for its pronounced seasonal variations) underline SALAD's ability to generate highly discriminative descriptors that are resilient to environmental changes.

Theoretical and Practical Implications

Theoretically, SALAD bridges optimal transport theory with deep learning systems to address the feature aggregation problem in VPR, which could inspire similar approaches in other computer vision tasks requiring robust feature aggregation. The paper demonstrates that a careful reconsideration of the mathematical formulation of common tasks in machine learning, such as feature assignment, can yield significant improvements in system performance.

Practically, the reduction in training time and computational complexity signifies a considerable step forward for real-world applications where efficiency is crucial, such as in robotics and augmented reality systems deployed in dynamic environments.

Future Prospects

While the paper focuses primarily on outdoor environments with established benchmarks, the proposed formulation could extend to other scene domains and to more general retrieval tasks. Further research could refine the methodology by integrating more nuanced transport constraints or by pairing optimal transport with more advanced architectures, reaching into diverse fields such as medical image analysis.

In conclusion, the paper offers a well-articulated demonstration of how optimal transport can enhance deep learning pipelines through thoughtful problem reformulation and well-chosen architectural integrations, setting a precedent for similar advances in visual recognition and related fields.
