- The paper introduces a dual-phase approach that maps multi-modal data into a unified embedding space to enhance coarse visual localization.
- It integrates geometric, visual, and structural features using architectures like PointNet, transformers, and Graph Attention Networks.
- On indoor datasets, the method outperforms cross-modal baselines while reducing storage requirements and speeding up queries.
SceneGraphLoc: Enhancing Coarse Visual Localization through Multi-Modal 3D Scene Graphs
Introduction
In computer vision and robotics, coarse visual localization, also known as place recognition, is a key challenge with direct relevance to autonomous navigation and augmented reality. Traditional methods typically rely on large databases of posed images, which are storage-intensive and slow to query. The paper introduces SceneGraphLoc, a novel approach that localizes a query image within a multi-modal database represented by 3D scene graphs, efficiently leveraging information from several modalities: object-level point clouds, images, object attributes, and relationships between objects.
Methodology
SceneGraphLoc adopts a dual-phase approach: generating object embeddings within the scene graph and generating object embeddings from the query image. The first phase maps multiple modalities into a unified embedding space for each node (an object instance) in the scene graph. The second phase learns fixed-size embeddings for image patches, each representing an object instance visible in the query image. Each patch embedding is then associated with the most similar scene-graph node via nearest-neighbor search under cosine similarity, as sketched below.
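The matching step itself is straightforward once both sides live in the shared space. The following is a minimal sketch of cosine-similarity nearest-neighbor assignment; the function name and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def match_patches_to_nodes(patch_emb: torch.Tensor,
                           node_emb: torch.Tensor) -> torch.Tensor:
    """Assign each query-image patch to its nearest scene-graph node.

    patch_emb: (P, D) embeddings of P image patches.
    node_emb:  (N, D) embeddings of N scene-graph nodes.
    Returns a (P,) tensor of node indices.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    patch_emb = F.normalize(patch_emb, dim=-1)
    node_emb = F.normalize(node_emb, dim=-1)
    sim = patch_emb @ node_emb.T   # (P, N) cosine similarity matrix
    return sim.argmax(dim=-1)      # index of the nearest node per patch
```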
Scene Graph Embedding
The core of the scene graph embedding phase is integrating multi-modal data into a compact yet informative representation for each object: geometric information from point clouds, visual information from images, and structural, attribute, and relationship information from the scene graph itself. The method employs a PointNet-style encoder for geometric features, a transformer for aggregating multi-view visual embeddings, and a Graph Attention Network for structural embeddings, yielding a holistic per-object representation. A simplified fusion sketch follows.
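To make the fusion concrete, here is a simplified sketch of how per-modality features might be projected and combined into one node embedding. It assumes each modality is already encoded upstream (point clouds by a PointNet-style network, multi-view image features by a transformer, structure by a GAT); the class name, dimensions, and averaging scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NodeEmbedder(nn.Module):
    """Fuse pre-encoded per-modality features of one object node
    into a single embedding in the shared space."""

    def __init__(self, dims: dict[str, int], d_out: int = 256):
        super().__init__()
        # One linear projection per modality into the shared space.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, d_out) for name, d in dims.items()}
        )

    def forward(self, feats: dict[str, torch.Tensor]) -> torch.Tensor:
        # Average the projected modality embeddings; a modality that is
        # unavailable for this object can simply be left out of `feats`.
        z = torch.stack([self.proj[m](x) for m, x in feats.items()])
        return z.mean(dim=0)

# Example usage with illustrative feature dimensions per modality.
embedder = NodeEmbedder({"geometry": 1024, "visual": 768, "structure": 256})
node_emb = embedder({
    "geometry": torch.randn(1024),   # e.g., PointNet-style output
    "visual": torch.randn(768),      # e.g., aggregated multi-view features
    "structure": torch.randn(256),   # e.g., GAT output
})
```

Averaging makes the fusion robust to missing modalities, which matters in practice when, for example, an object has no associated images.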
Contrastive Learning Framework
The unified embedding space is learned with a contrastive objective: a query image and its corresponding scene graph form a positive pair, while scene graphs of different scenes paired with the same query image serve as negatives. The framework is designed to be robust to temporal changes, acknowledging the dynamic nature of real-world scenes.
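An InfoNCE-style loss is one common way to realize this positive/negative setup. The sketch below is an assumption about the general form, not the paper's exact loss: it scores a query-image embedding against several candidate scene-graph embeddings and treats all but the true scene as negatives.

```python
import torch
import torch.nn.functional as F

def scene_contrastive_loss(img_emb: torch.Tensor,
                           scene_embs: torch.Tensor,
                           pos_idx: int,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling a query image toward its own scene graph.

    img_emb:    (D,) embedding of the query image.
    scene_embs: (S, D) embeddings of S candidate scene graphs; all but
                the one at pos_idx act as negatives.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    scene_embs = F.normalize(scene_embs, dim=-1)
    logits = scene_embs @ img_emb / temperature   # (S,) scaled similarities
    target = torch.tensor(pos_idx)
    # Cross-entropy over the candidates maximizes similarity to the
    # positive scene relative to the negatives.
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
```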
Experiments and Results
SceneGraphLoc was evaluated on two large-scale, real-world indoor datasets, 3RScan and ScanNet, where it significantly outperforms other cross-modal methods and approaches the performance of state-of-the-art image-based methods with notably lower storage requirements and faster queries. Its ability to exploit a variety of modalities makes it a lightweight and efficient alternative for coarse localization.
Conclusion and Future Work
SceneGraphLoc represents a significant step toward leveraging 3D scene graphs for efficient visual localization. By combining multi-modal data with a contrastive learning framework, the approach markedly improves coarse localization performance. Its demonstrated efficiency in both storage and computation opens up new possibilities for real-world applications, suggesting a promising direction for future developments in AI-assisted navigation and augmented reality systems.