
Large Scale Joint Semantic Re-Localisation and Scene Understanding via Globally Unique Instance Coordinate Regression (1909.10239v1)

Published 23 Sep 2019 in cs.CV

Abstract: In this work we present a novel approach to joint semantic localisation and scene understanding. Our work is motivated by the need for localisation algorithms which not only predict 6-DoF camera pose but also simultaneously recognise surrounding objects and estimate 3D geometry. Such capabilities are crucial for computer vision guided systems which interact with the environment: autonomous driving, augmented reality and robotics. In particular, we propose a two step procedure. During the first step we train a convolutional neural network to jointly predict per-pixel globally unique instance labels and corresponding local coordinates for each instance of a static object (e.g. a building). During the second step we obtain scene coordinates by combining object center coordinates and local coordinates and use them to perform 6-DoF camera pose estimation. We evaluate our approach on real world (CamVid-360) and artificial (SceneCity) autonomous driving datasets. We obtain smaller mean distance and angular errors than state-of-the-art 6-DoF pose estimation algorithms based on direct pose regression and pose estimation from scene coordinates on all datasets. Our contributions include: (i) a novel formulation of scene coordinate regression as two separate tasks of object instance recognition and local coordinate regression and a demonstration that our proposed solution allows to predict accurate 3D geometry of static objects and estimate 6-DoF pose of camera on (ii) maps larger by several orders of magnitude than previously attempted by scene coordinate regression methods, as well as on (iii) lightweight, approximate 3D maps built from 3D primitives such as building-aligned cuboids.

Authors (4)
  1. Ignas Budvytis (26 papers)
  2. Marvin Teichmann (6 papers)
  3. Tomas Vojir (11 papers)
  4. Roberto Cipolla (62 papers)
Citations (21)

Summary

Large Scale Joint Semantic Re-Localisation and Scene Understanding

The paper "Large Scale Joint Semantic Re-Localisation and Scene Understanding via Globally Unique Instance Coordinate Regression" by Budvytis et al. introduces a framework that couples semantic scene understanding with camera re-localisation. The method is distinctive in that it merges conventional 6-DoF camera pose estimation with scene understanding tasks, with direct applicability to autonomous driving, augmented reality, and robotics. The central motivation is the need for localisation systems that not only predict the camera pose but also provide semantic insight into the surrounding environment.

Methodology Overview

The proposed methodology proceeds in two steps. First, a convolutional neural network (CNN) predicts per-pixel globally unique instance labels together with local coordinates relative to each instance of a static object. Second, scene coordinates are derived by adding the predicted local coordinates to the corresponding object-center coordinates. These scene coordinates then provide 2D-3D correspondences for estimating the 6-DoF camera pose by solving a perspective-n-point problem with EPnP inside a RANSAC loop.
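The two steps above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names and the `object_centers` lookup are hypothetical, and a plain DLT solve stands in for the EPnP + RANSAC procedure the paper actually uses, to keep the sketch self-contained.

```python
import numpy as np

def scene_coordinates(instance_ids, local_coords, object_centers):
    """Scene coordinate = object-center coordinate + predicted local offset."""
    centers = np.array([object_centers[i] for i in instance_ids])
    return centers + local_coords

def dlt_pose(scene_pts, pixels, K):
    """Linear (DLT) pose solve from 2D-3D correspondences.

    The paper uses EPnP with RANSAC; a direct linear solve is shown here
    only to keep the example dependency-free.
    """
    # Normalize pixels with the intrinsics: x_n = K^{-1} [u, v, 1]^T.
    n = scene_pts.shape[0]
    xn = (np.linalg.inv(K) @ np.hstack([pixels, np.ones((n, 1))]).T).T
    A = []
    for (X, Y, Z), (u, v, _) in zip(scene_pts, xn):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    M = Vt[-1].reshape(3, 4)
    # The DLT solution is defined up to scale; the determinant of the
    # rotation block fixes both the scale and the sign.
    s = np.cbrt(np.linalg.det(M[:, :3]))
    M = M / s
    # Project the rotation block back onto SO(3).
    U, _, Vt2 = np.linalg.svd(M[:, :3])
    return U @ Vt2, M[:, 3]
```

In practice the correspondences are noisy and some pixels carry wrong instance labels, which is why the paper wraps the minimal solver in RANSAC rather than solving once over all pixels.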

Key Contributions

  1. Novel Scene Coordinate Formulation: The authors propose a significant reformulation of scene coordinate regression, dividing it into object instance recognition and local coordinate regression tasks. This approach offers a refined mechanism for predicting accurate 3D geometry and estimating camera pose.
  2. Scalability: The framework allows for the handling of maps far larger than those previously managed by scene coordinate regression techniques, with evaluations conducted on approximately 1.5 km to 11.5 km driving sequences.
  3. Approximate 3D Mapping: Another noteworthy contribution is the development of lightweight 3D maps using simple primitives, like building-aligned cuboids, maintaining robustness and accuracy in pose estimation with reduced complexity.
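To make the cuboid idea in contribution (iii) concrete, the sketch below fits a PCA-aligned box to a point set and expresses points in the box's local frame. This is an illustrative stand-in only: the paper's maps use building-aligned cuboids, and how those cuboids are actually constructed may differ from a raw PCA fit.

```python
import numpy as np

def fit_cuboid(points):
    """Fit an oriented cuboid (center, axes, half-extents) to 3D points.

    Illustrative only: a PCA fit stands in for the paper's
    building-aligned cuboid construction.
    """
    center = points.mean(axis=0)
    centred = points - center
    # Principal directions of the point set give the cuboid orientation.
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    local = centred @ Vt.T                 # points in the cuboid frame
    half_extents = np.abs(local).max(axis=0)
    return center, Vt, half_extents

def local_coordinates(points, center, axes):
    """Coordinates relative to a cuboid, as the CNN would regress them."""
    return (points - center) @ axes.T
```

Storing one center, orientation, and extent triple per building is what makes such maps lightweight compared with dense scene-coordinate ground truth.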

Empirical Performance

Evaluations were conducted on both the real-world CamVid-360 dataset and the synthetic SceneCity dataset. The method achieved lower mean distance and angular errors than state-of-the-art baselines, including direct pose regression approaches such as PoseNet and methods that estimate pose from scene coordinates.

  • On CamVid-360 and SceneCity datasets, the method achieved median distance errors of 22 cm and 20 cm, and angular errors of 0.71° and 0.76° respectively, surpassing the baseline methods.
  • For the CamVid-360 dataset, a notable portion of the pixels were localized within 50 cm of their ground truth locations.
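The distance and angular errors quoted above are typically computed as below. These are the standard pose-error definitions (Euclidean distance between camera positions, geodesic angle of the relative rotation); the assumption that the paper uses exactly these conventions is mine.

```python
import numpy as np

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth camera positions."""
    return np.linalg.norm(t_est - t_gt)

def angular_error_deg(R_est, R_gt):
    """Geodesic distance on SO(3): the angle of the relative rotation."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

The median (rather than mean) of these per-frame errors is often reported because it is robust to the occasional catastrophic RANSAC failure.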

Implications and Future Directions

The implications of this work are multifaceted. Practically, the integration of semantic scene understanding with pose estimation holds substantial promise for real-time applications where both efficiency and accuracy are imperative. Theoretically, this work underscores a noteworthy shift towards simultaneously addressing multiple tasks within a unified framework. Future research directions may explore extending this model to dynamic environments, incorporating real-time feedback for continuous learning, and leveraging additional data modalities to refine robustness and scalability.

In conclusion, the paper advocates for a comprehensive framework integrating localization and semantic scene understanding, showcasing promising advancements over existing models and setting a strong foundation for future explorations in the domain of computer vision and autonomous systems.
