Driven to Distraction: Self-Supervised Distractor Learning for Robust Monocular Visual Odometry in Urban Environments (1711.06623v2)

Published 17 Nov 2017 in cs.RO, cs.AI, cs.CV, and cs.LG

Abstract: We present a self-supervised approach to ignoring "distractors" in camera images for the purposes of robustly estimating vehicle motion in cluttered urban environments. We leverage offline multi-session mapping approaches to automatically generate a per-pixel ephemerality mask and depth map for each input image, which we use to train a deep convolutional network. At run-time we use the predicted ephemerality and depth as an input to a monocular visual odometry (VO) pipeline, using either sparse features or dense photometric matching. Our approach yields metric-scale VO using only a single camera and can recover the correct egomotion even when 90% of the image is obscured by dynamic, independently moving objects. We evaluate our robust VO methods on more than 400km of driving from the Oxford RobotCar Dataset and demonstrate reduced odometry drift and significantly improved egomotion estimation in the presence of large moving vehicles in urban traffic.

Citations (67)

Summary

  • The paper introduces a self-supervised learning method using ephemerality masks to make monocular visual odometry robust against dynamic objects in urban environments.
  • The proposed method uses prior 3D maps from multi-session data and entropy analysis to automatically generate training data for depth and ephemerality mask prediction.
  • Empirical evaluation on the Oxford RobotCar dataset shows reduced odometry drift and improved egomotion estimation compared to conventional methods, even with large moving object occlusions.

Self-Supervised Learning for Robust Monocular Visual Odometry

The paper, "Driven to Distraction: Self-Supervised Learning for Robust Monocular Visual Odometry in Urban Environments," introduces an innovative approach to enhancing visual odometry (VO) for autonomous vehicle navigation in dynamic urban settings. The core contribution is a self-supervised learning mechanism that predicts an 'ephemerality mask' for every pixel in camera images. This mask helps identify static elements that are reliable for motion estimation and ephemeral objects that may introduce errors due to their dynamic nature.

Methodological Overview

The authors present a technique to automatically generate ephemerality masks and depth maps, which are used to train a deep convolutional network. Training data are collected offline by a LIDAR-equipped survey vehicle over multiple sessions; an entropy analysis of the accumulated 3D structure distinguishes static from dynamic regions across the repeated traversals. At run-time, given input from a single monocular camera, the network predicts depth and an ephemerality mask that feed into the VO pipeline, which uses either sparse features or dense photometric matching.
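
As a concrete, if simplified, illustration of the kind of network the paper describes, the sketch below pairs a shared convolutional encoder with two decoder heads, one regressing depth and one predicting a per-pixel ephemerality mask. The layer sizes, names, and activations here are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class EphemeralityNet(nn.Module):
    """Minimal encoder/multi-decoder sketch: shared features, two heads."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )

        def make_decoder(out_channels):
            return nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
            )

        self.depth_head = make_decoder(1)  # per-pixel depth
        self.mask_head = make_decoder(1)   # per-pixel ephemerality logit

    def forward(self, image):
        feats = self.encoder(image)
        depth = torch.relu(self.depth_head(feats))           # non-negative depth
        ephemerality = torch.sigmoid(self.mask_head(feats))  # in [0, 1]
        return depth, ephemerality

# A 192x640 RGB frame yields a depth map and mask at the input resolution.
depth, mask = EphemeralityNet()(torch.randn(1, 3, 192, 640))
```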

Key methodology components include:

  • Prior 3D Mapping: Multi-session traversals are fused into a 3D pointcloud, and an entropy-based analysis separates static structure from ephemeral regions, producing reliable 3D maps despite the dynamics typical of urban environments (see the first sketch after this list).
  • Ephemerality Mask Labelling: Labels are generated without predefined object classes or manual annotation, by comparing disparity and surface-normal differences between live stereo imagery and the prior maps.
  • Network Architecture: A convolutional encoder-multi-decoder network, trained on raw monocular video data, jointly predicts depth and ephemerality (sketched above).
  • Robust Visual Odometry: Two VO methods are presented, a sparse feature-based approach and a dense photometric approach, both of which use the ephemerality mask for improved motion estimation (see the second sketch after this list).
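
The entropy-based map filtering in the first bullet can be illustrated with a small sketch. Note the assumptions: the paper operates on accumulated LIDAR pointclouds, whereas this toy version uses a binary per-cell occupancy record across sessions, and the threshold value is arbitrary.

```python
import numpy as np

def label_static_cells(occupancy, entropy_threshold=0.5):
    """Label map cells as static from multi-session occupancy.

    occupancy: (num_sessions, num_cells) binary array; entry [s, c] is 1
    if cell c contained returns during session s. Cells observed
    consistently (low entropy, frequently occupied) are treated as static
    structure; cells that appear and disappear across sessions (high
    entropy) are treated as ephemeral.
    """
    p = occupancy.mean(axis=0)  # occupancy frequency per cell
    eps = 1e-12                 # avoid log(0)
    entropy = -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))
    return (entropy < entropy_threshold) & (p > 0.5)

# Example: cell 0 is a building (always occupied), cell 1 a parked car
# (seen in one session only), cell 2 free space (never occupied).
occ = np.array([[1, 1, 0],
                [1, 0, 0],
                [1, 0, 0],
                [1, 0, 0]])
print(label_static_cells(occ))  # [ True False False]
```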

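To make the dense photometric variant in the last bullet concrete, the hedged sketch below down-weights each pixel's photometric residual by its predicted ephemerality, so independently moving objects contribute little to the motion estimate. The exact weighting and robust cost used in the paper may differ; treat this as a minimal illustration.

```python
import numpy as np

def weighted_photometric_cost(I_ref, I_warped, ephemerality):
    """Photometric error with ephemerality-based down-weighting.

    I_ref, I_warped: (H, W) grayscale images, where I_warped is the
    current frame warped into the reference view using the predicted
    depth and a candidate camera motion. ephemerality: (H, W) values in
    [0, 1], with 1 meaning "likely dynamic".
    """
    w = 1.0 - ephemerality               # trust static pixels
    r = I_ref - I_warped                 # per-pixel photometric residual
    return np.sum(w * r**2) / np.sum(w)  # normalized weighted cost
```

A sparse feature-based front end can apply the same idea by rejecting or down-weighting keypoint matches whose predicted ephemerality exceeds a threshold.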
Results and Implications

Empirical evaluation on over 400km of driving from the Oxford RobotCar Dataset shows that the proposed VO methods reduce odometry drift and improve egomotion estimation, with the largest gains in scenarios where large moving objects occlude the view. The ephemerality-aware approach delivers better velocity estimates in the presence of distractors than conventional methods, demonstrating the robustness and real-time applicability of the proposed system.

Broader Impact and Future Developments

The self-supervised generation of ephemerality masks offers a significant advantage for autonomous systems, especially in urban settings crowded with dynamic objects. The approach enables visual navigation with minimal sensing requirements and without exhaustive manual data labeling.

The research holds promising implications beyond VO, such as dynamic object tracking, scene segmentation, obstacle detection, and improved real-time decision-making for autonomous systems. Future work could integrate the ephemerality mask concept into broader environment-perception frameworks, enhancing autonomous navigation and object interaction in complex, cluttered scenes, and could improve generalization across diverse operational scenarios.
