- The paper introduces a two-stage process that first generates a coarse, piecewise planar scaffold from sparse VIO data and then refines it using an encoder-decoder neural network.
- It trains with four loss components (photometric, sparse depth, pose consistency, and local smoothness) and introduces exponential and logarithmic mapping layers to improve the rotation parameterization of the pose network.
- The method is validated on the newly introduced VOID dataset and the KITTI benchmark, highlighting its potential for efficient, unsupervised depth completion in mobile and consumer applications.
Unsupervised Depth Completion from Visual Inertial Odometry: A Methodological Overview and Analysis
The paper "Unsupervised Depth Completion from Visual Inertial Odometry" by Wong et al. introduces an innovative approach to inferring dense depth maps by leveraging sparse depth data acquired through visual-inertial odometry (VIO) systems. This research positions itself distinctively by bypassing the dense point cloud availability typically provided by lidar or structured light sensors. Instead, it proposes to infer the scene topology utilizing a piecewise planar representation as the scaffolding, alongside a sparse set of depth measurements and monocular image sequences.
Method and Implementation
The core methodology is a two-stage process: scaffold generation followed by a refinement step carried out by a neural network. The scaffold serves as a coarse estimate of the scene's geometry, constructed by applying a lifting transform and Delaunay triangulation to the sparse depth points and interpolating linearly within each resulting triangle.
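To make the scaffolding step concrete, the sketch below builds a piecewise planar depth map from sparse measurements using scipy, whose Delaunay triangulation internally relies on the same lifting transform (projection onto a paraboloid) that the paper references. The function name `build_scaffold` and the zeros-as-missing convention are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def build_scaffold(sparse_depth: np.ndarray) -> np.ndarray:
    """Densify a sparse depth map (H, W), zeros marking missing values, by
    linear (barycentric) interpolation over a Delaunay triangulation."""
    h, w = sparse_depth.shape
    ys, xs = np.nonzero(sparse_depth)               # pixels with a measurement
    points = np.stack([xs, ys], axis=1).astype(np.float64)
    values = sparse_depth[ys, xs]
    interp = LinearNDInterpolator(points, values)   # Delaunay under the hood
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    scaffold = interp(grid_x, grid_y)               # NaN outside the convex hull
    return np.nan_to_num(scaffold, nan=0.0)         # leave exterior holes at 0
```

Because interpolation is linear inside each triangle, the result is piecewise planar by construction, which matches the representation the paper describes.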
The refinement stage uses a network that combines visual and depth information to improve the scaffold-generated depth map. An encoder-decoder architecture with separate pathways for the image and depth inputs is employed, following a late fusion strategy. This choice reportedly reduces the parameter count significantly compared to models built on ResNet encoders: the VGG11-based architecture uses only about 9.7M parameters, compared to 27.8M for a comparable supervised model.
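The following is a hedged PyTorch sketch of the late-fusion idea only: two independent encoders whose features are concatenated at the bottleneck before a shared decoder. Layer widths, depths, and the Softplus output activation are assumptions for illustration; the paper's actual network is a VGG11-style encoder-decoder and is not reproduced here.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Strided convolution halves spatial resolution at each stage.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.ReLU(inplace=True))

class LateFusionDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Separate encoder pathways for image and scaffold depth (late fusion).
        self.image_enc = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.depth_enc = nn.Sequential(conv_block(1, 16), conv_block(16, 32))
        # Shared decoder operates on the concatenated latent features.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + 32, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
            nn.Softplus())  # keeps predicted depth positive (an assumption)

    def forward(self, image, scaffold_depth):
        z = torch.cat([self.image_enc(image),
                       self.depth_enc(scaffold_depth)], dim=1)
        return self.decoder(z)
```

Fusing late keeps each modality's early features separate and, because both branches can stay narrow, helps explain the small parameter budget relative to a single wide ResNet encoder.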
Loss Functionality and Training
The training protocol is built on four loss components, combined as a weighted sum (a sketch follows the list):
- Photometric Consistency: Penalizing the reconstruction error between the current frame and temporally adjacent frames warped using the predicted depth and relative pose.
- Sparse Depth Consistency: Anchoring the output to the sparse VIO depth measurements to preserve metric scale.
- Pose Consistency: Ensuring that forward and backward pose estimates from an auxiliary pose network compose to the identity.
- Local Smoothness: Encouraging locally smooth depth without sacrificing sharp transitions at image edges.
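A minimal sketch of how these four terms might combine into a single objective appears below. The weights `w_ph`, `w_sz`, `w_pc`, `w_sm` and all helper conventions (zeros marking missing sparse depth, 4x4 pose matrices, a precomputed reprojected image) are placeholder assumptions; the paper's exact formulation, particularly the photometric warping, is more involved.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_depth, image, reproj_image, sparse_depth,
               pose_fwd, pose_bwd,
               w_ph=1.0, w_sz=0.2, w_pc=0.1, w_sm=0.01):
    # Photometric consistency: error between the current frame and a
    # neighboring frame warped with the predicted depth and relative pose.
    loss_ph = F.l1_loss(reproj_image, image)

    # Sparse depth consistency: anchor predictions to the VIO measurements
    # (zeros in sparse_depth denote missing values).
    mask = (sparse_depth > 0).float()
    loss_sz = (mask * (pred_depth - sparse_depth).abs()).sum() \
              / mask.sum().clamp(min=1)

    # Pose consistency: forward and backward relative poses (B, 4, 4)
    # should compose to the identity.
    eye = torch.eye(4, device=pose_fwd.device).expand_as(pose_fwd)
    loss_pc = F.l1_loss(pose_fwd @ pose_bwd, eye)

    # Local smoothness: edge-aware penalty on depth gradients, downweighted
    # where the image itself has strong gradients.
    dI_dx = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dI_dy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    dD_dx = (pred_depth[..., :, 1:] - pred_depth[..., :, :-1]).abs()
    dD_dy = (pred_depth[..., 1:, :] - pred_depth[..., :-1, :]).abs()
    loss_sm = (dD_dx * torch.exp(-dI_dx)).mean() \
            + (dD_dy * torch.exp(-dI_dy)).mean()

    return w_ph * loss_ph + w_sz * loss_sz + w_pc * loss_pc + w_sm * loss_sm
```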
A novel contribution is the introduction of exponential and logarithmic mapping layers for better rotation parameterization in pose estimation, shown empirically to yield superior depth reconstructions.
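To illustrate what such a layer does, the sketch below implements the exponential map from so(3) to SO(3) via Rodrigues' formula, letting a pose network regress an unconstrained 3-vector while always producing a valid rotation matrix. This shows the general technique; the paper's layer may differ in detail.

```python
import torch

def exp_map_so3(omega: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Batched exponential map: (B, 3) axis-angle vectors -> (B, 3, 3) rotations."""
    theta = omega.norm(dim=1, keepdim=True).clamp(min=eps)   # rotation angle
    axis = omega / theta                                     # unit rotation axis
    x, y, z = axis.unbind(dim=1)
    zero = torch.zeros_like(x)
    # Skew-symmetric matrix K with K @ v == cross(axis, v).
    K = torch.stack([zero, -z, y,
                     z, zero, -x,
                     -y, x, zero], dim=1).view(-1, 3, 3)
    eye = torch.eye(3, device=omega.device).expand_as(K)
    s = torch.sin(theta).view(-1, 1, 1)
    c = torch.cos(theta).view(-1, 1, 1)
    # Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2.
    return eye + s * K + (1.0 - c) * (K @ K)
```

The logarithmic map is the inverse operation, recovering an axis-angle vector from a rotation matrix; together they let pose-related losses be computed in a smooth, minimal parameterization.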
Dataset and Benchmarking
The paper contributes significantly to benchmarking by introducing the “Visual Odometry with Inertial and Depth” (VOID) dataset. Recognizing that existing datasets do not couple visual and inertial data with depth, VOID offers a comprehensive testbed for evaluating unsupervised depth completion models. The method is also validated on the KITTI depth completion benchmark, achieving marginal gains over existing unsupervised methods, notably those using monocular video without auxiliary supervision from stereo pairs or lidar.
Implications and Future Directions
Wong et al.'s approach marks a significant step toward resource-efficient depth completion models that can operate where dense lidar is impractical, such as on mobile robotic platforms or in consumer electronics. The reliance on visual-inertial sensing aligns well with the ubiquity of cameras and IMUs in mobile devices, suggesting clear industrial applicability.
Future research could focus on improving pose estimation under more challenging motion, as suggested by the model's performance on the VOID dataset. Exploring neural architectures that better harmonize VIO data and monocular inputs to learn stronger depth priors is another promising direction.
In summary, this work bridges sparse visual-inertial cues and neural networks for depth completion without relying on manual supervision or dense data acquisition. It opens a path toward scalable, unsupervised depth models suited to real-world deployment, inviting further exploration to broaden practical use.