
Unsupervised Depth Completion with Calibrated Backprojection Layers (2108.10531v2)

Published 24 Aug 2021 in cs.CV, cs.AI, and cs.LG

Abstract: We propose a deep neural network architecture to infer dense depth from an image and a sparse point cloud. It is trained using a video stream and corresponding synchronized sparse point cloud, as obtained from a LIDAR or other range sensor, along with the intrinsic calibration parameters of the camera. At inference time, the calibration of the camera, which can be different from the one used for training, is fed as an input to the network along with the sparse point cloud and a single image. A Calibrated Backprojection Layer backprojects each pixel in the image to three-dimensional space using the calibration matrix and a depth feature descriptor. The resulting 3D positional encoding is concatenated with the image descriptor and the previous layer output to yield the input to the next layer of the encoder. A decoder, exploiting skip-connections, produces a dense depth map. The resulting Calibrated Backprojection Network, or KBNet, is trained without supervision by minimizing the photometric reprojection error. KBNet imputes missing depth values based on the training set, rather than on generic regularization. We test KBNet on public depth completion benchmarks, where it outperforms the state of the art by 30.5% indoor and 8.8% outdoor when the same camera is used for training and testing. When the test camera is different, the improvement reaches 62%. Code available at: https://github.com/alexklwong/calibrated-backprojection-network.

Citations (83)

Summary

  • The paper introduces KBNet, a novel unsupervised deep learning architecture featuring Calibrated Backprojection Layers (KB Layers) that utilize camera calibration parameters as input for dense depth completion.
  • Experimental results on KITTI and VOID benchmarks show significant improvements over state-of-the-art methods, demonstrating robustness and effectiveness across different environments and sensor calibrations.
  • This framework's ability to integrate varying calibration parameters allows for flexible deployment across diverse hardware configurations, holding promise for applications like autonomous driving where ground truth depth is scarce.

Overview of Unsupervised Depth Completion with Calibrated Backprojection Layers

The paper "Unsupervised Depth Completion with Calibrated Backprojection Layers" by Wong and Soatto presents a novel deep neural network architecture, referred to as KBNet, designed for inferring dense depth maps from sparse inputs. This architecture leverages unsupervised learning techniques to accomplish depth completion using monocular videos and sparse 3D point clouds obtained from range sensors like LIDAR, without the need for ground-truth annotations. The key innovation lies in its unique use of intrinsic camera calibration parameters as network inputs, which permits the model to generalize across different sensor platforms.

Key Contributions

  1. Calibrated Backprojection Layer (KB Layer): The architecture introduces KB layers that explicitly backproject 2D image pixels into 3D space using a depth feature descriptor and the camera calibration matrix. This yields an explicit 3D positional encoding, allowing for better modeling of spatial relationships in the scene. Unlike existing methods, this approach feeds the camera intrinsics into the network as an input rather than baking a fixed calibration into the trained weights, so the calibration used at test time may differ from the one used during training.
  2. Sparse-to-Dense Module (S2D): The S2D module is employed to transform sparse depth data into a dense representation by leveraging various pooling operations and convolutions. This prepares the depth features before they are processed by the KBNet, ensuring efficient utilization of sparse input data.
  3. Photometric Euclidean Reprojection Loss (PERL): The network is trained with a reprojection loss that minimizes the discrepancy between input images and their reconstructions, obtained by warping temporally adjacent frames using the estimated depth and the relative camera poses recovered from the monocular sequences. This unsupervised strategy avoids the need for dense depth annotations.
  4. Inductive Bias for Efficient Architecture: By incorporating intrinsic calibration into the architecture, the model achieves a strong inductive bias that enhances performance while maintaining a relatively small computational footprint. This includes using fewer parameters compared to state-of-the-art methods.
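To make the KB layer's core geometric operation concrete, the following is a minimal numpy sketch of pinhole backprojection: every pixel (u, v) with depth d is lifted to the 3D point d · K⁻¹ [u, v, 1]ᵀ. The function name and array shapes here are illustrative, not the paper's API — the actual KB layers operate on learned depth feature descriptors rather than raw depth values.

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) with depth d to the 3D point d * K^{-1} [u, v, 1]^T.

    depth : (H, W) array of per-pixel depth values
    K     : (3, 3) camera intrinsic matrix
    returns an (H, W, 3) array of 3D coordinates (the positional encoding)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel coordinate grid
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)     # (H, W, 3) homogeneous coords
    rays = pixels @ np.linalg.inv(K).T                      # K^{-1} p for every pixel
    return depth[..., None] * rays                          # scale each ray by its depth

# Toy example: 2x2 image, identity intrinsics, constant depth of 2
K = np.eye(3)
depth = np.full((2, 2), 2.0)
xyz = backproject(depth, K)                                 # xyz[v, u] = (2u, 2v, 2)
```

Passing a different K at test time simply changes the rays, which is what lets the trained network adapt to a new camera without retraining.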
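The training signal can likewise be sketched. Below is a deliberately simplified, non-differentiable version of a photometric reprojection loss, assuming known intrinsics K and relative pose (R, t); all names are illustrative. Real unsupervised pipelines, including KBNet's, use differentiable bilinear sampling and typically add structural-similarity and smoothness terms.

```python
import numpy as np

def photometric_reprojection_loss(target, source, depth, K, R, t):
    """Warp `source` into the target view via the estimated depth and relative
    pose (R, t), then measure the mean absolute photometric error.
    Nearest-neighbour resampling is used here for brevity only.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Backproject target pixels to 3D, then transform into the source frame
    pts = (depth.reshape(-1, 1) * (pixels @ np.linalg.inv(K).T)) @ R.T + t
    proj = pts @ K.T
    uv = proj[:, :2] / proj[:, 2:3]                 # perspective divide
    ui = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    vi = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    warped = source[vi, ui].reshape(H, W)           # sample source at projected coords
    return np.abs(target - warped).mean()

# Identity pose: the warp is a no-op, so the loss is exactly zero
img = np.random.rand(4, 4)
loss = photometric_reprojection_loss(img, img, np.ones((4, 4)),
                                     np.eye(3), np.eye(3), np.zeros(3))
```

Minimizing this discrepancy over a video stream supervises the depth estimate without any ground-truth depth labels.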

Experimental Results

The KBNet was rigorously tested on established benchmarks such as KITTI and VOID. The results demonstrate a substantial improvement over the state of the art. On the KITTI benchmark, the method achieved up to 13.7% improvement over the baseline, and up to 62% improvement when the camera calibrations differ between training and testing. For indoor scenarios on the VOID dataset, the method yielded an average improvement of 30.5% over the best-performing existing approaches. These results highlight KBNet's robustness and effectiveness in both indoor and outdoor environments.

Implications and Future Directions

The framework’s ability to integrate varied calibration parameters as part of its input paves the way for flexible deployment across different hardware settings. This feature is particularly valuable for applications that operate under varying conditions, such as autonomous driving or robotic manipulation, where hardware configurations might differ across operational scenarios.

Looking ahead, further work may strengthen the robustness of calibration adaptability and explore more sophisticated loss functions. Integrating the architecture into real-time pipelines could broaden its practical reach, and advances in hardware-accelerated AI platforms may enable real-time processing on embedded systems.

In conclusion, the proposed KBNet framework demonstrates significant strides in unsupervised depth completion, marrying model efficiency with cross-platform adaptability. It holds promise for expansive applications where depth perception is crucial and ground-truth depth data is scarce or unavailable.
