- The paper introduces a novel trinocular training paradigm that simulates three-view conditions from standard binocular stereo datasets via an interleaved training procedure.
- The proposed 3Net model mitigates stereo-specific artifacts and delivers superior depth accuracy on the KITTI dataset.
- This approach paves the way for robust real-world applications in robotics, autonomous vehicles, and augmented reality by enhancing depth perception.
Unsupervised Monocular Depth Estimation via Trinocular Assumptions
The paper by Poggi et al. explores an innovative approach to monocular depth estimation by introducing a trinocular assumption into the training paradigm. The proposal is motivated by the limitations of traditional unsupervised training on stereo pairs, which typically suffers from artifacts around occlusions and image borders.
Methodology and Innovation
The authors propose training a convolutional neural network (CNN) to predict disparity under a trinocular arrangement: a single central image serves as the reference, while two auxiliary images, one on either side, provide supervision. Because regions occluded with respect to one auxiliary view are generally visible in the other, this design mitigates the stereo-specific artifacts that arise around occlusions and image borders, improving depth map accuracy.
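To make the trinocular supervision signal concrete, the following is a minimal NumPy sketch of a photometric reconstruction loss over the two stereo pairs that share the central reference image. The function names, the nearest-neighbour warping, and the plain L1 loss are my simplifications for illustration, not the authors' actual implementation (which operates inside a CNN training loop with bilinear sampling and additional loss terms).

```python
import numpy as np

def warp_horizontal(img, disparity, direction):
    """Resample `img` along the horizontal axis by a per-pixel disparity.

    direction = -1 samples a RIGHT auxiliary view at x - d,
    direction = +1 samples a LEFT auxiliary view at x + d.
    Nearest-neighbour sampling keeps this sketch dependency-free;
    real implementations use differentiable bilinear sampling.
    """
    h, w = img.shape[:2]
    cols = np.arange(w)[None, :] + direction * disparity  # source column per pixel
    cols = np.clip(np.round(cols).astype(int), 0, w - 1)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return img[rows, cols]

def trinocular_reconstruction_loss(center, left, right, disp_l, disp_r):
    """L1 photometric loss for both stereo pairs sharing the center view."""
    left_warped = warp_horizontal(left, disp_l, direction=+1)    # left view -> center
    right_warped = warp_horizontal(right, disp_r, direction=-1)  # right view -> center
    return np.abs(center - left_warped).mean() + np.abs(center - right_warped).mean()
```

With zero disparity and identical views the loss vanishes, which is the sanity check one would expect from a reconstruction-based objective.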
The key challenge in this framework is the scarcity of trinocular datasets. To address it, the authors propose a novel "interleaved training" procedure: the network is trained on existing binocular datasets by alternating between the two stereo configurations in which the reference image acts as the right or the left view of a pair, simulating trinocular conditions. This enables the network to indirectly learn a trinocular representation even though no explicit trinocular data is available.
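The scheduling idea behind interleaved training can be sketched as follows. This is a toy stand-in, assuming a two-head architecture (one disparity head per side) with a shared encoder; the counters, loss, and alternation rule are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_photometric(a, b):
    """Toy photometric loss between two views (stand-ins for real frames)."""
    return float(np.abs(a - b).mean())

# Running update counts per disparity head, standing in for gradient steps.
updates = {"left_head": 0, "right_head": 0}

for step in range(8):
    # Each step draws one BINOCULAR pair; no trinocular data is required.
    reference = rng.random((4, 8))
    auxiliary = rng.random((4, 8))

    if step % 2 == 0:
        # Even steps: the reference image plays the RIGHT view of the pair,
        # so the loss supervises the head predicting disparity toward the left.
        loss = l1_photometric(reference, auxiliary)
        updates["left_head"] += 1
    else:
        # Odd steps: the reference image plays the LEFT view of the pair,
        # supervising the head predicting disparity toward the right.
        loss = l1_photometric(reference, auxiliary)
        updates["right_head"] += 1

# Over training, both heads receive an equal number of updates, so the shared
# encoder implicitly learns a trinocular representation from binocular data.
```

The design choice worth noting is that alternation happens at the batch level: neither head ever sees a genuine three-view sample, yet both are trained against the same shared reference representation.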
Empirical Evaluation
The experiments conducted on the KITTI dataset demonstrate substantial improvements over state-of-the-art unsupervised monocular depth estimation methods trained strictly on binocular stereo pairs.
- Performance on KITTI split: The proposed 3Net is competitive even without post-processing. The authors report that 3Net surpasses the method of Godard et al., achieving better results on most depth error and accuracy metrics.
- Evaluation on Eigen split: Evaluations on the Eigen split further confirm the strength of 3Net, which outperforms many unsupervised approaches. Notably, it does so without relying on temporal sequence data, which some competing methods require.
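For readers unfamiliar with the "depth error and accuracy measures" referenced above, the following sketch implements the standard metrics used in the Eigen-split evaluation protocol (absolute/squared relative error, RMSE, log RMSE, and threshold accuracies). The function name and dictionary keys are my own; the formulas are the conventional ones from the monocular depth literature.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth error and accuracy measures.

    pred, gt: arrays of positive depth values at valid ground-truth pixels.
    Lower is better for errors; higher is better for a1/a2/a3 accuracies.
    """
    ratio = np.maximum(pred / gt, gt / pred)  # symmetric prediction/GT ratio
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),        # fraction within 25% of GT
        "a2": float(np.mean(ratio < 1.25 ** 2)),
        "a3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

A perfect prediction yields zero for every error term and 1.0 for every threshold accuracy, which makes the function easy to sanity-check.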
Analysis and Implications
The efficacy of 3Net underscores the potential of simulated trinocular setups for monocular depth estimation, especially in contexts where acquiring ground-truth depth is infeasible. Practically, the approach could inform applications in robotics and autonomous vehicles, where efficient and accurate environment perception is crucial.
The authors' method opens a pathway for potential expansions, such as real trinocular training with emerging sensor technologies. Furthermore, it exposes possibilities in tasks involving view synthesis and novel applications like augmented reality, where accurate depth perception from minimal data can significantly enhance user experience.
Future Directions
This work sets a foundation for increased exploration into multi-view synthetic training paradigms. Potential future research could venture into synthesizing additional views to simulate even more extensive multi-baseline conditions. Additionally, integrating this trained model into real-time systems and exploring the operational viability in dynamic environments are exciting avenues for future exploration.
In sum, Poggi et al. provide a significant contribution to computer vision, demonstrating how strategic modifications to the training protocol can yield substantial gains in monocular depth estimation, independent of the constraints of direct trinocular data acquisition.