- The paper introduces a self-supervised learning framework for dense depth estimation in monocular endoscopy, utilizing sequential video data and sparse multi-view stereo reconstructions.
- The method employs a Siamese network architecture with novel Scale-invariant Weighted Loss and Depth Consistency Loss functions to generate accurate, dense depth maps.
- Experimental results show the approach achieves submillimeter residual errors and avoids the need for manual labeling, scaling, or CT scans for training or deployment.
Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy
This paper presents a self-supervised methodology for training convolutional neural networks (CNNs) to perform dense depth estimation from monocular endoscopy data without pre-modeled anatomical or shading information. The approach exploits sequential frames from monocular endoscopic videos and uses the sparse but reliable depth points recovered by multi-view stereo reconstruction techniques such as Structure from Motion (SfM) as the supervisory signal (a toy sketch of this sparse supervision is given below). Crucially, the method avoids the need for manual scaling, labeling, or CT scans during both training and deployment.
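To make the supervision signal concrete, the sketch below projects sparse SfM points into a per-frame sparse depth map. The function name, array layouts, and the zero-means-unsupervised convention are illustrative assumptions, not details taken from the paper's implementation.

```python
import numpy as np

def sparse_depth_from_sfm(points_cam, K, height, width):
    """Project sparse SfM points (already expressed in one video frame's camera
    coordinates) into a sparse depth map. Pixels that receive no point keep
    depth 0, which downstream losses treat as 'no supervision here'."""
    depth = np.zeros((height, width), dtype=np.float32)
    for X, Y, Z in points_cam:
        if Z <= 0:                                   # point behind the camera, skip it
            continue
        u = int(round(K[0, 0] * X / Z + K[0, 2]))    # pinhole projection, x axis
        v = int(round(K[1, 1] * Y / Z + K[1, 2]))    # pinhole projection, y axis
        if 0 <= u < width and 0 <= v < height:
            # keep the nearest point when several land on the same pixel
            if depth[v, u] == 0 or Z < depth[v, u]:
                depth[v, u] = Z
    return depth
```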
Methodology
The researchers employ a two-branch Siamese network architecture with shared weights to process pairs of endoscopic images. The training framework integrates a Depth Map Scaling layer, which rescales each prediction to the scale of the per-sequence SfM reconstruction, and a Depth Map Warping layer, which reprojects one view's depth into the other using the relative camera poses recovered by SfM (both sketched below). Two novel loss functions, the Scale-invariant Weighted Loss and the Depth Consistency Loss, incorporate the sparse depth annotations and enforce spatial coherence between the paired predictions.
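A minimal sketch of these two layers follows, assuming PyTorch tensors of shape (batch, 1, H, W), a shared pinhole intrinsic matrix, and SfM-derived relative poses. All function names, shapes, and the least-squares scaling choice are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def scale_depth(pred_depth, sparse_depth, sparse_mask, eps=1e-8):
    """Depth Map Scaling (sketch): rescale the network's relative depth so it
    matches the scale of this frame's sparse SfM depths. A per-frame
    least-squares scale over the annotated pixels is one simple choice."""
    num = (sparse_mask * sparse_depth * pred_depth).sum(dim=(1, 2, 3))
    den = (sparse_mask * pred_depth * pred_depth).sum(dim=(1, 2, 3)) + eps
    return (num / den).view(-1, 1, 1, 1) * pred_depth

def warp_depth(depth_src, depth_tgt, K, K_inv, T_tgt_to_src):
    """Depth Map Warping (sketch): back-project the target view's scaled depth,
    transform it with the SfM relative pose, and sample the source view's depth
    map at the projected pixels so the two predictions can be compared."""
    b, _, h, w = depth_tgt.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()    # 3 x H x W homogeneous pixels
    rays = (K_inv @ pix.view(3, -1)).view(1, 3, h, w)                  # back-projected viewing rays
    pts = (rays * depth_tgt).view(b, 3, -1)                            # 3D points in the target frame
    pts = T_tgt_to_src[:, :3, :3] @ pts + T_tgt_to_src[:, :3, 3:]      # move points into the source frame
    proj = K @ pts                                                     # project into the source image
    z = proj[:, 2].clamp(min=1e-6)
    u = 2.0 * (proj[:, 0] / z) / (w - 1) - 1.0                         # normalised sampling coordinates
    v = 2.0 * (proj[:, 1] / z) / (h - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)
    warped_src_depth = F.grid_sample(depth_src, grid, align_corners=True)
    tgt_depth_in_src = pts.view(b, 3, h, w)[:, 2:3]                    # z of target points seen from source
    return warped_src_depth, tgt_depth_in_src
```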
The Scale-invariant Weighted Loss makes the supervision invariant to global scale, so the network learns to predict correct depth ratios rather than absolute depth and can generalize across different patients and endoscopes. The Depth Consistency Loss imposes geometric constraints between the two views' predictions, densifying the supervisory signal and mitigating the overfitting caused by sparse annotations. Both losses are sketched below.
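The sketch below gives one plausible form of each loss, evaluated only where sparse annotations exist; the specific weighting and normalisation schemes are assumptions, not the paper's exact formulas.

```python
import torch

def scale_invariant_weighted_loss(pred_depth, sparse_depth, weights, eps=1e-8):
    """Sketch of a scale-invariant log-depth loss evaluated only at sparsely
    annotated pixels. 'weights' is a per-pixel reliability map that is zero
    wherever SfM produced no point; the weighting scheme is illustrative."""
    mask = (weights > 0).float()
    d = mask * (torch.log(pred_depth.clamp(min=eps)) -
                torch.log(sparse_depth.clamp(min=eps)))
    w_sum = weights.sum(dim=(1, 2, 3)).clamp(min=eps)
    first = (weights * d ** 2).sum(dim=(1, 2, 3)) / w_sum
    second = ((weights * d).sum(dim=(1, 2, 3)) / w_sum) ** 2
    return (first - second).mean()      # penalises wrong depth ratios, not global scale

def depth_consistency_loss(depth_a, depth_b_warped_to_a, eps=1e-8):
    """Sketch of a consistency penalty between one view's prediction and the
    other view's prediction warped into it (see warp_depth above); the
    normalisation keeps the term independent of the overall depth magnitude."""
    diff = (depth_a - depth_b_warped_to_a) ** 2
    norm = (depth_a ** 2 + depth_b_warped_to_a ** 2).mean().clamp(min=eps)
    return diff.mean() / norm
```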
Experimental Evaluation
The experimental setup uses RGB endoscopic images, sparse depth maps from SfM, and the endoscopes' intrinsic parameters. Training used data from 22 video subsequences, with validation and testing on different scenes from two anonymized patients. For quantitative evaluation, the predicted depth maps were registered to the corresponding CT-based models, yielding submillimeter average residual errors of 0.84 (±0.10) mm for Patient 1 and 0.63 (±0.19) mm for Patient 2 (the metric is sketched below).
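A sketch of one way to compute such a residual error follows, assuming the reconstruction has already been registered to the CT-derived surface and both point sets are Nx3 arrays in the same metric (millimetre) frame; the nearest-neighbor formulation and function name are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_residual_error_mm(reconstructed_pts, ct_surface_pts):
    """Average distance from each reconstructed point to its nearest point on
    the CT-derived surface, after registration. Both inputs are Nx3 arrays in
    millimetres."""
    tree = cKDTree(ct_surface_pts)
    dists, _ = tree.query(reconstructed_pts)   # nearest-neighbor distances
    return float(dists.mean())
```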
Implications
This work has noteworthy implications for endoscopic navigation systems. Generating dense depth maps directly from monocular endoscopy could simplify integration into clinical workflows and potentially reduce costs by obviating external hardware. Because depth estimation is self-supervised by sparse reconstructions, the methodology can scale to large amounts of unlabeled endoscopic video.
Future Directions
Potential future directions include expanding the dataset to assess generalizability across diverse patients, endoscopes, and anatomical structures. Replacing the single-frame depth estimation network with multi-frame architectures could further improve prediction accuracy, and adding automated error detection to filter incorrect SfM reconstructions would enhance robustness and reliability.
In summary, the paper presents a robust framework for dense depth estimation that leverages self-supervision through sparse annotations and multi-view stereo reconstructions to eliminate the prerequisite for supervised depth data or CT scans. As the field advances, integrating such self-supervised systems could play a pivotal role in refining computer vision-based navigation for minimally invasive surgical procedures.