- The paper introduces a self-supervised learning framework for dense depth estimation in monocular endoscopy, utilizing sequential video data and sparse multi-view stereo reconstructions.
- The method employs a Siamese network architecture with novel Scale-invariant Weighted Loss and Depth Consistency Loss functions to generate accurate, dense depth maps.
- Experimental results show the approach achieves submillimeter residual errors and avoids the need for manual labeling, scaling, or CT scans for training or deployment.
Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy
This paper presents a self-supervised methodology for training convolutional neural networks (CNNs) to perform dense depth estimation from monocular endoscopy data without pre-modeled anatomical or shading information. The approach exploits sequential frames from monocular endoscopic videos and uses the sparse but reliable depth points recovered by multi-view stereo reconstruction techniques such as Structure from Motion (SfM) as the supervisory signal (a toy sketch of this sparse supervision is given below). Crucially, the method avoids the need for manual scaling, labeling, or CT scans during both training and deployment.
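To make the supervision signal concrete, the sketch below projects sparse SfM points into a per-frame sparse depth map. The function name, array layouts, and the zero-means-unsupervised convention are illustrative assumptions, not details taken from the paper's implementation.

```python
import numpy as np

def sparse_depth_from_sfm(points_cam, K, height, width):
    """Project sparse SfM points (already expressed in one video frame's camera
    coordinates) into a sparse depth map. Pixels that receive no point keep
    depth 0, which downstream losses treat as 'no supervision here'."""
    depth = np.zeros((height, width), dtype=np.float32)
    for X, Y, Z in points_cam:
        if Z <= 0:                                   # point behind the camera, skip it
            continue
        u = int(round(K[0, 0] * X / Z + K[0, 2]))    # pinhole projection, x axis
        v = int(round(K[1, 1] * Y / Z + K[1, 2]))    # pinhole projection, y axis
        if 0 <= u < width and 0 <= v < height:
            # keep the nearest point when several land on the same pixel
            if depth[v, u] == 0 or Z < depth[v, u]:
                depth[v, u] = Z
    return depth
```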
Methodology
The researchers employ a two-branch Siamese network architecture with shared weights to process pairs of endoscopic images. The training framework integrates a Depth Map Scaling layer, which rescales each prediction to the scale of the per-sequence SfM reconstruction, and a Depth Map Warping layer, which reprojects one view's depth into the other using the relative camera poses recovered by SfM (both sketched below). Two novel loss functions, the Scale-invariant Weighted Loss and the Depth Consistency Loss, incorporate the sparse depth annotations and enforce spatial coherence between the paired predictions.
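A minimal sketch of these two layers follows, assuming PyTorch tensors of shape (batch, 1, H, W), a shared pinhole intrinsic matrix, and SfM-derived relative poses. All function names, shapes, and the least-squares scaling choice are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def scale_depth(pred_depth, sparse_depth, sparse_mask, eps=1e-8):
    """Depth Map Scaling (sketch): rescale the network's relative depth so it
    matches the scale of this frame's sparse SfM depths. A per-frame
    least-squares scale over the annotated pixels is one simple choice."""
    num = (sparse_mask * sparse_depth * pred_depth).sum(dim=(1, 2, 3))
    den = (sparse_mask * pred_depth * pred_depth).sum(dim=(1, 2, 3)) + eps
    return (num / den).view(-1, 1, 1, 1) * pred_depth

def warp_depth(depth_src, depth_tgt, K, K_inv, T_tgt_to_src):
    """Depth Map Warping (sketch): back-project the target view's scaled depth,
    transform it with the SfM relative pose, and sample the source view's depth
    map at the projected pixels so the two predictions can be compared."""
    b, _, h, w = depth_tgt.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()    # 3 x H x W homogeneous pixels
    rays = (K_inv @ pix.view(3, -1)).view(1, 3, h, w)                  # back-projected viewing rays
    pts = (rays * depth_tgt).view(b, 3, -1)                            # 3D points in the target frame
    pts = T_tgt_to_src[:, :3, :3] @ pts + T_tgt_to_src[:, :3, 3:]      # move points into the source frame
    proj = K @ pts                                                     # project into the source image
    z = proj[:, 2].clamp(min=1e-6)
    u = 2.0 * (proj[:, 0] / z) / (w - 1) - 1.0                         # normalised sampling coordinates
    v = 2.0 * (proj[:, 1] / z) / (h - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)
    warped_src_depth = F.grid_sample(depth_src, grid, align_corners=True)
    tgt_depth_in_src = pts.view(b, 3, h, w)[:, 2:3]                    # z of target points seen from source
    return warped_src_depth, tgt_depth_in_src
```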
The Scale-invariant Weighted Loss makes the supervision invariant to global scale, so the network learns to predict correct depth ratios rather than absolute depth and can generalize across different patients and endoscopes. The Depth Consistency Loss imposes geometric constraints between the two views' predictions, densifying the supervisory signal and mitigating the overfitting caused by sparse annotations. Both losses are sketched below.
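The sketch below gives one plausible form of each loss, evaluated only where sparse annotations exist; the specific weighting and normalisation schemes are assumptions, not the paper's exact formulas.

```python
import torch

def scale_invariant_weighted_loss(pred_depth, sparse_depth, weights, eps=1e-8):
    """Sketch of a scale-invariant log-depth loss evaluated only at sparsely
    annotated pixels. 'weights' is a per-pixel reliability map that is zero
    wherever SfM produced no point; the weighting scheme is illustrative."""
    mask = (weights > 0).float()
    d = mask * (torch.log(pred_depth.clamp(min=eps)) -
                torch.log(sparse_depth.clamp(min=eps)))
    w_sum = weights.sum(dim=(1, 2, 3)).clamp(min=eps)
    first = (weights * d ** 2).sum(dim=(1, 2, 3)) / w_sum
    second = ((weights * d).sum(dim=(1, 2, 3)) / w_sum) ** 2
    return (first - second).mean()      # penalises wrong depth ratios, not global scale

def depth_consistency_loss(depth_a, depth_b_warped_to_a, eps=1e-8):
    """Sketch of a consistency penalty between one view's prediction and the
    other view's prediction warped into it (see warp_depth above); the
    normalisation keeps the term independent of the overall depth magnitude."""
    diff = (depth_a - depth_b_warped_to_a) ** 2
    norm = (depth_a ** 2 + depth_b_warped_to_a ** 2).mean().clamp(min=eps)
    return diff.mean() / norm
```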
Experimental Evaluation
The experimental setup uses RGB endoscopic images, sparse depth maps from SfM, and the endoscopes' intrinsic parameters. Training used data from 22 video subsequences, with validation and testing on different scenes from two anonymized patients. For quantitative evaluation, the predicted depth maps were registered to the corresponding CT-based models, yielding submillimeter average residual errors of 0.84 (±0.10) mm for Patient 1 and 0.63 (±0.19) mm for Patient 2 (the metric is sketched below).
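A sketch of one way to compute such a residual error follows, assuming the reconstruction has already been registered to the CT-derived surface and both point sets are Nx3 arrays in the same metric (millimetre) frame; the nearest-neighbor formulation and function name are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_residual_error_mm(reconstructed_pts, ct_surface_pts):
    """Average distance from each reconstructed point to its nearest point on
    the CT-derived surface, after registration. Both inputs are Nx3 arrays in
    millimetres."""
    tree = cKDTree(ct_surface_pts)
    dists, _ = tree.query(reconstructed_pts)   # nearest-neighbor distances
    return float(dists.mean())
```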
Implications
This work has noteworthy implications for endoscopic navigation systems. Generating dense depth maps directly from monocular endoscopy could simplify integration into clinical workflows and potentially reduce costs by obviating external hardware. Because depth estimation is self-supervised by sparse reconstructions, the methodology can scale to large amounts of unlabeled endoscopic video.
Future Directions
Potential future directions include expanding the dataset to assess generalizability across diverse patients, endoscopes, and anatomical structures. Replacing the single-frame depth estimation network with multi-frame architectures could further improve prediction accuracy, and adding automated error detection to filter incorrect SfM reconstructions would enhance robustness and reliability.
In summary, the paper presents a robust framework for dense depth estimation that leverages self-supervision through sparse annotations and multi-view stereo reconstructions to eliminate the prerequisite for supervised depth data or CT scans. As the field advances, integrating such self-supervised systems could play a pivotal role in refining computer vision-based navigation for minimally invasive surgical procedures.