Colonoscopy Depth Estimation Datasets
- Colonoscopy depth estimation datasets are collections of colonoscopic videos and images paired with per-pixel ground-truth depth and 3D annotations.
- They employ diverse methods including high-fidelity silicone phantoms, synthetic rendering, and multimodal 2D–3D registration to achieve sub-millimeter accuracy.
- These datasets enable benchmarking of monocular depth estimation, 3D reconstruction, and navigation algorithms while addressing challenges like specular reflections and non-Lambertian surfaces.
Colonoscopy depth estimation datasets are specialized resources that provide video or image data from colonoscopic examinations, accompanied by per-pixel ground-truth depth information or additional 3D scene annotations. These datasets are critical for the quantitative benchmarking and development of monocular depth estimation, 3D reconstruction, and navigation algorithms in the challenging environment of optical colonoscopy. Accurate depth annotation remains a technical bottleneck due to monocular acquisition, strong specularities, non-Lambertian reflectance of colonic tissue, and the lack of accessible in vivo ground truth. The leading datasets address these challenges using clinical video, high-fidelity silicone phantoms, synthetic rendering, and multimodal 2D–3D registration.
1. Major Public Colonoscopy Depth Datasets
C3VD and C3VDv2
The Colonoscopy 3D Video Dataset (C3VD) and its successor C3VDv2 provide the most widely used colonoscopy depth benchmarks with realistic, quantitative ground truth. C3VD comprises 22 video sequences (10,015 frames; 1080 × 1350 px) recorded with a high-definition clinical colonoscope traversing accurate, 3D-printed silicone colon phantoms. Pixel-wise metric depth, surface normals, optical flow, occlusion masks, dense 6-DoF pose, coverage maps, and high-fidelity 3D models are supplied for all RGB frames. Depth maps are generated by aligning optical video frames to known phantom geometry via a multimodal 2D–3D registration pipeline: conditional GAN monocular depth prediction is fused with edge-based alignment using evolutionary optimization. This achieves sub-millimeter translation error and sub-degree rotation error in synthetic validation (Bobrow et al., 2022).
C3VDv2 expands this approach to 192 videos (74,071 frames; identical resolution), using 60 unique silicone phantom segments. The dataset covers greater anatomical and visual diversity: four texture variants per segment, debris-filled scenarios, mucous pools, synthetic blood, foam, artifacts obscuring the lens, close-up en-face views, and fast/scoped camera motion. Added realism comes from 15 deformation videos (no per-pixel GT) and 8 simulated full-length screening videos (95,300 frames) with synchronized EM-tracker pose but without pixel-wise depth. Depth annotation is achieved by registering the high-resolution mesh to the recorded physical trajectory and rendering with a virtual fisheye camera. Data formats are 16-bit TIFF for depth/normals/flow, 8-bit PNG for occlusion, and standard 4×4 matrices for pose (Golhar et al., 30 Jun 2025).
SimCol3D
SimCol3D is a fully synthetic benchmark created for the MICCAI EndoVis 2022 challenge on 3D colonoscopy reconstruction. It comprises three patient-specific colon models, each mesh derived from clinical CT data to realistically capture lumen geometry, folds, and diameter variance. All frames are rendered in Unity 3D under floodlight illumination and near-Lambertian reflectance using a pinhole camera model matched to clinical intrinsics; only RGB is used for depth estimation (475 × 475 px). Depth for each pixel is recorded as the z-distance along the camera ray to the first visible surface, normalized to the range [0, 1]. The dataset is structured into three subsets: SynCol I (public mesh), SynCol II (Patient A), SynCol III (Patient B, held out). Training data includes 24 trajectories (14,412 frames total); held-out tests use an additional 9,009 frames. Each synthetic frame is paired with depth, intrinsics, and camera pose. Real clinical sequences for pose estimation are also provided, leveraging pseudo ground-truth from the EndoMapper corpus and COLMAP (Rau et al., 2023).
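As a concrete illustration of this convention, the sketch below loads one normalized depth frame and re-expands it to metric units. The file path and the maximum-range constant `D_MAX_CM` are illustrative assumptions, not values taken from the SimCol3D documentation.

```python
import numpy as np
from PIL import Image

# Assumption: depth is stored as a 16-bit PNG whose integer codes map to
# [0, 1] after division by 65535; D_MAX_CM is a hypothetical re-expansion scale.
D_MAX_CM = 20.0  # replace with the scale documented for the dataset

def load_simcol_depth(path: str) -> np.ndarray:
    """Load a normalized 16-bit depth PNG and convert it to centimeters."""
    raw = np.asarray(Image.open(path), dtype=np.float64)
    return raw / 65535.0 * D_MAX_CM

depth_cm = load_simcol_depth("SyntheticColon_I/Frames_S1/Depth_0000.png")
print(depth_cm.min(), depth_cm.max())
```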
Other Datasets: Synthetic and Real-World Clinical
| Dataset | Modality | Size |
|---|---|---|
| C3VD | Phantom/Real | 10,015 frames |
| C3VDv2 | Phantom/Simulated | 74,071 (+deformation, +screening) |
| SimCol3D | Synthetic | 23,421 frames (33 sequences) |
| Ruano et al. (2023) | Synthetic | 248,400 frames |
| Hyper Kvasir | Clinical Video | 16,976 train + 786 test (no GT depth) |
| Structure-Preserving | Clinical sequences | 17,500 frames (oblique/en face) |
(For detailed acquisition and content, see sections below.)
2. Ground-Truth Acquisition and Annotation Protocols
Phantom Datasets and 2D–3D Registration
In C3VD/C3VDv2, underlying phantom models are digitally sculpted with anatomic fidelity (sigmoid to cecum), 3D-printed, and cast in multi-layer pigmented silicone to emulate mucosal vasculature and specular reflectance (Bobrow et al., 2022, Golhar et al., 30 Jun 2025). The colonoscope is steered through each segment along a robotic or manual path, and full 6-DoF trajectories are captured (by robot kinematics or EM tracker). At each recorded pose, a virtual camera aligned with the physical intrinsics is used to render metric depth and normals from the digital mesh.
The ground-truth registration pipeline fuses cGAN-based monocular depth predictions, edge-based Canny/Gaussian feature alignment, and evolutionary optimization (CMA-ES) to minimize projection error between recorded and rendered edge maps over $K$ keyframes. The registration objective takes the form

$$T^{*} = \arg\min_{T} \sum_{k=1}^{K} \left\| \Phi\big(\hat{D}_k\big) - \Phi\big(R(\mathcal{M},\, T P_k)\big) \right\|^{2},$$

where $\hat{D}_k$ is the predicted depth from the cGAN for keyframe $k$, $\Phi(\cdot)$ denotes Canny+Gaussian feature extraction, and $R(\mathcal{M}, T P_k)$ renders depth from the phantom mesh $\mathcal{M}$ at the corrected pose $T P_k$. Average translation/rotation error is 0.321 mm and 0.159°, improving to 0.143 mm/0.063° when leveraging multi-frame context ($K=5$) (Bobrow et al., 2022).
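To make the optimization tangible, here is a self-contained toy sketch of edge-based alignment with CMA-ES (via the pycma package). A translated template image stands in for the mesh renderer and the search space is a 2-DoF shift rather than a 6-DoF pose, so this illustrates the objective structure only, not the actual C3VD pipeline.

```python
import numpy as np
import cv2
import cma  # pip install cma

BASE = np.zeros((128, 128), np.uint8)
cv2.circle(BASE, (64, 64), 30, 255, 2)  # toy "phantom" silhouette

def render(shift) -> np.ndarray:
    """Toy stand-in for mesh rendering: translate the template by `shift`."""
    m = np.float32([[1, 0, shift[0]], [0, 1, shift[1]]])
    return cv2.warpAffine(BASE, m, (128, 128))

def edge_features(img: np.ndarray) -> np.ndarray:
    """Canny edges smoothed with a Gaussian, mirroring the described features."""
    edges = cv2.Canny(img, 50, 150)
    return cv2.GaussianBlur(edges.astype(np.float32), (0, 0), sigmaX=3.0)

target = edge_features(render([0.0, 0.0]))  # "observed" keyframe edge map

def cost(shift) -> float:
    # Squared difference between blurred edge maps of render and observation.
    return float(np.sum((edge_features(render(shift)) - target) ** 2))

# CMA-ES searches the toy 2-DoF space; the real pipeline optimizes 6-DoF pose.
best, _ = cma.fmin2(cost, x0=[8.0, -6.0], sigma0=4.0, options={"verbose": -9})
print("recovered shift:", best)  # should approach (0, 0)
```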
Synthetic Datasets
SimCol3D's synthetic data uses 3D meshes from CT reconstructions, rendered in Unity 3D with variable centerline paths, randomized initial pose, and realistic clinical intrinsics. Ground-truth per-pixel depth is obtained by z-buffer simulation; for each pixel, the camera-to-surface distance is stored, normalized, and later re-expanded to metric. Each frame's absolute pose and camera intrinsics are saved per frame (Rau et al., 2023).
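With depth, intrinsics, and pose available per frame, each depth map can be back-projected into a 3D point cloud. The sketch below assumes z-buffer (perpendicular) depth and a pinhole model; if a dataset stores along-ray distance instead, the values must first be converted. The intrinsics shown are hypothetical.

```python
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project a z-buffer depth map (H, W) to camera-frame points (H*W, 3)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) / K[0, 0] * z  # X = (u - cx) / fx * Z
    y = (v.ravel() - K[1, 2]) / K[1, 1] * z  # Y = (v - cy) / fy * Z
    return np.stack([x, y, z], axis=1)

# Hypothetical pinhole intrinsics for a 475 x 475 px frame.
K = np.array([[230.0, 0.0, 237.5],
              [0.0, 230.0, 237.5],
              [0.0, 0.0, 1.0]])
points_cam = backproject(np.full((475, 475), 5.0), K)  # dummy constant-depth map

# World-frame points follow from the per-frame 4x4 camera-to-world pose T:
# points_world = (T[:3, :3] @ points_cam.T).T + T[:3, 3]
```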
The synthetic dataset of Ruano et al. (2023) provides 248,400 frames across five visual complexity levels, rendered in Blender using a cone light source and a 110° field of view. Gamma correction is applied to emphasize depths closer than 15 cm; depth values range from 0 to 25 cm (Ruano et al., 2023).
3. Dataset Structure, Diversity, and Benchmark Splits
C3VD and C3VDv2
C3VD contains 22 videos, covering four colon regions and multiple textural/trajectory variations, with train, validation, and test splits at the video level. The test set includes one previously unseen segment (descending colon), supporting generalization studies. All data are provided at full 1080 × 1350 px resolution.
C3VDv2 enhances anatomical and visual challenge with debris, fluids, blood, and lens artifacts. It divides data into clean segments, debris-filled segments, and deformation videos. Suggested splits include: (1) segment-based (e.g., c1 for train, c2 for test), and (2) stratified 80/10/10 across trajectory/artifact modes (Golhar et al., 30 Jun 2025).
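For the stratified 80/10/10 suggestion, splits should be drawn at the video level so that no trajectory leaks frames across splits. A minimal sketch, assuming a hypothetical mapping from video IDs to artifact modes:

```python
import random
from collections import defaultdict

def stratified_video_split(video_modes: dict, seed: int = 0):
    """80/10/10 split of video IDs, stratified by artifact/trajectory mode."""
    rng = random.Random(seed)
    by_mode = defaultdict(list)
    for vid, mode in video_modes.items():
        by_mode[mode].append(vid)
    train, val, test = [], [], []
    for vids in by_mode.values():
        rng.shuffle(vids)
        n_tr, n_va = int(0.8 * len(vids)), int(0.1 * len(vids))
        train += vids[:n_tr]
        val += vids[n_tr:n_tr + n_va]
        test += vids[n_tr + n_va:]
    return train, val, test

# Hypothetical IDs; real C3VDv2 names encode segment, texture, and artifact.
modes = {f"c1_t{i}": ("debris" if i % 2 else "clean") for i in range(20)}
train_ids, val_ids, test_ids = stratified_video_split(modes)
```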
SimCol3D
SimCol3D divides synthetic data into three meshes (SynCol I–III), with patient-specific meshes for training and test, plus an unseen held-out patient mesh. Real clinical pose-only data augment the synthetic benchmark for pose estimation (Rau et al., 2023).
Clinical Video and Sim2Real Data
Hyper Kvasir provides a large corpus of patient-colonoscopy frames with high mucosal visualization; no explicit depth annotation is available, supporting self-supervised approaches (Daher et al., 18 Feb 2025). Structure-preserving pipelines (Wang et al., 19 Aug 2024) curate 17,500 viewpoint-specific clinical frames (oblique and en-face), augmenting the diversity and challenge for sim-to-real adaptation.
Ruano et al. (2023) provide synthetic videos across five anatomical complexity levels, enabling curriculum learning and performance ablation by scene realism (Ruano et al., 2023).
4. Data Modalities and File Formats
C3VD v1/v2 and SimCol3D datasets offer multimodal annotation for each RGB frame:
- Depth maps: 16-bit grayscale PNG (C3VD) or TIFF (C3VDv2), physical range [0, 100] mm or [0, 25] cm
- Surface normals: 16-bit RGB channels, per-pixel in camera coordinates
- Optical flow: 16-bit color (Δu, Δv)
- Occlusion masks: 8-bit binary, marking pixels occluded or beyond the 100 mm depth range
- 6-DoF camera pose: 4×4 float32 matrix, per frame
- Coverage map: Binary texture, per-mesh-face label denotes observed status
- 3D models: OBJ files per segment, plus “coverage_mesh.obj” with visibility tags
- Intrinsics: Pinhole or fisheye matrices (JSON or text)
- Sequence structure: Directory by colon segment, texture, and artifact; e.g.,
/C3VDv2/RegisteredVideos/colon_segment_texture_v#/
SimCol3D data are provided as frame-level PNGs and per-frame text files for intrinsics and pose (Rau et al., 2023). C3VDv2 delivers high frame rate (∼60 fps) and includes camera calibration data and human-segmented coverage meshes (Golhar et al., 30 Jun 2025).
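Decoding these files amounts to undoing the fixed-point quantization. A minimal sketch, assuming the 16-bit codes span the stated [0, 100] mm range and that pose files hold one flattened 4×4 matrix per line (file names and layouts are illustrative):

```python
import numpy as np
import imageio.v3 as iio

DEPTH_RANGE_MM = 100.0  # stated physical range for C3VD-style depth maps

def load_depth_mm(path: str) -> np.ndarray:
    """Decode a 16-bit depth image (PNG or TIFF) to millimeters."""
    raw = iio.imread(path).astype(np.float64)
    return raw / 65535.0 * DEPTH_RANGE_MM

def load_poses(path: str) -> np.ndarray:
    """Load per-frame poses, assuming one comma-separated 4x4 matrix per line."""
    flat = np.loadtxt(path, delimiter=",")
    return flat.reshape(-1, 4, 4)

depth = load_depth_mm("cecum_t1_a/0000_depth.tiff")  # illustrative path
poses = load_poses("cecum_t1_a/pose.txt")            # illustrative path
```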
5. Evaluation Protocols and Depth-Estimation Metrics
Colonoscopy depth datasets adopt a consensus set of quantitative metrics for benchmarking, where $d_i$ and $\hat{d}_i$ denote ground-truth and predicted depth at pixel $i$ over $N$ valid pixels (a minimal implementation sketch follows this list):
- Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{N}\sum_{i}(d_i - \hat{d}_i)^2}$
- Mean Absolute Error (MAE): $\frac{1}{N}\sum_{i}|d_i - \hat{d}_i|$
- Absolute Relative Error (AbsRel): $\frac{1}{N}\sum_{i}|d_i - \hat{d}_i|/d_i$
- Squared Relative Error (SqRel): $\frac{1}{N}\sum_{i}(d_i - \hat{d}_i)^2/d_i$
- RMSE (log): $\sqrt{\frac{1}{N}\sum_{i}(\log d_i - \log \hat{d}_i)^2}$
- Threshold accuracy ($\delta$-accuracy): fraction of pixels with $\max(d_i/\hat{d}_i,\ \hat{d}_i/d_i) < \delta$, typically for $\delta = 1.25, 1.25^2, 1.25^3$
- Boundary F1 score: Edge precision, recall, and F1 over depth discontinuities at multiple thresholds (Li et al., 20 Dec 2025).
- Frame variance: temporal stability of the per-frame metric scale (Li et al., 20 Dec 2025).
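The metrics above reduce to a few lines of NumPy. A minimal sketch evaluating a prediction against ground truth over valid pixels:

```python
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray, mask=None) -> dict:
    """Standard monocular depth metrics over valid (masked) pixels."""
    if mask is None:
        mask = gt > 0                      # exclude invalid / zero-depth pixels
    d, p = gt[mask], pred[mask]
    ratio = np.maximum(d / p, p / d)
    return {
        "RMSE":    np.sqrt(np.mean((d - p) ** 2)),
        "MAE":     np.mean(np.abs(d - p)),
        "AbsRel":  np.mean(np.abs(d - p) / d),
        "SqRel":   np.mean((d - p) ** 2 / d),
        "RMSElog": np.sqrt(np.mean((np.log(d) - np.log(p)) ** 2)),
        "delta_1": np.mean(ratio < 1.25),
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }
```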
The SimCol3D challenge employs L1 error, median relative error, and RMSE with per-trajectory scale alignment; log-RMSE and $\delta$-accuracy metrics are not used (Rau et al., 2023).
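Per-trajectory scale alignment rescales all predictions in a sequence by a single factor before computing errors; the median-ratio choice below is a common convention and an assumption here, not necessarily the challenge's exact procedure.

```python
import numpy as np

def align_trajectory_scale(gt_frames, pred_frames):
    """Rescale every predicted depth map in a trajectory by one global factor."""
    gt = np.concatenate([g.ravel() for g in gt_frames])
    pred = np.concatenate([p.ravel() for p in pred_frames])
    scale = np.median(gt) / np.median(pred)  # one scale per trajectory
    return [p * scale for p in pred_frames]
```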
6. Challenges, Limitations, and Recommendations
Limitations shared across datasets include:
- Phantom datasets lack dynamic tissue deformation, peristalsis, and fluids; only C3VDv2 includes deformation and debris challenges, and its deformation videos carry no per-pixel GT (Golhar et al., 30 Jun 2025).
- Synthetic data: domain gap to real tissue (texture, illumination, specular behavior) remains significant. Sim2real style transfer and adversarial domain adaptation are recommended to close this gap (Wang et al., 19 Aug 2024, Daher et al., 18 Feb 2025).
- Clinical data: Full in vivo, per-pixel depth annotation is currently infeasible; datasets like Hyper Kvasir enable self-supervised evaluation but do not support quantitative validation (Daher et al., 18 Feb 2025).
- Optical diversity: Wide fields of view (170° fisheye lenses) and variable illumination induce distortion; all major datasets employ undistortion or explicit per-acquisition calibration (Bobrow et al., 2022, Golhar et al., 30 Jun 2025).
Practical recommendations include:
- Training on diverse synthetic/phantom sources (C3VDv2, SimCol3D, Ruano), validating on standard splits and challenging scenarios (debris, en-face, high-speed).
- Incorporating additional tasks (e.g., normals, optical flow, pose) or geometric losses (e.g., cross-task consistency, shape-from-shading) to regularize learning under low-texture, specular, or occluded regimes (Solano et al., 2023); a minimal depth–normal consistency sketch follows this list.
- Applying domain adaptation or sim2real pipelines (e.g., structure-preserving CycleGAN) to transfer synthetic depth supervision to clinical video, using curated real-scene targets (oblique, en-face) for improved sim2real generalization (Wang et al., 19 Aug 2024).
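As one example of such a geometric regularizer, the sketch below derives normals from finite differences of predicted depth and penalizes disagreement with a predicted normal map. It is a generic cross-task consistency term that ignores camera intrinsics for brevity, not the exact loss of any cited work.

```python
import torch
import torch.nn.functional as F

def depth_normal_consistency(depth: torch.Tensor, normals: torch.Tensor) -> torch.Tensor:
    """Cosine penalty between normals implied by depth gradients and predicted
    normals. depth: (B, 1, H, W); normals: (B, 3, H, W)."""
    dzdx = F.pad(depth[..., 1:] - depth[..., :-1], (0, 1, 0, 0))        # d(depth)/dx
    dzdy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))  # d(depth)/dy
    n_from_depth = F.normalize(
        torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1), dim=1)
    cos = (n_from_depth * F.normalize(normals, dim=1)).sum(dim=1)
    return (1.0 - cos).mean()  # 0 when the two normal fields agree

# Dummy tensors illustrate the expected shapes.
loss = depth_normal_consistency(torch.rand(2, 1, 64, 64), torch.rand(2, 3, 64, 64))
```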
7. Access, Licensing, and Community Use
- C3VD and C3VDv2: Freely distributed for research under a CC BY-NC-SA 4.0 license. Public release includes dataset, ground-truth, meshes, code, and fabrication protocols at https://durr.jhu.edu/C3VD and the project webpage (Golhar et al., 30 Jun 2025).
- SimCol3D: Data, evaluation scripts, and challenge splits are available at https://www.ucl.ac.uk/interventional-surgical-sciences/simcol3d-data and the GitHub repository https://github.com/anitarau/simcol (Rau et al., 2023).
- Clinical video sequences (oblique/en-face): Publicly released for non-human-subjects research at https://endoscopography.web.unc.edu (Wang et al., 19 Aug 2024).
- Ruano synthetic dataset: Released for research by CIM@LAB, Universidad Nacional de Colombia (Ruano et al., 2023).
These datasets collectively provide a comprehensive, multi-modal foundation for the development and benchmarking of monocular depth, 3D reconstruction, and navigation algorithms in colonoscopy. The ongoing incorporation of clinical realism (artifacts, deformation, domain adaptation) and explicit depth/geometry annotation continues to advance the rigor and utility of these benchmarks across the computer vision and gastroenterology communities.