EndoNeRF: Neural Rendering in Endoscopic Surgery
- EndoNeRF is a neural rendering framework that reconstructs deformable endoscopic scenes by integrating dynamic radiance fields with deformation modeling.
- It employs a canonical radiance field alongside a deformation MLP to capture nonrigid motion and achieve state-of-the-art photometric and geometric performance.
- The framework supports stereo, monocular, and untracked-camera settings using domain-specific strategies, such as tool mask–guided ray selection, for enhanced accuracy.
EndoNeRF refers to a class of neural rendering frameworks designed for high-fidelity reconstruction of deformable tissue from endoscopic video. By integrating dynamic neural radiance fields (NeRFs) with domain-specific priors and sampling strategies, EndoNeRF enables volumetric 3D modeling of soft tissue dynamics under challenging surgical conditions, including monocular viewpoints, large nonrigid deformations, and tool occlusions. First introduced for stereo robotic surgery scenes, it has since been extended by variants supporting monocular and untracked-camera settings, making EndoNeRF central to data-driven surgical simulation, intraoperative guidance, and medical image computing (Wang et al., 2022, Saha et al., 2023, Wang et al., 2024).
1. Core Architectural Principles
EndoNeRF leverages dynamic neural radiance fields—implicit scene representations parameterized by multilayer perceptrons (MLPs)—to encode volumetric density and RGB color as a function of spatial position, viewing direction, and time. The canonical radiance field MLP models the static tissue template, while a separate deformation/displacement MLP maps each 3D point and time to a canonical configuration, allowing for efficient modeling of nonrigid motion and topology change.
Mathematically, the system comprises the following modules (Wang et al., 2022, Saha et al., 2023):
- Canonical Radiance Field: Given by
$$F_{\Theta} : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma),$$
with volume density $\sigma$ and view-dependent color $\mathbf{c}$.
- Deformation Field: Models nonrigid motion as
$$\mathbf{x}' = \mathbf{x} + G_{\Phi}(\mathbf{x}, t),$$
where $G_{\Phi}$ is the time-dependent displacement MLP that maps observed points to the canonical configuration.
- Volume Rendering: Renders color along rays using
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} w_i\,\mathbf{c}_i, \qquad w_i = T_i\left(1 - e^{-\sigma_i \delta_i}\right), \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big),$$
with weights $w_i$ derived from accumulated opacity via alpha compositing ($\delta_i$ denotes the spacing between adjacent ray samples).
These architectures are trained to predict both photometric appearance and geometric depth based on multi-frame endoscopic imagery, using positional encoding of spatial and temporal inputs.
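As a concrete illustration of this two-MLP design, the following is a minimal PyTorch sketch, not the published implementation: layer widths, positional-encoding frequency counts, and the exact warping convention are assumptions for exposition.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """NeRF-style encoding: concatenate x with sin/cos at doubling frequencies."""
    feats = [x]
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * torch.pi * x),
                  torch.cos((2.0 ** k) * torch.pi * x)]
    return torch.cat(feats, dim=-1)

class DeformationField(nn.Module):
    """G_Phi: (x, t) -> displacement warping observed points to canonical space."""
    def __init__(self, x_freqs=10, t_freqs=6, width=128):
        super().__init__()
        in_dim = 3 * (1 + 2 * x_freqs) + 1 * (1 + 2 * t_freqs)
        self.x_freqs, self.t_freqs = x_freqs, t_freqs
        self.mlp = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                                 nn.Linear(width, width), nn.ReLU(),
                                 nn.Linear(width, 3))

    def forward(self, x, t):
        h = torch.cat([positional_encoding(x, self.x_freqs),
                       positional_encoding(t, self.t_freqs)], dim=-1)
        return self.mlp(h)

class CanonicalRadianceField(nn.Module):
    """F_Theta: (x, d) -> (sigma, c) for the static canonical template."""
    def __init__(self, x_freqs=10, d_freqs=4, width=256):
        super().__init__()
        x_dim, d_dim = 3 * (1 + 2 * x_freqs), 3 * (1 + 2 * d_freqs)
        self.x_freqs, self.d_freqs = x_freqs, d_freqs
        self.trunk = nn.Sequential(nn.Linear(x_dim, width), nn.ReLU(),
                                   nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)
        self.color_head = nn.Sequential(nn.Linear(width + d_dim, width // 2),
                                        nn.ReLU(), nn.Linear(width // 2, 3),
                                        nn.Sigmoid())

    def forward(self, x, d):
        h = self.trunk(positional_encoding(x, self.x_freqs))
        sigma = torch.relu(self.sigma_head(h))
        c = self.color_head(torch.cat([h, positional_encoding(d, self.d_freqs)],
                                      dim=-1))
        return sigma, c

def render_rays(field, deform, rays_o, rays_d, t, z_vals):
    """Alpha-composite samples along rays at time t (rays_o/rays_d: (R, 3))."""
    x = rays_o[:, None, :] + z_vals[..., None] * rays_d[:, None, :]  # (R, S, 3)
    t_in = t.expand(x.shape[:-1]).unsqueeze(-1)                      # (R, S, 1)
    x_canon = x + deform(x, t_in)                  # warp into canonical space
    d = rays_d[:, None, :].expand_as(x)
    sigma, c = field(x_canon.reshape(-1, 3), d.reshape(-1, 3))
    sigma, c = sigma.view(x.shape[0], -1), c.view(x.shape)
    deltas = z_vals[..., 1:] - z_vals[..., :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], -1)
    alpha = 1.0 - torch.exp(-sigma * deltas)       # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], -1), -1)[..., :-1]
    weights = alpha * trans                        # alpha-compositing weights w_i
    rgb = (weights[..., None] * c).sum(dim=-2)     # rendered color C(r)
    depth = (weights * z_vals).sum(dim=-1)         # expected ray depth D(r)
    return rgb, depth, weights
```

Note that `render_rays` also returns the expected depth along each ray, which is the quantity acted on by the depth supervision described below.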
2. Dataset, Inputs, and Preprocessing
EndoNeRF pipelines operate on time-sequenced endoscopic videos, with system variants supporting both stereo and monocular input regimes. Key inputs include (Wang et al., 2022, Saha et al., 2023):
- Stereo Pair Frames (if available), enabling depth cue extraction.
- Per-frame Tool Masks for identifying and excluding surgical instruments from learning and rendering; generated via manual labeling or segmentation networks.
- Depth Maps estimated via a stereo matching network (e.g., STTR-light), providing surface priors even in challenging viewing conditions.
- Camera Intrinsics and, in some systems, extrinsics; certain variants (see BASED below) support unknown camera poses via joint pose optimization.
Preprocessing steps include camera calibration, mask preparation, and frame selection to maximize tissue visibility and deformation coverage.
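As one concrete piece of this pipeline, the sketch below generates per-pixel rays from calibrated intrinsics; the OpenCV-style pinhole convention and the fixed-camera simplification (identity pose, as in single-viewpoint stereo capture) are assumptions for exposition.

```python
import torch

def generate_rays(H, W, K, c2w):
    """Per-pixel rays from a 3x3 intrinsic matrix K and a 3x4 camera-to-world
    pose c2w (OpenCV-style +z-forward convention assumed). For a fixed
    single-viewpoint stereo setup, c2w can simply be [I | 0]."""
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs = torch.stack([(i - K[0, 2]) / K[0, 0],      # x: (u - cx) / fx
                        (j - K[1, 2]) / K[1, 1],      # y: (v - cy) / fy
                        torch.ones_like(i)], dim=-1)  # z: unit depth
    rays_d = dirs @ c2w[:3, :3].T                     # rotate into world frame
    rays_o = c2w[:3, 3].expand(rays_d.shape)          # shared camera origin
    return rays_o, rays_d
```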
3. Training Methodology and Loss Functions
EndoNeRF systems adopt volumetric rendering supervision, with a composite loss integrating appearance and geometry priors (Wang et al., 2022):
- Photometric Supervision: $\mathcal{L}_{\text{photo}} = \sum_{\mathbf{r}} \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2$, applied on sampled rays.
- Depth Supervision: $\mathcal{L}_{\text{depth}} = \sum_{\mathbf{r}} \big| \hat{D}(\mathbf{r}) - D(\mathbf{r}) \big|$, encouraging rendered depth $\hat{D}$ to match the stereo-derived surface prior $D$.
- Correspondence Loss (in bundle-adjusting variants): enforces spatiotemporal consistency of 3D points across frames.
- Statistical Depth Refinement: After a set number of training steps, outlier pixels in the stereo depth maps (identified by their residuals to the current rendered depth) are replaced with smoothed synthetic depths, mitigating overfitting to stereo artifacts.
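A hedged sketch of such a composite objective over a batch of sampled rays is given below; the specific norms and weighting are illustrative assumptions, and the published losses may differ in detail.

```python
import torch

def composite_loss(pred_rgb, gt_rgb, pred_depth, prior_depth,
                   tissue_mask, depth_weight=1.0):
    """Photometric + depth-prior loss over a ray batch.

    tissue_mask is 1 where a ray hits tissue (tool pixels are already excluded
    by mask-guided ray selection); prior_depth is the stereo-derived depth."""
    m = tissue_mask.float()
    n = m.sum().clamp(min=1.0)
    photo = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1)  # squared color error
    depth = torch.abs(pred_depth - prior_depth)     # L1 error vs. stereo prior
    return (m * photo).sum() / n + depth_weight * (m * depth).sum() / n
```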
Rays for supervision are selected by importance sampling over tissue regions (using tool masks), and depth samples along each ray are focused near the stereo-estimated surface using a Gaussian transfer function:
$$p(t \mid \mathbf{r}) \propto \exp\!\left(-\frac{\big(t - D(\mathbf{r})\big)^2}{2\delta^2}\right),$$
with $\delta$ controlling the concentration of samples around the predicted depth $D(\mathbf{r})$.
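In code, this depth-cued sampling amounts to drawing per-ray z-values from a Gaussian centered on the stereo prior; the clamping and hyperparameter choices below are illustrative.

```python
import torch

def depth_cued_samples(prior_depth, n_samples=64, delta=2.0, near=0.0, far=100.0):
    """Draw per-ray z-values concentrated around the stereo depth prior D(r):
    t_i ~ N(D(r), delta^2), clamped to [near, far] and sorted for compositing."""
    rays = prior_depth.shape[0]
    t = prior_depth[:, None] + delta * torch.randn(rays, n_samples,
                                                   device=prior_depth.device)
    return torch.sort(t.clamp(near, far), dim=-1).values
```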
Optimization is conducted using the Adam optimizer, with a separate model trained per sequence over a large number of gradient steps.
4. Ray Selection and Domain-specific Strategies
Challenges specific to surgical environments—frequent tool occlusions, specularity, severe nonrigid motion—are addressed in EndoNeRF through several algorithmic strategies (Wang et al., 2022):
- Tool Mask–Guided Ray Casting: Importance sampling over non-tool pixels ensures that rays through instruments do not contribute to loss computation, preventing erroneous surface modeling over occlusions.
- Stereo Depth–Cueing Ray Marching: Ray samples are distributed according to depth priors, enabling robust geometry estimation from a single endoscopic vantage.
- Iterative Depth Refinement: Correction of unreliable or noisy stereo disparities via residual-driven updates to the depth maps during training.
These domain-tailored mechanisms are critical for reconstructing watertight, spatially coherent surfaces under challenging surgical conditions.
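One plausible realization of tool mask–guided ray casting is sketched below: a per-frame sampling-probability map that zeroes out tool pixels and boosts pixels that are frequently occluded across the sequence. The boosting formula is an illustrative assumption, not the paper's exact scheme.

```python
import torch

def mask_guided_ray_probs(tool_masks):
    """Per-frame ray-sampling probabilities from tool masks.

    tool_masks: (T, H, W) bool, True where an instrument occludes tissue.
    Tool pixels get zero probability in their own frame; pixels occluded in
    many frames are boosted when visible, so the few frames that do observe
    them contribute proportionally more rays."""
    masks = tool_masks.float()                           # 1 = tool, 0 = tissue
    occlusion_freq = masks.mean(dim=0, keepdim=True)     # (1, H, W)
    importance = (1.0 + occlusion_freq) * (1.0 - masks)  # zero on tool pixels
    norm = importance.flatten(1).sum(dim=-1).clamp(min=1e-8)
    return importance / norm[:, None, None]              # each frame sums to 1

# Hypothetical usage: sample a ray batch of 2048 pixels for frame k.
# idx = torch.multinomial(probs[k].flatten(), num_samples=2048)
```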
5. Evaluation and Quantitative Performance
Performance of EndoNeRF is validated on in-house robotic surgery sequences, with PSNR, SSIM, and LPIPS evaluated on held-out video frames (Wang et al., 2022). In these comparisons, EndoNeRF outperforms the E-DSSR baseline, a SLAM-plus-learned-depth reconstruction pipeline, on all three metrics.
Ablations confirm that excluding the deformation network (“Ours w/o D”) results in performance drops (PSNR $24.09$, SSIM $0.849$, LPIPS $0.230$), underscoring the importance of explicit modeling of nonrigid motion. Qualitatively, EndoNeRF reconstructs continuous surfaces under traction, occlusion, and tissue cutting, where prior SLAM-based or static NeRF pipelines exhibit holes and noise.
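These metrics can be reproduced with standard packages; the sketch below assumes a recent scikit-image and the lpips package.

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # learned perceptual metric

def eval_frame(pred, gt):
    """PSNR/SSIM/LPIPS for one frame; pred and gt are HxWx3 floats in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()  # LPIPS wants NCHW in [-1, 1]
    return psnr, ssim, lp
```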
6. Extensions to Untracked and Monocular Regimes
Subsequent generalizations, such as the BASED framework, address scenarios where camera extrinsics are unknown and monocular endoscopy is required (Saha et al., 2023). These adopt:
- Bundle-Adjusting Camera Pose Optimization: Per-frame SE(3) pose parameters are learned jointly with radiance/deformation fields.
- Multi-view Dynamic Correspondence Losses: Enforced with learned optical flow networks to maintain consistency of deformation fields across frames and time.
- Flexible Depth Supervision: Support for both depth-provided and fully monocular pipelines.
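The core bundle-adjusting idea, per-frame se(3) parameters updated by the same gradient descent as the field weights, can be sketched as follows; the axis-angle parameterization is an assumption here, and BASED's exact pose representation may differ.

```python
import torch
import torch.nn as nn

class LearnablePoses(nn.Module):
    """Per-frame se(3) parameters refined jointly with the radiance field."""
    def __init__(self, num_frames):
        super().__init__()
        # One 6-vector per frame: axis-angle rotation (3) + translation (3).
        self.params = nn.Parameter(torch.zeros(num_frames, 6))

    def forward(self, frame_idx):
        w, t = self.params[frame_idx, :3], self.params[frame_idx, 3:]
        theta = w.norm().clamp(min=1e-8)
        k = w / theta
        z = torch.zeros((), device=w.device)
        K = torch.stack([torch.stack([z, -k[2], k[1]]),
                         torch.stack([k[2], z, -k[0]]),
                         torch.stack([-k[1], k[0], z])])
        # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        R = (torch.eye(3, device=w.device) + torch.sin(theta) * K
             + (1.0 - torch.cos(theta)) * (K @ K))
        return torch.cat([R, t[:, None]], dim=1)  # 3x4 camera-to-world pose
```

During training, these pose parameters are simply appended to the optimizer's parameter list alongside the radiance and deformation MLPs.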
Comparative evaluations against EndoNeRF and other contemporary methods (e.g., RoDynRF) show that BASED achieves higher PSNR/SSIM and lower LPIPS on datasets with varying tissue dynamics.
7. Applications and Limitations
EndoNeRF underpins applications in data-driven surgical simulation, realistic intraoperative scene reconstruction, and training/assessment for image-guided robotic interventions. Pipeline outputs—volumetric reconstructions and, in some variants, watertight surface meshes—can be used in downstream finite element and material point method simulations, as reported for data-driven soft tissue physics (Wang et al., 2024).
Current limitations include optimization time (hours per sequence), GPU memory demands, strict dependence on high-quality tool masks, and difficulty with specularities and topological change. Potential improvements include hash-grid neural fields for faster convergence, uncertainty-aware automated tool segmentation, and enhanced reflectance modeling.
EndoNeRF offers a principled, extensible framework for high-fidelity modeling of deformable scenes in surgical endoscopy, integrating advances in neural rendering, nonrigid deformation modeling, and surgical robotics. It establishes state-of-the-art performance for reconstructing tissue geometry and appearance from video data, and ongoing research continues to address computational efficiency and robustness under diverse clinical conditions (Wang et al., 2022, Saha et al., 2023, Wang et al., 2024).