EndoNeRF: Neural Rendering in Endoscopic Surgery
- EndoNeRF is a neural rendering framework that reconstructs deformable endoscopic scenes by integrating dynamic radiance fields with deformation modeling.
- It employs a canonical radiance field alongside a deformation MLP to capture nonrigid motion and achieve state-of-the-art photometric and geometric performance.
- The framework supports stereo, monocular, and untracked-camera settings using domain-specific strategies, such as tool mask–guided ray selection, for enhanced accuracy.
EndoNeRF refers to a class of neural rendering frameworks designed for high-fidelity reconstruction of deformable tissue from endoscopic video. By integrating dynamic neural radiance fields (NeRFs) with domain-specific priors and sampling strategies, EndoNeRF enables volumetric 3D modeling of soft tissue dynamics under challenging surgical conditions, including monocular viewpoints, large nonrigid deformations, and tool occlusions. First introduced for stereo robotic surgery scenes, it has since been extended by variants supporting monocular and untracked-camera settings, making EndoNeRF central to data-driven surgical simulation, intraoperative guidance, and medical image computing (Wang et al., 2022, Saha et al., 2023, Wang et al., 2024).
1. Core Architectural Principles
EndoNeRF leverages dynamic neural radiance fields—implicit scene representations parameterized by multilayer perceptrons (MLPs)—to encode volumetric density and RGB color as a function of spatial position, viewing direction, and time. The canonical radiance field MLP models the static tissue template, while a separate deformation/displacement MLP maps each 3D point and time to a canonical configuration, allowing for efficient modeling of nonrigid motion and topology change.
Mathematically, the system comprises the following modules (Wang et al., 2022, Saha et al., 2023):
- Canonical Radiance Field: Given by
$$F_{\Theta} : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma),$$
with volume density $\sigma$ and view-dependent color $\mathbf{c}$.
- Deformation Field: Models nonrigid motion as
$$\mathbf{x}' = \mathbf{x} + G_{\Phi}(\mathbf{x}, t),$$
where $G_{\Phi}$ is the time-dependent displacement MLP that maps observed points to the canonical configuration.
- Volume Rendering: Renders color along rays using
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} w_i\,\mathbf{c}_i, \qquad w_i = T_i\left(1 - e^{-\sigma_i \delta_i}\right), \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big),$$
with weights $w_i$ derived from accumulated opacity via alpha compositing ($\delta_i$ denotes the spacing between adjacent ray samples).
These architectures are trained to predict both photometric appearance and geometric depth based on multi-frame endoscopic imagery, using positional encoding of spatial and temporal inputs.
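As a concrete illustration of this two-MLP design, the following is a minimal PyTorch sketch, not the published implementation: layer widths, positional-encoding frequency counts, and the exact warping convention are assumptions for exposition.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """NeRF-style encoding: concatenate x with sin/cos at doubling frequencies."""
    feats = [x]
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * torch.pi * x),
                  torch.cos((2.0 ** k) * torch.pi * x)]
    return torch.cat(feats, dim=-1)

class DeformationField(nn.Module):
    """G_Phi: (x, t) -> displacement warping observed points to canonical space."""
    def __init__(self, x_freqs=10, t_freqs=6, width=128):
        super().__init__()
        in_dim = 3 * (1 + 2 * x_freqs) + 1 * (1 + 2 * t_freqs)
        self.x_freqs, self.t_freqs = x_freqs, t_freqs
        self.mlp = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                                 nn.Linear(width, width), nn.ReLU(),
                                 nn.Linear(width, 3))

    def forward(self, x, t):
        h = torch.cat([positional_encoding(x, self.x_freqs),
                       positional_encoding(t, self.t_freqs)], dim=-1)
        return self.mlp(h)

class CanonicalRadianceField(nn.Module):
    """F_Theta: (x, d) -> (sigma, c) for the static canonical template."""
    def __init__(self, x_freqs=10, d_freqs=4, width=256):
        super().__init__()
        x_dim, d_dim = 3 * (1 + 2 * x_freqs), 3 * (1 + 2 * d_freqs)
        self.x_freqs, self.d_freqs = x_freqs, d_freqs
        self.trunk = nn.Sequential(nn.Linear(x_dim, width), nn.ReLU(),
                                   nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)
        self.color_head = nn.Sequential(nn.Linear(width + d_dim, width // 2),
                                        nn.ReLU(), nn.Linear(width // 2, 3),
                                        nn.Sigmoid())

    def forward(self, x, d):
        h = self.trunk(positional_encoding(x, self.x_freqs))
        sigma = torch.relu(self.sigma_head(h))
        c = self.color_head(torch.cat([h, positional_encoding(d, self.d_freqs)],
                                      dim=-1))
        return sigma, c

def render_rays(field, deform, rays_o, rays_d, t, z_vals):
    """Alpha-composite samples along rays at time t (rays_o/rays_d: (R, 3))."""
    x = rays_o[:, None, :] + z_vals[..., None] * rays_d[:, None, :]  # (R, S, 3)
    t_in = t.expand(x.shape[:-1]).unsqueeze(-1)                      # (R, S, 1)
    x_canon = x + deform(x, t_in)                  # warp into canonical space
    d = rays_d[:, None, :].expand_as(x)
    sigma, c = field(x_canon.reshape(-1, 3), d.reshape(-1, 3))
    sigma, c = sigma.view(x.shape[0], -1), c.view(x.shape)
    deltas = z_vals[..., 1:] - z_vals[..., :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], -1)
    alpha = 1.0 - torch.exp(-sigma * deltas)       # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], -1), -1)[..., :-1]
    weights = alpha * trans                        # alpha-compositing weights w_i
    rgb = (weights[..., None] * c).sum(dim=-2)     # rendered color C(r)
    depth = (weights * z_vals).sum(dim=-1)         # expected ray depth D(r)
    return rgb, depth, weights
```

Note that `render_rays` also returns the expected depth along each ray, which is the quantity acted on by the depth supervision described below.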
2. Dataset, Inputs, and Preprocessing
EndoNeRF pipelines operate on time-sequenced endoscopic videos, with system variants supporting both stereo and monocular input regimes. Key inputs include (Wang et al., 2022, Saha et al., 2023):
- Stereo Pair Frames (if available), enabling depth cue extraction.
- Per-frame Tool Masks for identifying and excluding surgical instruments from learning and rendering; generated via manual labeling or segmentation networks.
- Depth Maps estimated via a stereo matching network (e.g., STTR-light), providing surface priors even in challenging viewing conditions.
- Camera Intrinsics and, in some systems, extrinsics; certain variants (see BASED below) support unknown camera poses via joint pose optimization.
Preprocessing steps include camera calibration, mask preparation, and frame selection to maximize tissue visibility and deformation coverage.
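As one concrete piece of this pipeline, the sketch below generates per-pixel rays from calibrated intrinsics; the OpenCV-style pinhole convention and the fixed-camera simplification (identity pose, as in single-viewpoint stereo capture) are assumptions for exposition.

```python
import torch

def generate_rays(H, W, K, c2w):
    """Per-pixel rays from a 3x3 intrinsic matrix K and a 3x4 camera-to-world
    pose c2w (OpenCV-style +z-forward convention assumed). For a fixed
    single-viewpoint stereo setup, c2w can simply be [I | 0]."""
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs = torch.stack([(i - K[0, 2]) / K[0, 0],      # x: (u - cx) / fx
                        (j - K[1, 2]) / K[1, 1],      # y: (v - cy) / fy
                        torch.ones_like(i)], dim=-1)  # z: unit depth
    rays_d = dirs @ c2w[:3, :3].T                     # rotate into world frame
    rays_o = c2w[:3, 3].expand(rays_d.shape)          # shared camera origin
    return rays_o, rays_d
```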
3. Training Methodology and Loss Functions
EndoNeRF systems adopt volumetric rendering supervision, with a composite loss integrating appearance and geometry priors (Wang et al., 2022):
- Photometric Supervision: $\mathcal{L}_{\text{photo}} = \sum_{\mathbf{r}} \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2$, applied on sampled rays.
- Depth Supervision: $\mathcal{L}_{\text{depth}} = \sum_{\mathbf{r}} \big| \hat{D}(\mathbf{r}) - D(\mathbf{r}) \big|$, encouraging rendered depth $\hat{D}$ to match the stereo-derived surface prior $D$.
- Correspondence Loss (in bundle-adjusting variants): enforces spatiotemporal consistency of 3D points across frames.
- Statistical Depth Refinement: After a set number of training steps, outlier pixels in the stereo depth maps (identified by their residuals to the current rendered depth) are replaced with smoothed synthetic depths, mitigating overfitting to stereo artifacts.
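A hedged sketch of such a composite objective over a batch of sampled rays is given below; the specific norms and weighting are illustrative assumptions, and the published losses may differ in detail.

```python
import torch

def composite_loss(pred_rgb, gt_rgb, pred_depth, prior_depth,
                   tissue_mask, depth_weight=1.0):
    """Photometric + depth-prior loss over a ray batch.

    tissue_mask is 1 where a ray hits tissue (tool pixels are already excluded
    by mask-guided ray selection); prior_depth is the stereo-derived depth."""
    m = tissue_mask.float()
    n = m.sum().clamp(min=1.0)
    photo = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1)  # squared color error
    depth = torch.abs(pred_depth - prior_depth)     # L1 error vs. stereo prior
    return (m * photo).sum() / n + depth_weight * (m * depth).sum() / n
```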
Rays for supervision are selected by importance sampling over tissue regions (using tool masks), and depth samples along each ray are focused near the stereo-estimated surface using a Gaussian transfer function:
$$p(t \mid \mathbf{r}) \propto \exp\!\left(-\frac{\big(t - D(\mathbf{r})\big)^2}{2\delta^2}\right),$$
with $\delta$ controlling the concentration of samples around the predicted depth $D(\mathbf{r})$.
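In code, this depth-cued sampling amounts to drawing per-ray z-values from a Gaussian centered on the stereo prior; the clamping and hyperparameter choices below are illustrative.

```python
import torch

def depth_cued_samples(prior_depth, n_samples=64, delta=2.0, near=0.0, far=100.0):
    """Draw per-ray z-values concentrated around the stereo depth prior D(r):
    t_i ~ N(D(r), delta^2), clamped to [near, far] and sorted for compositing."""
    rays = prior_depth.shape[0]
    t = prior_depth[:, None] + delta * torch.randn(rays, n_samples,
                                                   device=prior_depth.device)
    return torch.sort(t.clamp(near, far), dim=-1).values
```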
Optimization is conducted using the Adam optimizer, with a separate model trained per sequence over a large number of gradient steps.
4. Ray Selection and Domain-specific Strategies
Challenges specific to surgical environments—frequent tool occlusions, specularity, severe nonrigid motion—are addressed in EndoNeRF through several algorithmic strategies (Wang et al., 2022):
- Tool Mask–Guided Ray Casting: Importance sampling over non-tool pixels ensures that rays through instruments do not contribute to loss computation, preventing erroneous surface modeling over occlusions.
- Stereo Depth–Cueing Ray Marching: Ray samples are distributed according to depth priors, enabling robust geometry estimation from a single endoscopic vantage.
- Iterative Depth Refinement: Correction of unreliable or noisy stereo disparities via residual-driven updates to the depth maps during training.
These domain-tailored mechanisms are critical for reconstructing watertight, spatially coherent surfaces under challenging surgical conditions.
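One plausible realization of tool mask–guided ray casting is sketched below: a per-frame sampling-probability map that zeroes out tool pixels and boosts pixels that are frequently occluded across the sequence. The boosting formula is an illustrative assumption, not the paper's exact scheme.

```python
import torch

def mask_guided_ray_probs(tool_masks):
    """Per-frame ray-sampling probabilities from tool masks.

    tool_masks: (T, H, W) bool, True where an instrument occludes tissue.
    Tool pixels get zero probability in their own frame; pixels occluded in
    many frames are boosted when visible, so the few frames that do observe
    them contribute proportionally more rays."""
    masks = tool_masks.float()                           # 1 = tool, 0 = tissue
    occlusion_freq = masks.mean(dim=0, keepdim=True)     # (1, H, W)
    importance = (1.0 + occlusion_freq) * (1.0 - masks)  # zero on tool pixels
    norm = importance.flatten(1).sum(dim=-1).clamp(min=1e-8)
    return importance / norm[:, None, None]              # each frame sums to 1

# Hypothetical usage: sample a ray batch of 2048 pixels for frame k.
# idx = torch.multinomial(probs[k].flatten(), num_samples=2048)
```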
5. Evaluation and Quantitative Performance
Performance of EndoNeRF is validated on in-house robotic surgery sequences, with PSNR, SSIM, and LPIPS evaluated on held-out video frames (Wang et al., 2022). In these comparisons, EndoNeRF outperforms the E-DSSR baseline, a SLAM-plus-learned-depth reconstruction pipeline, on all three metrics.
Ablations confirm that excluding the deformation network (“Ours w/o D”) results in performance drops (PSNR $24.09$, SSIM $0.849$, LPIPS $0.230$), underscoring the importance of explicit modeling of nonrigid motion. Qualitatively, EndoNeRF reconstructs continuous surfaces under traction, occlusion, and tissue cutting, where prior SLAM-based or static NeRF pipelines exhibit holes and noise.
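These metrics can be reproduced with standard packages; the sketch below assumes a recent scikit-image and the lpips package.

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # learned perceptual metric

def eval_frame(pred, gt):
    """PSNR/SSIM/LPIPS for one frame; pred and gt are HxWx3 floats in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()  # LPIPS wants NCHW in [-1, 1]
    return psnr, ssim, lp
```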
6. Extensions to Untracked and Monocular Regimes
Subsequent generalizations, such as the BASED framework, address scenarios where camera extrinsics are unknown and monocular endoscopy is required (Saha et al., 2023). These adopt:
- Bundle-Adjusting Camera Pose Optimization: Per-frame SE(3) pose parameters are learned jointly with radiance/deformation fields.
- Multi-view Dynamic Correspondence Losses: Enforced with learned optical flow networks to maintain consistency of deformation fields across frames and time.
- Flexible Depth Supervision: Support for both depth-provided and fully monocular pipelines.
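The core bundle-adjusting idea, per-frame se(3) parameters updated by the same gradient descent as the field weights, can be sketched as follows; the axis-angle parameterization is an assumption here, and BASED's exact pose representation may differ.

```python
import torch
import torch.nn as nn

class LearnablePoses(nn.Module):
    """Per-frame se(3) parameters refined jointly with the radiance field."""
    def __init__(self, num_frames):
        super().__init__()
        # One 6-vector per frame: axis-angle rotation (3) + translation (3).
        self.params = nn.Parameter(torch.zeros(num_frames, 6))

    def forward(self, frame_idx):
        w, t = self.params[frame_idx, :3], self.params[frame_idx, 3:]
        theta = w.norm().clamp(min=1e-8)
        k = w / theta
        z = torch.zeros((), device=w.device)
        K = torch.stack([torch.stack([z, -k[2], k[1]]),
                         torch.stack([k[2], z, -k[0]]),
                         torch.stack([-k[1], k[0], z])])
        # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        R = (torch.eye(3, device=w.device) + torch.sin(theta) * K
             + (1.0 - torch.cos(theta)) * (K @ K))
        return torch.cat([R, t[:, None]], dim=1)  # 3x4 camera-to-world pose
```

During training, these pose parameters are simply appended to the optimizer's parameter list alongside the radiance and deformation MLPs.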
Comparative evaluations against EndoNeRF and other contemporary methods (e.g., RoDynRF) show that BASED achieves higher PSNR/SSIM and lower LPIPS on datasets with varying tissue dynamics.
7. Applications and Limitations
EndoNeRF underpins applications in data-driven surgical simulation, realistic intraoperative scene reconstruction, and training/assessment for image-guided robotic interventions. Pipeline outputs—volumetric reconstructions and, in some variants, watertight surface meshes—can be used in downstream finite element and material point method simulations, as reported for data-driven soft tissue physics (Wang et al., 2024).
Current limitations include optimization time (hours per sequence), GPU memory demands, strict dependence on high-quality tool masks, and difficulty with specularities and topological change. Potential improvements include hash-grid neural fields for faster convergence, uncertainty-aware automated tool segmentation, and enhanced reflectance modeling.
EndoNeRF offers a principled, extensible framework for high-fidelity modeling of deformable scenes in surgical endoscopy, integrating advances in neural rendering, nonrigid deformation modeling, and surgical robotics. It establishes state-of-the-art performance for reconstructing tissue geometry and appearance from video data, and ongoing research continues to address computational efficiency and robustness under diverse clinical conditions (Wang et al., 2022, Saha et al., 2023, Wang et al., 2024).