EndoNeRF Pulling Benchmark
- EndoNeRF Pulling Benchmark is a standardized evaluation protocol for 3D reconstruction in dynamic, minimally invasive surgical scenes.
- It assesses reconstruction fidelity via peak signal-to-noise ratio (PSNR) and real-time performance via a rendering-rate threshold (> 60 FPS), ensuring intra-operative applicability.
- The benchmark incorporates advanced techniques such as Gaussian-based neural representations and HexPlane feature grids to address non-rigid soft-tissue deformations and occlusions.
The EndoNeRF Pulling Benchmark is a standardized evaluation protocol for 3D reconstruction pipelines tailored to minimally invasive surgical scenes characterized by dynamic, non-rigid soft-tissue deformation and significant occlusion due to surgical tool interaction. Originally derived from the “pulling” sequence of the EndoNeRF dataset (Wang et al. 2022, 2024), it serves as a critical testbed for assessing the speed, accuracy, and practicality of real-time volumetric modeling systems in surgical contexts, with an emphasis on intra-operative deployability and commercial compatibility (Nath et al., 2 Dec 2025).
1. Benchmark Dataset and Protocol
The EndoNeRF Pulling Benchmark employs a specific dataset segment featuring 63 temporally ordered monocular or stereo endoscopic frames capturing soft tissue being dynamically manipulated by surgical instruments. Each frame provides calibrated camera intrinsics and extrinsics, per-frame RGB images, stereo-derived depth maps, and binary segmentation masks distinguishing tools from tissue.
The principal reconstruction task targets scenes exhibiting dynamic, non-rigid soft-tissue deformation due to instrument pulling, with explicit tool–tissue interactions and extensive occlusions. For each sequence, all 63 frames, along with their associated depth and segmentation masks, are used for model training. For evaluation, the protocol specifies rendering novel views at held-out frames (using the same camera poses) and computing peak signal-to-noise ratio (PSNR) in decibels, comparing rendered RGB to ground truth. In addition, real-time capability is defined operationally as the ability to render at greater than 60 frames per second (FPS) on the specified target hardware.
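The protocol's PSNR metric can be computed directly from rendered and ground-truth RGB frames. A minimal NumPy sketch (the function name and normalization are illustrative, not taken from the benchmark code):

```python
import numpy as np

def psnr(rendered: np.ndarray, ground_truth: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two RGB images scaled to [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform 0.1 error on a normalized image gives MSE = 0.01, i.e. 20 dB.
gt = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
```

For 8-bit images the same formula applies with `max_val=255`; the benchmark compares rendered novel views against held-out ground-truth frames at identical camera poses.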
2. Quantitative Results and Comparative Analysis
Performance on the EndoNeRF Pulling Benchmark is measured by reconstruction fidelity (PSNR) and real-time throughput (FPS). Across published methods, results are summarized as follows:
| Method | Training Time per Scene | PSNR (Pulling) [dB] | FPS > 60 | License |
|---|---|---|---|---|
| EndoNeRF (NeRF) | ~6 h | 35.43 | ✗ | Commercial |
| EndoSurf | ~7 h | 34.91 | ✗ | MIT |
| LerPlane-32k | 8 min | 31.77 | ✗ | BSD-3 |
| Endo-4DGS | 4 min | 37.85 | ✓ | Non-Commercial derivative |
| EndoGaussian | ≤2 min | 37.848 | ✓ | Non-Commercial derivative |
| Deform3DGS | ≈1 min | 37.90 | ✓ | Non-Commercial derivative |
| Surgical Gaussian Surfels | 45 min | 39.06 | ✓ | Non-Commercial derivative |
| G-SHARP (Apache-2.0) | 2 min | 37.98 | ✓ | Commercial (Apache-2.0) |
G-SHARP achieves 37.98 dB PSNR after 2 minutes of training per scene and reliably exceeds 60 FPS, confirming suitability for intra-operative deployment while maintaining full commercial license compatibility. Additional benchmarks indicate that during training, G-SHARP yields 24–27 dB PSNR in full-scene mode and 17–19 dB for tissue-only masked evaluation across EndoNeRF scenes (Nath et al., 2 Dec 2025).
3. Algorithmic Components of State-of-the-Art Pipelines
The G-SHARP pipeline exemplifies algorithmic advances in Gaussian-based neural representations:
- GSplat Differentiable Gaussian Rasterizer: Each 3D Gaussian primitive is parameterized by a mean $\mu_i \in \mathbb{R}^3$, a covariance $\Sigma_i = R_i S_i S_i^\top R_i^\top$ (with rotation $R_i$ and axis scales $S_i$), an opacity $\alpha_i$, and color coefficients $c_i$ (spherical harmonics, up to a fixed degree). Projection to the image plane via the intrinsic matrix and extrinsics yields a 2D mean $\mu'_i$ and covariance $\Sigma'_i$, with image-plane splat weight
$$w_i(x) = \alpha_i \exp\!\left(-\tfrac{1}{2}\,(x - \mu'_i)^\top (\Sigma'_i)^{-1} (x - \mu'_i)\right).$$
Front-to-back compositing of depth-sorted splats yields the pixel color
$$C(x) = \sum_i c_i\, w_i(x) \prod_{j < i} \bigl(1 - w_j(x)\bigr).$$
CUDA-optimized kernel implementations enable efficient multi-view batch rasterization and memory usage.
- Deformable Tissue Reconstruction: Training proceeds in two stages: a coarse geometry initialization (200 iterations, ≈30 s, no deformation) followed by a fine optimization (1,500 iterations, ≈1.5 min) integrating deformation and appearance. The deformation network features a HexPlane grid of features over $(x, y, z, t)$, factorized into six 2D planes, with MLP decoding (8 layers/256 units) predicting per-primitive offsets $(\Delta\mu_i, \Delta r_i, \Delta s_i)$. The deformed parameters update as
$$\mu_i(t) = \mu_i + \Delta\mu_i(t), \qquad r_i(t) = r_i + \Delta r_i(t), \qquad s_i(t) = s_i + \Delta s_i(t).$$
- Composite Loss Function: Fine-stage optimization minimizes a weighted combination of photometric (RGB), depth, and regularization terms; the specific term weights are given in (Nath et al., 2 Dec 2025). Regularization includes time-smoothness, an $\ell_1$ norm, and total-variation (TV) penalties on the spatial and temporal HexPlanes (all with weight 0.01).
- Occlusion Handling: The pipeline distinguishes “tissue-only” (tool-excluded) and “full-scene” (tissue+tool) rendering. Tool masks are systematically enforced during initialization and as supervision. An “invisible mask” aggregates all tool pixels across time, applying a total variation penalty to encourage smooth interpolation in chronically occluded regions.
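The projected splat weight and front-to-back compositing steps above can be sketched for a single pixel in NumPy. This is a toy illustration of the math, not the CUDA rasterizer; all shapes and values are made up:

```python
import numpy as np

def splat_weight(x, mu2d, cov2d, alpha):
    """Gaussian splat weight at pixel x: alpha * exp(-0.5 * d^T Sigma^-1 d)."""
    d = x - mu2d
    return alpha * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

def composite_pixel(x, splats):
    """Front-to-back alpha compositing over depth-sorted (mu2d, cov2d, alpha, color) splats."""
    color = np.zeros(3)
    transmittance = 1.0  # running product of (1 - w_j) over nearer splats
    for mu2d, cov2d, alpha, c in splats:  # assumed sorted near-to-far
        w = splat_weight(x, mu2d, cov2d, alpha)
        color += transmittance * w * np.asarray(c, dtype=float)
        transmittance *= (1.0 - w)
    return color

# A fully opaque splat centered on the pixel contributes its full color,
# and any splat behind it is completely occluded.
x = np.array([0.0, 0.0])
splats = [(np.array([0.0, 0.0]), np.eye(2), 1.0, (1.0, 0.0, 0.0))]
```

The production rasterizer performs the same accumulation per pixel across sorted tiles of splats in parallel CUDA kernels; the transmittance term is what makes the tool-occlusion masking in the benchmark pipelines well defined.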
4. Speed–Accuracy Trade-Offs and Practical Implications
The trade-off between speed and reconstruction fidelity is central in the benchmark. G-SHARP achieves a favorable balance: 2 minutes of training yields 37.98 dB PSNR and real-time rendering above 60 FPS. Methods such as Surgical Gaussian Surfels reach marginally higher fidelity (39.06 dB) at significantly higher compute cost (45 min training). NeRF-based pipelines require up to 6–7 hours for considerably lower fidelity (35 dB PSNR).
Further parameter tuning—such as adjusting Gaussian densification schedules, HexPlane resolutions, MLP size, or loss-term weights—enables trading a few dB of PSNR for drastically lower training time or higher FPS (e.g., surpassing 100 FPS or sub-1 min training), though detailed curves are omitted. The table's anchor points define key speed–accuracy operating regimes (Nath et al., 2 Dec 2025).
5. Deployment and Integration in Surgical Environments
Deployment of G-SHARP targets commercial, intra-operative use via the Holoscan SDK, leveraging NVIDIA IGX Orin and Thor edge systems. The Holoscan application orchestrates pipeline stages using modular operators—EndoNeRFLoaderOp, GsplatLoaderOp, GsplatRenderOp, HolovizOp, ImageSaverOp—processing only camera poses and timestamps at run-time, with all computationally intensive RGB/depth operations conducted offline.
On both IGX Orin and Thor hardware, the solution sustains 60 FPS with per-frame latency under 16 ms. The commercial Apache-2.0 license and minimal dependency stack facilitate seamless operating room integration, with the performance profile satisfying intra-operative visualization requirements (Nath et al., 2 Dec 2025).
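The relation between the 60 FPS target and the sub-16 ms latency figure is a simple frame-budget check (the measured-latency constant below is the reported upper bound, used here only for illustration):

```python
def frame_budget_ms(target_fps: float) -> float:
    """Per-frame time budget in milliseconds for a given frame-rate target."""
    return 1000.0 / target_fps

# At 60 FPS each frame has ~16.67 ms; a per-frame latency under 16 ms
# therefore fits inside the real-time budget with headroom to spare.
budget = frame_budget_ms(60.0)
measured_latency_ms = 16.0  # reported upper bound on per-frame latency
```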