EndoNeRF Pulling Benchmark
- EndoNeRF Pulling Benchmark is a standardized evaluation protocol for 3D reconstruction in dynamic, minimally invasive surgical scenes.
- It assesses reconstruction fidelity via peak signal-to-noise ratio (PSNR) and real-time performance via a rendering-rate threshold (> 60 FPS), ensuring intra-operative applicability.
- The benchmark incorporates advanced techniques such as Gaussian-based neural representations and HexPlane feature grids to address non-rigid soft-tissue deformations and occlusions.
The EndoNeRF Pulling Benchmark is a standardized evaluation protocol for 3D reconstruction pipelines tailored to minimally invasive surgical scenes characterized by dynamic, non-rigid soft-tissue deformation and significant occlusion due to surgical tool interaction. Originally derived from the “pulling” sequence of the EndoNeRF dataset (Wang et al. 2022, 2024), it serves as a critical testbed for assessing the speed, accuracy, and practicality of real-time volumetric modeling systems in surgical contexts, with an emphasis on intra-operative deployability and commercial compatibility (Nath et al., 2 Dec 2025).
1. Benchmark Dataset and Protocol
The EndoNeRF Pulling Benchmark employs a specific dataset segment featuring 63 temporally ordered monocular or stereo endoscopic frames capturing soft tissue being dynamically manipulated by surgical instruments. Each frame provides calibrated camera intrinsics and extrinsics, per-frame RGB images, stereo-derived depth maps, and binary segmentation masks distinguishing tools from tissue.
The principal reconstruction task targets scenes exhibiting dynamic, non-rigid soft-tissue deformation due to instrument pulling, with explicit tool–tissue interactions and extensive occlusions. For each sequence, all 63 frames, along with their associated depth and segmentation masks, are used for model training. For evaluation, the protocol specifies rendering novel views at held-out frames (using the same camera poses) and computing peak signal-to-noise ratio (PSNR) in decibels, comparing rendered RGB to ground truth. In addition, real-time capability is defined operationally as the ability to render at greater than 60 frames per second (FPS) on the specified target hardware.
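The protocol's PSNR metric can be computed directly from rendered and ground-truth RGB frames. A minimal NumPy sketch (the function name and normalization are illustrative, not taken from the benchmark code):

```python
import numpy as np

def psnr(rendered: np.ndarray, ground_truth: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two RGB images scaled to [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform 0.1 error on a normalized image gives MSE = 0.01, i.e. 20 dB.
gt = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
```

For 8-bit images the same formula applies with `max_val=255`; the benchmark compares rendered novel views against held-out ground-truth frames at identical camera poses.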
2. Quantitative Results and Comparative Analysis
Performance on the EndoNeRF Pulling Benchmark is measured by reconstruction fidelity (PSNR) and real-time throughput (FPS). Across published methods, results are summarized as follows:
| Method | Training Time per Scene | PSNR (Pulling) [dB] | FPS > 60 | License |
|---|---|---|---|---|
| EndoNeRF (NeRF) | ~6 h | 35.43 | ✗ | Commercial |
| EndoSurf | ~7 h | 34.91 | ✗ | MIT |
| LerPlane-32k | 8 min | 31.77 | ✗ | BSD-3 |
| Endo-4DGS | 4 min | 37.85 | ✓ | Non-Commercial derivative |
| EndoGaussian | ≤2 min | 37.848 | ✓ | Non-Commercial derivative |
| Deform3DGS | ≈1 min | 37.90 | ✓ | Non-Commercial derivative |
| Surgical Gaussian Surfels | 45 min | 39.06 | ✓ | Non-Commercial derivative |
| G-SHARP (Apache-2.0) | 2 min | 37.98 | ✓ | Commercial (Apache-2.0) |
G-SHARP achieves 37.98 dB PSNR after 2 minutes of training per scene and reliably exceeds 60 FPS, confirming suitability for intra-operative deployment while maintaining full commercial license compatibility. Additional benchmarks indicate that during training, G-SHARP yields 24–27 dB PSNR in full-scene mode and 17–19 dB for tissue-only masked evaluation across EndoNeRF scenes (Nath et al., 2 Dec 2025).
3. Algorithmic Components of State-of-the-Art Pipelines
The G-SHARP pipeline exemplifies algorithmic advances in Gaussian-based neural representations:
- GSplat Differentiable Gaussian Rasterizer: Each 3D Gaussian primitive is parameterized by a mean $\mu_i \in \mathbb{R}^3$, a covariance $\Sigma_i = R_i S_i S_i^\top R_i^\top$ (with rotation $R_i$ and axis scales $S_i$), an opacity $\alpha_i$, and color coefficients $c_i$ (spherical harmonics, up to a fixed degree). Projection to the image plane via the intrinsic matrix and extrinsics yields a 2D mean $\mu'_i$ and covariance $\Sigma'_i$, with image-plane splat weight
$$w_i(x) = \alpha_i \exp\!\left(-\tfrac{1}{2}\,(x - \mu'_i)^\top (\Sigma'_i)^{-1} (x - \mu'_i)\right).$$
Front-to-back compositing of depth-sorted splats yields the pixel color
$$C(x) = \sum_i c_i\, w_i(x) \prod_{j < i} \bigl(1 - w_j(x)\bigr).$$
CUDA-optimized kernel implementations enable efficient multi-view batch rasterization and memory usage.
- Deformable Tissue Reconstruction: Training proceeds in two stages: a coarse geometry initialization (200 iterations, ≈30 s, no deformation) followed by a fine optimization (1,500 iterations, ≈1.5 min) integrating deformation and appearance. The deformation network features a HexPlane grid of features over $(x, y, z, t)$, factorized into six 2D planes, with MLP decoding (8 layers/256 units) predicting per-primitive offsets $(\Delta\mu_i, \Delta r_i, \Delta s_i)$. The deformed parameters update as
$$\mu_i(t) = \mu_i + \Delta\mu_i(t), \qquad r_i(t) = r_i + \Delta r_i(t), \qquad s_i(t) = s_i + \Delta s_i(t).$$
- Composite Loss Function: Fine-stage optimization minimizes a weighted combination of photometric (RGB), depth, and regularization terms; the specific term weights are given in (Nath et al., 2 Dec 2025). Regularization includes time-smoothness, an $\ell_1$ norm, and total-variation (TV) penalties on the spatial and temporal HexPlanes (all with weight 0.01).
- Occlusion Handling: The pipeline distinguishes “tissue-only” (tool-excluded) and “full-scene” (tissue+tool) rendering. Tool masks are systematically enforced during initialization and as supervision. An “invisible mask” aggregates all tool pixels across time, applying a total variation penalty to encourage smooth interpolation in chronically occluded regions.
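The projected splat weight and front-to-back compositing steps above can be sketched for a single pixel in NumPy. This is a toy illustration of the math, not the CUDA rasterizer; all shapes and values are made up:

```python
import numpy as np

def splat_weight(x, mu2d, cov2d, alpha):
    """Gaussian splat weight at pixel x: alpha * exp(-0.5 * d^T Sigma^-1 d)."""
    d = x - mu2d
    return alpha * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

def composite_pixel(x, splats):
    """Front-to-back alpha compositing over depth-sorted (mu2d, cov2d, alpha, color) splats."""
    color = np.zeros(3)
    transmittance = 1.0  # running product of (1 - w_j) over nearer splats
    for mu2d, cov2d, alpha, c in splats:  # assumed sorted near-to-far
        w = splat_weight(x, mu2d, cov2d, alpha)
        color += transmittance * w * np.asarray(c, dtype=float)
        transmittance *= (1.0 - w)
    return color

# A fully opaque splat centered on the pixel contributes its full color,
# and any splat behind it is completely occluded.
x = np.array([0.0, 0.0])
splats = [(np.array([0.0, 0.0]), np.eye(2), 1.0, (1.0, 0.0, 0.0))]
```

The production rasterizer performs the same accumulation per pixel across sorted tiles of splats in parallel CUDA kernels; the transmittance term is what makes the tool-occlusion masking in the benchmark pipelines well defined.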
4. Speed–Accuracy Trade-Offs and Practical Implications
The trade-off between speed and reconstruction fidelity is central in the benchmark. G-SHARP achieves a favorable balance: 2 minutes of training yields 37.98 dB PSNR and real-time rendering above 60 FPS. Methods such as Surgical Gaussian Surfels reach marginally higher fidelity (39.06 dB) at significantly higher compute cost (45 min training). NeRF-based pipelines require up to 6–7 hours for considerably lower fidelity (35 dB PSNR).
Further parameter tuning—such as adjusting Gaussian densification schedules, HexPlane resolutions, MLP size, or loss-term weights—enables trading a few dB of PSNR for drastically lower training time or higher FPS (e.g., surpassing 100 FPS or sub-1 min training), though detailed curves are omitted. The table's anchor points define key speed–accuracy operating regimes (Nath et al., 2 Dec 2025).
5. Deployment and Integration in Surgical Environments
Deployment of G-SHARP targets commercial, intra-operative use via the Holoscan SDK, leveraging NVIDIA IGX Orin and Thor edge systems. The Holoscan application orchestrates pipeline stages using modular operators—EndoNeRFLoaderOp, GsplatLoaderOp, GsplatRenderOp, HolovizOp, ImageSaverOp—processing only camera poses and timestamps at run-time, with all computationally intensive RGB/depth operations conducted offline.
On both IGX Orin and Thor hardware, the solution sustains 60 FPS with per-frame latency under 16 ms. The commercial Apache-2.0 license and minimal dependency stack facilitate seamless operating room integration, with the performance profile satisfying intra-operative visualization requirements (Nath et al., 2 Dec 2025).
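The relation between the 60 FPS target and the sub-16 ms latency figure is a simple frame-budget check (the measured-latency constant below is the reported upper bound, used here only for illustration):

```python
def frame_budget_ms(target_fps: float) -> float:
    """Per-frame time budget in milliseconds for a given frame-rate target."""
    return 1000.0 / target_fps

# At 60 FPS each frame has ~16.67 ms; a per-frame latency under 16 ms
# therefore fits inside the real-time budget with headroom to spare.
budget = frame_budget_ms(60.0)
measured_latency_ms = 16.0  # reported upper bound on per-frame latency
```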