EndoGaussians: 3D Gaussian Splatting for Endoscopy
- EndoGaussians is a framework that uses explicit 3D Gaussian splats to reconstruct deformable tissues from endoscopic RGBD video streams.
- It achieves real-time performance and improved accuracy through optimized, learned deformation models and physical depth priors.
- The method integrates video inpainting, point-cloud initialization, and hallucination masking to enhance interpretability for intraoperative visualization and simulation.
EndoGaussians is an explicit 3D Gaussian splatting framework for dynamic, single-view reconstruction of deformable tissues from endoscopic RGBD video streams. Addressing technical and practical limitations of preceding neural radiance field (NeRF)-based techniques, EndoGaussians advances accuracy, interpretability, and real-time performance in the 3D modeling of soft tissues under non-rigid motion, laying groundwork for improved intraoperative visualization, analysis, and medical simulation (Chen et al., 2024).
1. Rationale and Architectural Principles
EndoGaussians replaces the implicit NeRF-based volumetric representation with a sparse, explicit set of 3D Gaussian splats. Each Gaussian encodes a local “particle” of tissue structure, transparently distinguishing observed, data-driven anatomy from hallucinated or uncertain regions. In contrast to NeRF-based methods such as EndoNeRF, EndoSurf, and ForPlane, the framework directly optimizes these splats from a stream of single-view RGBD frames, efficiently enforcing physical priors on depth and deformation. Key architectural attributes include:
- Real-time optimization and rendering via Gaussian splatting, reducing convergence times to minutes per video.
- Robustness to soft-tissue motion using learned, per-Gaussian dynamic deformation models.
- State-of-the-art reconstruction metrics (PSNR, SSIM, LPIPS) on challenging endoscopic datasets.
- Explicit hallucination masking, demarcating uncertain or occluded anatomical regions.
2. Mathematical Model of 3D Gaussian Splatting
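The fundamental unit of EndoGaussians is a 3D Gaussian splat parameterized by mean $\mu_i \in \mathbb{R}^3$, covariance $\Sigma_i$ (positive semidefinite via the factorization $\Sigma_i = Q_i S_i S_i^\top Q_i^\top$, with orientation $Q_i$ and diagonal scale $S_i$), opacity logit $o_i$ (rendered through a sigmoid, $\mathrm{sigmoid}(o_i)$), and appearance features encoded using spherical harmonic coefficients $\mathbf{h}_i$.
The global, continuous density field is
$$\sigma(\mathbf{x}) = \sum_i w_i \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mu_i)^\top \Sigma_i^{-1} (\mathbf{x}-\mu_i)\right),$$
where weights $w_i$ are tied to opacity. Color and density along a ray $\mathbf{r}(s)$ with direction $\mathbf{d}$ are rendered via volumetric integration:
$$C(\mathbf{r}) = \int T(s)\, \sigma(\mathbf{r}(s))\, \mathbf{c}(\mathbf{r}(s), \mathbf{d})\, ds,$$
with transmittance $T(s) = \exp\!\left(-\int_0^s \sigma(\mathbf{r}(u))\, du\right)$. Practically, this is implemented with ordered, discrete “alpha-splat” compositing:
$$\hat{C} = \sum_{i=1}^{N} \mathbf{c}_i\, \alpha_i \prod_{j<i} (1-\alpha_j), \qquad \hat{D} = \sum_{i=1}^{N} d_i\, \alpha_i \prod_{j<i} (1-\alpha_j),$$
where the sum runs over depth-sorted Gaussians, $\alpha_i$ and $\mathbf{c}_i$ are the opacity and color of the $i$-th Gaussian at the pixel, and $d_i$ is the per-Gaussian depth contribution.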
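The compositing step can be illustrated with a minimal NumPy sketch. It is not the authors' renderer: it handles a single ray, assumes the splats are already sorted near-to-far, and the function names (`covariance_from_rotation_scale`, `composite_front_to_back`) are hypothetical. The rotation-and-scale covariance factorization is the one commonly used in 3D Gaussian splatting and is assumed here.

```python
import numpy as np

def sigmoid(x):
    """Map an opacity logit to an opacity in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def covariance_from_rotation_scale(rotation, scales):
    """Build Sigma = Q S S^T Q^T, which is positive semidefinite by construction.

    rotation: (3, 3) orthonormal matrix; scales: (3,) per-axis standard deviations.
    """
    S = np.diag(scales)
    return rotation @ S @ S.T @ rotation.T

def composite_front_to_back(colors, depths, alphas):
    """Alpha-composite splats along one ray (assumed sorted near-to-far).

    colors: (N, 3), depths: (N,), alphas: (N,) with values in [0, 1].
    Returns the composited color, composited depth, and per-splat weights.
    """
    colors = np.asarray(colors, dtype=float)
    depths = np.asarray(depths, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching splat i: product of (1 - alpha_j) over all j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    color = (weights[:, None] * colors).sum(axis=0)
    depth = (weights * depths).sum()
    return color, depth, weights

# Two half-transparent splats: a red one in front of a blue one.
color, depth, weights = composite_front_to_back(
    colors=[[1, 0, 0], [0, 0, 1]], depths=[1.0, 2.0], alphas=[0.5, 0.5]
)
print(weights)  # [0.5  0.25]
```

The same weights drive both color and depth, which is why a depth map comes out of the renderer at essentially no extra cost and can be supervised directly.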
3. Spatiotemporal Deformation and Regularization
To model tissue motion, each Gaussian is equipped with a time-dependent warping function:
$$\mu_i(t) = R_i(t)\, \mu_i^{0} + \mathbf{t}_i(t),$$
where $\mu_i^{0}$ is its canonical position and $R_i(t)$, $\mathbf{t}_i(t)$ are the learned rotation and translation at time $t$. All deformation, shape, appearance, and opacity variables are optimized jointly.
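As a rough illustration of the per-Gaussian warp, the sketch below applies one rotation and one translation to each canonical mean. The array layout and the helper names `warp_means` and `rotation_z` are assumptions made for this example, not part of the published method.

```python
import numpy as np

def rotation_z(angle):
    """3x3 rotation about the z-axis by `angle` radians (helper for the demo)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def warp_means(mu_canonical, rotations, translations):
    """Warp canonical Gaussian means to one time step.

    mu_canonical: (N, 3) canonical positions
    rotations:    (N, 3, 3) per-Gaussian rotation at this time step
    translations: (N, 3) per-Gaussian translation at this time step
    Returns (N, 3) warped positions: rotation applied first, then translation.
    """
    return np.einsum("nij,nj->ni", rotations, mu_canonical) + translations

# One Gaussian at (1, 0, 0), rotated 90 degrees about z and lifted by 1 along z.
mu0 = np.array([[1.0, 0.0, 0.0]])
R = rotation_z(np.pi / 2)[None]          # shape (1, 3, 3)
t = np.array([[0.0, 0.0, 1.0]])
print(np.round(warp_means(mu0, R, t), 6))  # [[0. 1. 1.]]
```

In a trainable version the rotations and translations would be learnable parameters (or network outputs) per Gaussian and per frame; the indexing shown here stays the same.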
Physical plausibility is enforced with several losses:
- Rigid-pair loss: Maintains relative positions of neighbors across frames
- Rotational smoothness: Encourages consistent rotations among neighbors
- Isometric regularization: Preserves inter-Gaussian distances
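The three regularizers above can be sketched as follows. The exact formulations used by EndoGaussians are not given here, so these are one plausible reading of the descriptions, modeled on local-rigidity losses common in dynamic Gaussian splatting. The neighbor list `pairs`, the function names, and the unweighted averaging are all assumptions.

```python
import numpy as np

def isometry_loss(mu_0, mu_t, pairs):
    """Penalize changes in neighbor distances relative to the canonical frame."""
    i, j = pairs[:, 0], pairs[:, 1]
    d0 = np.linalg.norm(mu_0[j] - mu_0[i], axis=1)
    dt = np.linalg.norm(mu_t[j] - mu_t[i], axis=1)
    return float(np.abs(dt - d0).mean())

def rigid_pair_loss(mu_prev, mu_t, R_prev, R_t, pairs):
    """Offsets to neighbors, expressed in Gaussian i's local frame, should
    stay constant between consecutive frames."""
    i, j = pairs[:, 0], pairs[:, 1]
    # "nji,nj->ni" applies R_i^T to each offset (world frame -> local frame).
    off_prev = np.einsum("nji,nj->ni", R_prev[i], mu_prev[j] - mu_prev[i])
    off_t = np.einsum("nji,nj->ni", R_t[i], mu_t[j] - mu_t[i])
    return float(np.linalg.norm(off_t - off_prev, axis=1).mean())

def rotation_smoothness_loss(R_prev, R_t, pairs):
    """Neighbors should undergo the same frame-to-frame relative rotation."""
    i, j = pairs[:, 0], pairs[:, 1]
    rel = R_t @ np.transpose(R_prev, (0, 2, 1))   # (N, 3, 3) relative rotations
    return float(np.linalg.norm(rel[j] - rel[i], axis=(1, 2)).mean())
```

A useful sanity check on any such formulation is that a global rigid motion (one rotation and translation shared by all Gaussians) should incur zero penalty, while stretching the point set should not.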
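Supervision incorporates photometric and depth L1 terms and an optional Huber-style depth smoothness prior. The aggregate objective at frame $t$ is a weighted sum of these terms and the regularizers above:
$$\mathcal{L}_t = \mathcal{L}_{\mathrm{rgb}} + \lambda_{d}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{rigid}}\,\mathcal{L}_{\mathrm{rigid}} + \lambda_{\mathrm{rot}}\,\mathcal{L}_{\mathrm{rot}} + \lambda_{\mathrm{iso}}\,\mathcal{L}_{\mathrm{iso}} + \lambda_{s}\,\mathcal{L}_{\mathrm{smooth}},$$
where the $\lambda$ coefficients weight the depth, rigid-pair, rotational-smoothness, isometric, and optional depth-smoothness terms.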
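The supervision terms can be sketched in NumPy as below. Restricting the L1 terms to a tissue mask, the finite-difference form of the smoothness prior, and the term names and weights in `total_loss` are illustrative assumptions; the published weights are not reproduced here.

```python
import numpy as np

def masked_l1(pred, target, mask):
    """Mean absolute error over elements where mask == 1 (e.g. visible tissue)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if pred.ndim == 3:                      # broadcast an (H, W) mask over channels
        mask = mask[..., None]
    m = np.broadcast_to(mask, pred.shape)
    return float((np.abs(pred - target) * m).sum() / max(m.sum(), 1.0))

def huber(x, delta=1.0):
    """Huber penalty: quadratic near zero, linear beyond `delta`."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def depth_smoothness(depth, delta=1.0):
    """Huber penalty on horizontal and vertical depth differences."""
    dx = depth[:, 1:] - depth[:, :-1]
    dy = depth[1:, :] - depth[:-1, :]
    return float(huber(dx, delta).mean() + huber(dy, delta).mean())

def total_loss(terms, weights):
    """Weighted sum of named loss terms; keys and weights are hypothetical."""
    return sum(weights[name] * value for name, value in terms.items())
```

The Huber form keeps the smoothness prior from over-penalizing genuine depth discontinuities, such as tissue folds, because large differences are charged linearly rather than quadratically.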
4. Computational Pipeline
The EndoGaussians pipeline comprises four principal phases:
- Video Inpainting: Tool and occlusion removal with a Flow-Guided Transformer (FGT), yielding clean RGBD images and soft-tissue masks.
- Point-Cloud Initialization: Dense 3D points are back-projected from each (x, y, D(x,y)) tuple as $P = D(x,y)\, K^{-1} [x, y, 1]^\top$, where $K$ is the camera intrinsic matrix.
A Gaussian is seeded per point with small initial covariance.
- Camera Calibration: Intrinsics are known; extrinsics estimated from stereo or SLAM.
- Joint Training: All Gaussian and deformation variables are optimized using Adam, supervised by photometric, depth, deformation, and hallucination-mask losses. Training typically converges in 20–30 minutes on a 100–200 frame sequence.
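The point-cloud initialization step amounts to standard pinhole back-projection of a depth map. A minimal sketch follows, assuming a 3x3 intrinsic matrix `K`, depth measured along the camera z-axis, and an optional binary tissue mask. The function name `backproject_depth` and the rule of dropping zero-depth pixels are assumptions for illustration.

```python
import numpy as np

def backproject_depth(depth, K, mask=None):
    """Lift every valid pixel (x, y, D(x, y)) to a 3D point in camera coordinates.

    depth: (H, W) depth map; K: (3, 3) camera intrinsics;
    mask:  optional (H, W) array, nonzero where tissue is visible.
    Returns an (M, 3) array of points, one per valid pixel.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]                       # row (y) and column (x) indices
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)
    rays = pixels @ np.linalg.inv(K).T                # K^{-1} [x, y, 1]^T for each pixel
    points = rays * depth.reshape(-1, 1)              # scale each ray by its depth
    valid = depth.reshape(-1) > 0                     # skip pixels without depth
    if mask is not None:
        valid &= mask.reshape(-1).astype(bool)
    return points[valid]

# Toy camera: focal length 100 px, principal point at pixel (1, 1).
K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
points = backproject_depth(np.full((3, 3), 2.0), K)
print(points.shape)  # (9, 3)
```

Each returned point would then seed one Gaussian, with its mean at the point and a small initial covariance as described above.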
5. Quantitative and Qualitative Evaluation
Empirical comparisons on EndoNeRF and SCARED datasets demonstrate substantial performance improvements over prior approaches. A summary of principal measurements for a single scene is provided below:
| Metric | ForPlane [MICCAI ’23] | EndoGaussians |
|---|---|---|
| PSNR | 36.457 | 37.654 |
| SSIM | 0.946 | 0.965 |
| LPIPS | 0.058 | 0.036 |
| Render time/frame (s) | ~1.7 | ~0.04 |
Compared to EndoNeRF and EndoSurf, EndoGaussians improves PSNR by 1–2 dB and SSIM by 1–3 points, while enabling real-time rendering at 25 fps (about 0.04 s per frame). Reconstructed RGB and depth frames exhibit sharper anatomical boundaries and less hallucination, particularly in vessel and sulcal regions. Smoothness and stability during rapid deformation are attributed to the rigid-pair and rotational-smoothness losses. Ablations show that omitting the depth loss introduces drift, removing deformation regularization leads to Gaussian collapse, and removing the hallucination loss causes tool-occluded regions to be spuriously reconstructed.
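For readers who want to relate these numbers to each other, the sketch below shows how PSNR is conventionally computed and how per-frame render time converts to frame rate. It is not the paper's evaluation code; images are assumed to be scaled to [0, 1], and the function names are chosen for this example.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def frames_per_second(seconds_per_frame):
    """Convert a per-frame render time into a frame rate."""
    return 1.0 / seconds_per_frame

# Render times taken from the table above (seconds per frame).
print(round(frames_per_second(0.04), 1))    # 25.0
print(round(1.7 / 0.04, 1))                 # 42.5
```

Because PSNR is logarithmic in mean squared error, a gain of about 1.2 dB (36.457 to 37.654 in the table) corresponds to roughly a 24% reduction in MSE.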
6. Interpretability, Limitations, and Future Directions
EndoGaussians enables explicit, interpretable separation of observed and hallucinated content by mapping splat assignments directly onto the 3D geometry. This makes uncertainty visible and supports more reliable intraoperative analytics. However, the model currently requires precomputed masks and depth, as well as moderate GPU memory. Extreme topological changes in tissue may exceed the representational power of the per-Gaussian deformation model.
Potential clinical applications include real-time 3D display for VR surgical navigation, quantitative tissue motion tracking in minimally invasive surgery, and synthetic data generation for robotic surgery training (Chen et al., 2024). Future research may focus on extending the deformation model to accommodate more extensive structural changes and on integrating end-to-end learning with on-the-fly inpainting and depth inference.