- The paper introduces a novel integration of a native fisheye camera model into 3D Gaussian Splatting, eliminating lossy preprocessing and preserving peripheral details.
- It presents a cross-view joint optimization strategy that enhances geometric consistency and photometric gradient alignment across multiple views.
- Experimental results on datasets like FisheyeNeRF and ScanNet++ show improved SSIM, PSNR, and detail preservation in both edge and interior regions.
DirectFisheye-GS: Native Fisheye Camera Integration and Cross-View Optimization for Gaussian Splatting
Introduction and Motivation
DirectFisheye-GS systematically addresses two central deficiencies in existing 3D Gaussian Splatting (3DGS) pipelines for novel view synthesis (NVS) using fisheye cameras: (1) the inability of 3DGS to natively accommodate nonlinear wide-FOV projections without a lossy preprocessing step, and (2) suboptimal optimization stemming from single-view per-iteration updates which neglect spatial and photometric correlations across views. The work presents both a precise analytic formulation for accurate and differentiable fisheye projection and a principled cross-view joint optimization (CVO) strategy for training, resulting in a fully explicit and high-fidelity pipeline compatible with mainstream rasterization-based 3DGS renderers.
Native Fisheye Camera Model Embedding
The core technical innovation is embedding the Kannala-Brandt polynomial fisheye projection model into the 3DGS rendering and optimization loop. This replaces the conventional undistortion step, which produces black borders, discards boundary content, and lowers effective spatial detail. These effects are particularly severe for wide-angle imagery, where scene information is condensed toward the periphery.
Figure 1: Common fisheye camera projection models, including their analytic parameterization relevant for rasterization and differentiable optimization.
The analytic model supports differentiable forward and inverse mappings essential for end-to-end gradient-based learning. Crucially, the derived Jacobian matrix for the projection allows for precise backpropagation of gradients through the nonlinear transformation, accurately propagating updates to 3D Gaussian means and covariances, especially in regions of nonlinear distortion at large incident angles.
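The projection and its Jacobian can be illustrated with a minimal NumPy sketch. It assumes the common four-coefficient Kannala-Brandt form, theta_d = theta (1 + k1 theta^2 + k2 theta^4 + k3 theta^6 + k4 theta^8), and is not the paper's implementation. The function names (`kb_project`, `kb_jacobian`, `project_covariance`) and the parameter layout are hypothetical, chosen only for illustration.

```python
import numpy as np

def _theta_d(theta, k):
    """Distorted angle: theta * (1 + k1*t^2 + k2*t^4 + k3*t^6 + k4*t^8)."""
    t2 = theta * theta
    return theta * (1 + k[0] * t2 + k[1] * t2**2 + k[2] * t2**3 + k[3] * t2**4)

def kb_project(p, f, c, k):
    """Project a camera-space point p=(x, y, z) to pixel coordinates.

    f=(fx, fy), c=(cx, cy), k=(k1..k4); all names are illustrative assumptions.
    """
    x, y, z = p
    r = np.hypot(x, y)
    theta = np.arctan2(r, z)               # angle between the ray and the optical axis
    # On the optical axis theta_d / r tends to 1 / z, so use that limit.
    scale = _theta_d(theta, k) / r if r > 1e-9 else 1.0 / z
    return f * scale * np.array([x, y]) + c

def kb_jacobian(p, f, k):
    """Analytic 2x3 Jacobian d(u, v)/d(x, y, z); assumes the point is off-axis."""
    x, y, z = p
    r2 = x * x + y * y
    r = np.sqrt(r2)
    if r < 1e-9:
        raise ValueError("sketch does not handle points on the optical axis")
    rho2 = r2 + z * z
    theta = np.arctan2(r, z)
    t2 = theta * theta
    theta_d = _theta_d(theta, k)
    # d(theta_d)/d(theta)
    dtd = 1 + 3 * k[0] * t2 + 5 * k[1] * t2**2 + 7 * k[2] * t2**3 + 9 * k[3] * t2**4
    # d(theta)/d(x, y, z)
    dtheta = np.array([x * z / (r * rho2), y * z / (r * rho2), -r / rho2])
    direction = np.array([x / r, y / r])   # unit direction in the image plane
    # d(direction)/d(x, y, z)
    ddir = np.array([[ y * y, -x * y, 0.0],
                     [-x * y,  x * x, 0.0]]) / r**3
    J = np.outer(direction, dtd * dtheta) + theta_d * ddir
    return f[:, None] * J                   # scale rows by (fx, fy)

def project_covariance(J, cov_cam):
    """First-order propagation of a camera-space 3x3 covariance to 2D."""
    return J @ cov_cam @ J.T

if __name__ == "__main__":
    p = np.array([0.8, -0.5, 0.3])          # steep incident angle (about 72 degrees)
    f = np.array([300.0, 300.0])
    c = np.array([320.0, 240.0])
    k = np.array([0.05, -0.01, 0.002, -0.0003])
    print(kb_project(p, f, c, k))
    print(kb_jacobian(p, f, k))
```

Comparing `kb_jacobian` against central finite differences of `kb_project` is a quick way to check such a derivation before wiring it into a rasterizer's backward pass.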
Cross-View Joint Optimization (CVO) Strategy
Standard 3DGS and previously proposed fisheye extensions (e.g., Fisheye-GS, 3DGUT) rely on random selection of a single view per iteration for stochastic scene coverage, or approximate the projected Gaussian through sampling-based (unscented) covariance propagation, which is insufficient in highly distorted scenarios. DirectFisheye-GS instead introduces a camera association graph built from explicit multi-view 2D-2D feature correspondences (e.g., SIFT) combined with heuristics on the angular divergence between camera poses. For each batch, correlated views with high feature overlap and maximal angular variance are selected for joint optimization.
This design maximizes the likelihood that projected Gaussians correspond to co-visible 3D points, enhancing both geometric consistency and photometric gradient alignment across varying perspectives. The CVO update enforces joint constraints on scale, orientation, SH coefficients, and alpha blending, strongly regularizing model ambiguity particularly at fisheye image borders where single-view gradients are highly anisotropic.
Figure 2: The proposed cross-view joint optimization paradigm contrasts with the single-view updates, promoting geometric and photometric consistency across feature-overlapping, high-diversity views.
Figure 3: Camera association method based on feature overlap and angular divergence, guiding batch sampling for CVO.
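The batch selection described above can be sketched as follows. This is a hedged illustration rather than the paper's algorithm: the pair score (normalized match count times viewing angle), the `min_matches` threshold, and the function names are all assumptions made for the example. Pairwise match counts are taken as given, for instance from an SfM feature-matching stage.

```python
import numpy as np

def viewing_angles(forward):
    """Pairwise angle in radians between camera viewing directions, shape (n, n)."""
    unit = forward / np.linalg.norm(forward, axis=1, keepdims=True)
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    return np.arccos(cos)

def build_association(match_counts, forward, min_matches=50):
    """Return (adjacency, score) for a camera association graph.

    Two views are linked when they share at least `min_matches` feature matches.
    The score favours linked pairs with both high overlap and a large viewing
    angle; this particular product is an illustrative assumption.
    """
    angles = viewing_angles(forward)
    adj = match_counts >= min_matches
    np.fill_diagonal(adj, False)                    # a view is not its own partner
    overlap = match_counts / max(match_counts.max(), 1)
    score = np.where(adj, overlap * angles, 0.0)
    return adj, score

def sample_batch(score, anchor, batch_size):
    """Anchor view plus its highest-scoring associated views."""
    order = np.argsort(-score[anchor], kind="stable")
    partners = [int(j) for j in order if score[anchor, j] > 0][: batch_size - 1]
    return [anchor] + partners

if __name__ == "__main__":
    # Four hypothetical cameras: three looking roughly forward, one backward.
    forward = np.array([
        [0.0, 0.0, 1.0],
        [np.sin(0.2), 0.0, np.cos(0.2)],
        [np.sin(0.6), 0.0, np.cos(0.6)],
        [0.0, 0.0, -1.0],
    ])
    match_counts = np.array([
        [0, 400, 300, 0],
        [400, 0, 350, 0],
        [300, 350, 0, 10],
        [0, 0, 10, 0],
    ])
    adj, score = build_association(match_counts, forward)
    print(sample_batch(score, anchor=0, batch_size=3))
```

In a training loop, the losses of all views in the returned batch would be summed before a single optimizer step, so that each Gaussian receives gradients from several co-visible perspectives at once.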
Experimental Analysis
Extensive benchmarks are presented on FisheyeNeRF, ScanNet++, and Den-SOFT (spanning object-centric to large-scale, dense VR/AR scenes). DirectFisheye-GS consistently reports either SOTA or competitive metrics against both native and derivative baselines:
- On FisheyeNeRF, DirectFisheye-GS attains average SSIM/PSNR/LPIPS scores of 0.8284/26.25/0.2295—matching or exceeding Fisheye-GS and 3DGUT, especially in high-distortion edge regions.
- On ScanNet++ test views, a similar trend is observed, with DirectFisheye-GS outperforming dense neural and explicit baselines in both perceptual and structural metrics.
- On large-scale Den-SOFT sequences, the method reports clear gains at the challenging boundaries, with sharpness, edge structure, and texture integrity preserved—areas where prior work exhibits excessive blurring, mosaic artifacts, or floating Gaussians.
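Because the comparisons above distinguish boundary from interior quality, a region-wise metric is a natural way to reproduce that kind of analysis. The sketch below computes PSNR over an interior mask and an edge mask. The radial split and the `inner_frac` threshold are assumptions for illustration; how the paper delimits edge regions is not specified here.

```python
import numpy as np

def psnr(pred, target, mask=None, max_val=1.0):
    """PSNR in dB, optionally restricted to pixels where mask (H, W) is True."""
    err = (pred.astype(np.float64) - target.astype(np.float64)) ** 2
    if mask is not None:
        err = err[mask]
    mse = err.mean()
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def radial_masks(height, width, inner_frac=0.6):
    """Split pixels into (interior, edge) by normalized distance from the image center.

    The distance is 0 at the center and 1 at the middle of each image side.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2, (width - 1) / 2
    rad = np.hypot((ys - cy) / max(cy, 1e-9), (xs - cx) / max(cx, 1e-9))
    interior = rad <= inner_frac
    return interior, ~interior

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((64, 64, 3))
    pred = np.clip(target + rng.normal(0.0, 0.05, target.shape), 0.0, 1.0)
    interior, edge = radial_masks(64, 64)
    print("interior PSNR:", psnr(pred, target, interior))
    print("edge PSNR:", psnr(pred, target, edge))
```

Reporting the two numbers side by side makes it visible when a method's average score hides degraded quality at the periphery.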
Figure 4: Qualitative comparison on FisheyeNeRF, demonstrating fewer floaters, improved detail, and sharper boundaries for DirectFisheye-GS over prior methods.
Figure 5: Distribution of Gaussian scales in FisheyeNeRF-Chairs—CVO eliminates extreme shapes and anomalous scaling at the image periphery.
Ablations show that including CVO in both fisheye and pinhole scenes improves convergence, yields more uniform and realistic Gaussian parameter distributions, and raises PSNR/SSIM. Toy experiments demonstrate that DirectFisheye-GS delivers more stable, anisotropy-free fits at wide-angle, high-distortion image boundaries, avoiding the "mosaic" artifacts typical of 3DGUT.
Figure 6: Toy distortion experiment—strong fisheye warping leads to gradient misalignment and unstable optimization; native modeling mitigates these issues.
Figure 7: DirectFisheye-GS yields clean, artifact-free reconstructions near fisheye image boundaries; 3DGUT displays geometry degradation and visible discontinuities.
On Den-SOFT, DirectFisheye-GS provides numerically and visually superior results in both boundary and interior regions, a trend that holds across all large-scale evaluations.
Figure 8: Qualitative outcomes on Den-SOFT, showing high-frequency detail and structural consistency in both indoor and outdoor environments.
Implications for Large-Scale and Real-Time Computer Vision
The ability to natively support arbitrary camera models without sacrificing rasterization-based rendering efficiency obviates the need for destructive preprocessing and allows DirectFisheye-GS to act as a drop-in module for most industrial and research 3DGS pipelines. CVO is model-agnostic and provides a general recipe for robust optimization in settings with strong nonlinear projection, extreme FOV, or dense stereo coverage. The results indicate improved fidelity not just at image centers but also at boundaries, directly impacting applications in immersive VR/AR, SLAM, robotic navigation, and wide-FOV video postproduction.
Notably, the integration preserves compatibility with standard 3DGS viewers and is not restricted by intermediate representations (e.g., no ray-tracing overheads), ensuring scalability and interoperability.
Theoretical and Practical Limitations
While the method exhibits robust quantitative and qualitative improvements, the performance gain from CVO under extremely challenging lighting (view-dependent reflectance, refraction) is limited, motivating future work on richer SH modeling, explicit reflectance parameterization, or hybrid rasterization/ray-tracing approaches. Additionally, the reliance on structure-from-motion for feature association could be strengthened with improved matching or semantic information for further robustness.
Figure 9: Ablation of cross-view joint optimization on different camera models, emphasizing the universality of the proposed strategy.
Conclusion
DirectFisheye-GS presents an explicit, differentiable solution for high-fidelity NVS with native fisheye images, underpinned by analytic camera modeling and cross-view optimization. The proposed approach achieves high rendering quality and efficient training dynamics while maintaining architectural compatibility with established explicit representations. CVO offers a general augmentation for all explicit multiview pipelines, not limited to Gaussians or splatting, and the analytic gradient propagation through nonlinear projections sets a methodological baseline for future research in wide-FOV and nonpinhole imaging. This work establishes an extensible framework, opening directions for integrating advanced view-dependent effects, hybrid camera types, and further advances in real-time neural rendering.