- The paper introduces TranSplat, a transformer-based approach that enhances 3D reconstruction with depth-aware deformable matching and monocular depth priors.
- It employs a coarse-to-fine matching strategy together with a Depth Refine U-Net module that sharpens depth maps under sparse-view conditions.
- Experimental results on RealEstate10K, ACID, and DTU benchmarks demonstrate state-of-the-art performance and strong cross-dataset generalization within 200K training iterations.
TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers
In "TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers," Zhang et al. introduce a cutting-edge method for generalizable sparse-view 3D scene reconstruction. This method leverages a transformer-based architecture, termed TranSplat, to significantly enhance the rendering and reconstruction of 3D scenes from sparse multi-view images.
Overview
Traditionally, approaches like Neural Radiance Fields (NeRF) and, more recently, generalizable 3D Gaussian Splatting (G-3DGS) methods have demonstrated impressive results in 3D reconstruction and novel view synthesis. However, these methods often struggle in sparse-view settings, particularly when matching features across views in scenes with non-overlapping regions, low texture, or repetitive patterns. Existing methods require accurate multi-view feature matching, which is difficult to achieve without scene-specific optimization.
TranSplat addresses these challenges by integrating several key strategies:
- Depth-Aware Deformable Matching Transformer (DDMT): This module prioritizes depth candidates with high confidence, enabling better feature matching across different views.
- Depth Refine U-Net: This module incorporates monocular depth priors to refine depth maps in regions without cross-view matches.
- Camera Parameter Encoding: Camera projection matrices are integrated into the CNN features to provide global spatial information.
Methodology
Feature Extraction
TranSplat employs a standard CNN-and-Transformer framework to extract multi-view image features. Camera parameters are incorporated through a squeeze-excitation (SE) layer, which injects global spatial information into the feature maps. In addition, a pretrained DepthAnythingV2 model supplies monocular depth priors.
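To make the camera-parameter injection concrete, here is a minimal PyTorch sketch of an SE-style gate conditioned on a flattened projection matrix. The class name `CameraAwareSE`, the `cam_dim` of 16 (a 4x4 matrix), and the two-layer MLP are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CameraAwareSE(nn.Module):
    """SE-style gate that modulates CNN features with a flattened camera
    projection matrix. Layout is an illustrative guess, not the paper's."""
    def __init__(self, channels: int, cam_dim: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels + cam_dim, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; proj: (B, 4, 4) projection matrix
        squeezed = feat.mean(dim=(2, 3))                       # global average pool
        gate = self.mlp(torch.cat([squeezed, proj.flatten(1)], dim=1))
        return feat * gate[:, :, None, None]                   # channel re-weighting
```

The gate simply rescales each feature channel according to the view's geometry before cross-view matching takes place.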
Coarse Matching
The coarse matching module constructs depth candidates via plane-sweep stereo and computes multi-view feature similarities over them with the DDMT module, yielding an initial depth distribution. This coarse distribution then guides more focused depth predictions during fine matching.
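The toy function below illustrates the coarse-matching idea under simplifying assumptions: source features are taken as already warped onto the reference view at each depth plane (the homography warping step is omitted), and a soft-argmax over dot-product similarities yields the initial depth. Names and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def coarse_depth_distribution(ref_feat, warped_src_feats, depth_candidates):
    """Toy version of plane-sweep coarse matching.

    ref_feat:         (B, C, H, W) reference-view features
    warped_src_feats: (B, D, C, H, W) source features pre-warped onto the
                      reference view at each of D depth candidates
    depth_candidates: (D,) candidate depths from the plane sweep
    Returns the per-pixel depth distribution and its expected depth.
    """
    # Dot-product similarity between reference and each warped source slice.
    sim = (ref_feat.unsqueeze(1) * warped_src_feats).sum(dim=2)      # (B, D, H, W)
    prob = F.softmax(sim / ref_feat.shape[1] ** 0.5, dim=1)          # distribution over depths
    depth = (prob * depth_candidates.view(1, -1, 1, 1)).sum(dim=1)   # soft-argmax depth
    return prob, depth
```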
Coarse-to-Fine Matching
The DDMT module refines the initial depth through deformable sampling, allowing the network to prioritize and adjust its attention based on depth confidence maps. This improves matching accuracy in traditionally challenging areas, such as those with low texture or repetitive patterns.
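A generic single-head deformable-sampling step is sketched below to convey the mechanism; the offset and weight heads, the number of sampling points, and the confidence weighting are assumptions for illustration, not the paper's DDMT specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableDepthSampler(nn.Module):
    """Toy deformable sampling: each query pixel predicts K 2-D offsets and
    attention weights, then aggregates bilinearly sampled features.
    Confidence from the coarse stage biases the weights."""
    def __init__(self, channels: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Conv2d(channels, 2 * num_points, 1)
        self.weight_head = nn.Conv2d(channels, num_points, 1)

    def forward(self, query, value, confidence):
        # query, value: (B, C, H, W); confidence: (B, 1, H, W) in [0, 1]
        B, C, H, W = query.shape
        offsets = self.offset_head(query).view(B, self.num_points, 2, H, W)
        weights = self.weight_head(query).softmax(dim=1) * confidence  # focus on confident pixels

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=query.device),
            torch.linspace(-1, 1, W, device=query.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1)                       # (H, W, 2)

        out = torch.zeros_like(query)
        for k in range(self.num_points):
            off = offsets[:, k].permute(0, 2, 3, 1)                # (B, H, W, 2)
            grid = base.unsqueeze(0) + off                         # shifted sample locations
            sampled = F.grid_sample(value, grid, align_corners=True)
            out = out + weights[:, k : k + 1] * sampled
        return out
```

The key design point is that sampling locations are learned per pixel, so attention concentrates around the most plausible depth hypotheses instead of scanning a fixed window.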
Depth Refine U-Net
By leveraging monocular depth priors, the Depth Refine U-Net refines depth maps, especially in regions with insufficient cross-view matches, combining the geometric consistency of cross-view matching with the reliability of monocular depth cues.
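A minimal two-level U-Net shows one plausible way to fuse the matching-based depth, its confidence, and the monocular prior; the inputs, channel widths, and residual formulation are all guesses for illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class DepthRefineUNet(nn.Module):
    """Minimal two-level U-Net fusing matching-based depth with a
    monocular prior. Assumes even H and W; widths are illustrative."""
    def __init__(self, width: int = 32):
        super().__init__()
        def block(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU(inplace=True))
        # Inputs: matched depth, its confidence, and the monocular depth prior.
        self.enc1 = block(3, width)
        self.down = nn.Conv2d(width, width * 2, 3, stride=2, padding=1)
        self.enc2 = block(width * 2, width * 2)
        self.up = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
        self.dec = block(width * 2, width)
        self.out = nn.Conv2d(width, 1, 1)

    def forward(self, matched_depth, confidence, mono_depth):
        x = torch.cat([matched_depth, confidence, mono_depth], dim=1)  # (B, 3, H, W)
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))              # skip connection
        # Predict a residual so the network corrects, rather than replaces,
        # the matching-based depth in poorly matched regions.
        return matched_depth + self.out(d)
```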
Gaussian Parameter Prediction
The final step involves predicting 3D Gaussian parameters—center, opacity, covariance, and color. These parameters enable efficient and high-quality rendering of novel views.
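The sketch below follows the general pixel-aligned G-3DGS recipe from this family of methods rather than TranSplat's exact head: per-pixel depth is unprojected along camera rays into Gaussian centers, and a 1x1 convolution predicts opacity, a rotation quaternion and scales (which parameterize the covariance), and color coefficients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Per-pixel Gaussian parameter prediction: a generic pixel-aligned
    G-3DGS sketch, not TranSplat's exact head."""
    def __init__(self, channels: int, sh_degree: int = 0):
        super().__init__()
        self.color_dim = 3 * (sh_degree + 1) ** 2
        # opacity (1) + rotation quaternion (4) + scale (3) + color
        self.head = nn.Conv2d(channels, 1 + 4 + 3 + self.color_dim, 1)

    def forward(self, feat, depth, rays_o, rays_d):
        # feat: (B, C, H, W); depth: (B, 1, H, W)
        # rays_o/rays_d: (B, 3, H, W) camera rays from known intrinsics/extrinsics
        centers = rays_o + depth * rays_d              # unproject depth to 3-D centers
        raw = self.head(feat)
        opacity = torch.sigmoid(raw[:, :1])
        rot = F.normalize(raw[:, 1:5], dim=1)          # unit quaternion
        scale = torch.exp(raw[:, 5:8])                 # positive scales -> covariance
        color = raw[:, 8:]                             # e.g. SH coefficients
        return centers, opacity, rot, scale, color
```

Tying each Gaussian's center to a camera ray means the refined depth map directly determines scene geometry, which is why accurate depth estimation dominates reconstruction quality in this pipeline.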
Experimental Results
TranSplat was evaluated on the RealEstate10K and ACID benchmarks, where it achieves state-of-the-art results across PSNR, SSIM, and LPIPS. Notably, it reaches this performance within only 200K training iterations, outperforming methods that require longer training.
Furthermore, TranSplat demonstrates impressive cross-dataset generalization. In zero-shot testing on the DTU dataset with a model trained on RealEstate10K, TranSplat outperforms other state-of-the-art methods by a significant margin, indicating robustness in varied, unseen environments.
Implications
The proposed TranSplat method contributes several theoretical and practical advancements:
- Theoretical Contributions: The introduction of the DDMT module and Depth Refine U-Net showcases an innovative approach to improving depth estimation and feature matching in sparse-view settings.
- Practical Benefits: By achieving high-quality 3D reconstruction with fewer views, TranSplat can be applied to real-world scenarios where acquiring dense multi-view images is impractical. This has significant implications for fields like virtual reality, augmented reality, and autonomous navigation.
Future Directions
Looking ahead, the potential for further improving the efficiency and accuracy of TranSplat is promising. Future work could explore:
- Enhanced training methodologies for improving cross-view consistency in more complex scenes.
- Integrating more sophisticated priors from large-scale pre-trained models to boost performance further in non-overlapping and repetitive regions.
- Adapting TranSplat to handle dynamic scenes, thereby broadening its applicability in real-time applications.
Conclusion
TranSplat represents a substantial advance in generalizable sparse-view 3D scene reconstruction. By leveraging a transformer-based architecture and monocular depth priors, the method addresses key challenges in feature matching and depth estimation, setting a new benchmark in both efficiency and effectiveness. TranSplat's success across benchmarks and its strong cross-dataset generalization underscore its potential for widespread adoption in diverse applications, paving the way for future innovations in 3D scene reconstruction.