- The paper presents a novel visibility-aware feature fusion method that computes similarity from 2D projections to accurately reconstruct 3D scenes.
- It introduces ray-based voxel sparsification that locally selects occupied voxels along visual rays, effectively preserving fine structural details.
- The residual learning strategy for TSDF prediction refines reconstructions across scales, helping the method outperform state-of-the-art approaches in accuracy and completeness on standard benchmarks.
VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos
The paper "VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos" presents a novel approach for online 3D scene reconstruction utilizing posed monocular RGB videos. The primary focus of this research is on enhancing the feature fusion process by incorporating visibility awareness into the reconstruction pipeline, addressing limitations of previous methods that did not explicitly account for voxel visibility.
Key Contributions
The authors propose three major contributions:
- Visibility-aware Feature Fusion: Unlike prior works, this paper introduces an explicit visibility prediction mechanism for voxel feature aggregation. By computing a similarity matrix from the 2D projections of each voxel's features across the view images, the model infers per-view visibility weights. Because visibility is directly supervised with ground truth, the fusion step can downweight occluded views and recover more accurate surface geometry (a minimal code sketch of this fusion step follows this list).
- Ray-based Voxel Sparsification: The paper critiques past volumetric approaches that apply a single global threshold for voxel sparsification, which often over-sparsifies the volume and loses fine detail. The proposed solution sparsifies locally along each visual ray: within ray-based local windows, only the most probable occupied voxels are kept, preserving structural detail even in thin and complex regions (also sketched in code after this list).
- Residual Learning for TSDF Prediction: Extending the coarse-to-fine prediction methodology, the model learns residuals between the TSDFs of successive scales. Predicting only the TSDF residual at each finer level improves estimation accuracy and yields smoother refinements of the reconstructed 3D model (see the third sketch after this list).
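To make the fusion step concrete, here is a minimal PyTorch-style sketch, assuming per-voxel features have already been gathered from V views into a tensor of shape (V, N, C). The paper learns visibility weights from the pairwise similarity matrix with ground-truth supervision; this sketch instead uses each view's mean similarity to the other views as a stand-in score, so the function name and the scoring rule are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fuse_with_visibility(feats: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Fuse multi-view voxel features using similarity-derived visibility weights (sketch)."""
    V, N, C = feats.shape
    normed = F.normalize(feats, dim=-1)                       # unit-length features per view
    sim = torch.einsum('vnc,wnc->nvw', normed, normed)        # (N, V, V) pairwise cosine similarity
    score = sim.mean(dim=-1)                                   # (N, V): agreement with the other views
    weights = torch.softmax(score / temperature, dim=-1)       # (N, V), sums to 1 over the views
    fused = torch.einsum('nv,vnc->nc', weights, feats)         # visibility-weighted average per voxel
    return fused

# Toy usage: 5 views, 1000 voxels, 32-dim features -> (1000, 32) fused features
fused = fuse_with_visibility(torch.randn(5, 1000, 32))
```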
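The ray-based sparsification can be illustrated in the same spirit. The sketch below assumes occupancy probabilities already sampled along each visual ray (shape (R, S)) and keeps only the most probable samples inside fixed local windows along the ray, rather than applying one global threshold; the window size, the count kept per window, and the probability floor are illustrative values, not parameters from the paper.

```python
import torch

def sparsify_along_rays(ray_occ: torch.Tensor, window: int = 9,
                        keep_per_window: int = 2, min_prob: float = 0.05) -> torch.Tensor:
    """Return a boolean mask (R, S) keeping the locally most probable samples on each ray."""
    R, S = ray_occ.shape
    keep = torch.zeros_like(ray_occ, dtype=torch.bool)
    rows = torch.arange(R).unsqueeze(1)                        # (R, 1) row indices for fancy indexing
    for start in range(0, S, window):                          # non-overlapping windows along depth
        chunk = ray_occ[:, start:start + window]               # (R, w) occupancy inside the window
        k = min(keep_per_window, chunk.shape[1])
        topk = chunk.topk(k, dim=1).indices + start            # locally most probable samples
        keep[rows, topk] = True
    return keep & (ray_occ > min_prob)                         # drop picks that are still near-empty

# Toy usage: 4096 rays, 48 depth samples each
mask = sparsify_along_rays(torch.rand(4096, 48))
```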
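Finally, the residual strategy amounts to upsampling the coarse TSDF and predicting only a correction at the finer scale. The sketch below assumes a hypothetical residual_head (any small 3D network) and standard trilinear upsampling; it illustrates the idea rather than the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def refine_tsdf(coarse_tsdf: torch.Tensor, fine_feats: torch.Tensor,
                residual_head: torch.nn.Module) -> torch.Tensor:
    """Upsample the coarse TSDF and add a predicted residual instead of regressing from scratch."""
    # coarse_tsdf: (1, 1, D, H, W); fine_feats: (1, C, 2D, 2H, 2W)
    upsampled = F.interpolate(coarse_tsdf, scale_factor=2, mode='trilinear', align_corners=False)
    residual = torch.tanh(residual_head(fine_feats))           # bounded correction in [-1, 1]
    return torch.clamp(upsampled + residual, -1.0, 1.0)        # keep the truncated SDF range valid

# Toy usage: a 1x1x1 3D conv stands in for the residual prediction head
head = torch.nn.Conv3d(16, 1, kernel_size=1)
fine_tsdf = refine_tsdf(torch.rand(1, 1, 8, 8, 8) * 2 - 1,
                        torch.randn(1, 16, 16, 16, 16), head)  # -> (1, 1, 16, 16, 16)
```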
Methodology
Structured as a coarse-to-fine pipeline, VisFusion proceeds through several key stages. Image features from each view are back-projected into a 3D feature volume; each voxel is then projected into the camera views, and the resulting per-view features are fused locally using the visibility weights computed from their pairwise similarities. For sparsification, the visual ray through each image pixel is analyzed to identify its most likely occupied voxels, which keeps thin structures in the volume. Finally, the system incrementally integrates each video fragment into a global feature volume, using a Gated Recurrent Unit (GRU) to fuse information over time.
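As a rough illustration of that temporal fusion step, the sketch below keeps one hidden feature per global voxel and runs a GRU update for the voxels touched by each incoming fragment; the class name, feature size, and flat-index bookkeeping are assumptions made for the example rather than details from the paper.

```python
import torch

class GlobalFusion(torch.nn.Module):
    """Keep one hidden feature per global voxel and fuse each new fragment with a GRU step."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.gru = torch.nn.GRUCell(feat_dim, feat_dim)

    def forward(self, local_feats: torch.Tensor, indices: torch.Tensor,
                global_state: torch.Tensor) -> torch.Tensor:
        # local_feats:  (N, C) fused features of the current fragment
        # indices:      (N,) flat ids of the global voxels those features belong to
        # global_state: (M, C) hidden state of the whole scene volume
        hidden = global_state[indices]                         # history of the voxels seen now
        updated = self.gru(local_feats, hidden)                # blend new observation with history
        global_state = global_state.clone()
        global_state[indices] = updated                        # write back only the visited voxels
        return global_state

# Toy usage: a 10k-voxel scene updated by a 500-voxel fragment
fusion = GlobalFusion(32)
state = torch.zeros(10_000, 32)
state = fusion(torch.randn(500, 32), torch.randint(0, 10_000, (500,)), state)
```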
Experimental Results
The paper reports quantitative and qualitative improvements over state-of-the-art methods on popular benchmarks such as ScanNet and 7-Scenes. VisFusion achieves notable reductions in Chamfer distance compared to previous online feature-fusion techniques, indicating a better balance between accuracy and completeness, and it consistently achieves higher F-scores, underlining its ability to produce detailed and coherent reconstructions.
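For readers unfamiliar with these metrics, the sketch below shows the common definitions of Chamfer distance and F-score between predicted and ground-truth point clouds (in practice sampled from the meshes); the 5 cm threshold follows the usual ScanNet evaluation convention and is not necessarily the paper's exact protocol.

```python
import torch

def eval_points(pred: torch.Tensor, gt: torch.Tensor, tau: float = 0.05):
    """pred, gt: (P, 3) and (G, 3) point sets in metres; tau is the F-score distance threshold."""
    d = torch.cdist(pred, gt)                     # (P, G) pairwise Euclidean distances
    acc = d.min(dim=1).values                     # predicted point -> nearest GT (accuracy term)
    comp = d.min(dim=0).values                    # GT point -> nearest prediction (completeness term)
    chamfer = 0.5 * (acc.mean() + comp.mean())    # symmetric Chamfer distance
    precision = (acc < tau).float().mean()
    recall = (comp < tau).float().mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer.item(), fscore.item()

# Toy usage with random points (real evaluation samples points from the reconstructed meshes)
cd, f = eval_points(torch.rand(2000, 3), torch.rand(2500, 3))
```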
Implications and Future Work
The proposed visibility-aware approach and adaptive sparsification are practical enhancements for real-world applications such as augmented reality and robotics, where real-time performance and reconstruction quality are critical. The paper suggests continuing this line of inquiry by integrating depth estimation techniques into the end-to-end volumetric framework to further improve reconstruction sharpness and coherence. Such advancements could bridge the gap between image-level depth predictions and volumetric surface inference, leading to even richer 3D representations.
In summary, VisFusion introduces significant methodological advancements for online 3D scene reconstruction by explicitly incorporating visibility considerations and adopting a more principled approach to voxel sparsification, setting a foundation for future explorations in efficient and precise 3D modeling from video data.