- The paper presents a novel visibility-aware feature fusion method that computes similarity from 2D projections to accurately reconstruct 3D scenes.
- It introduces ray-based voxel sparsification that locally selects occupied voxels along visual rays, effectively preserving fine structural details.
- The residual learning strategy for TSDF prediction refines reconstructions across scales, helping the method outperform state-of-the-art approaches in accuracy and completeness on standard benchmarks.
VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos
The paper "VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos" presents a novel approach for online 3D scene reconstruction utilizing posed monocular RGB videos. The primary focus of this research is on enhancing the feature fusion process by incorporating visibility awareness into the reconstruction pipeline, addressing limitations of previous methods that did not explicitly account for voxel visibility.
Key Contributions
The authors propose three major contributions:
- Visibility-aware Feature Fusion: Unlike prior works, this paper introduces an explicit visibility prediction mechanism for voxel feature aggregation. By computing a similarity matrix from the 2D projections of each voxel's features across the view images, the model infers per-view visibility weights. Because visibility is directly supervised with ground truth, the fusion step can downweight occluded views and recover more accurate surface geometry (a minimal code sketch of this fusion step follows this list).
- Ray-based Voxel Sparsification: The paper critiques past volumetric approaches that apply a single global threshold for voxel sparsification, which often over-sparsifies the volume and loses fine detail. The proposed solution sparsifies locally along each visual ray: within ray-based local windows, only the most probable occupied voxels are kept, preserving structural detail even in thin and complex regions (also sketched in code after this list).
- Residual Learning for TSDF Prediction: Extending the coarse-to-fine prediction methodology, the model learns residuals between the TSDFs of successive scales. Predicting only the TSDF residual at each finer level improves estimation accuracy and yields smoother refinements of the reconstructed 3D model (see the third sketch after this list).
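To make the fusion step concrete, here is a minimal PyTorch-style sketch, assuming per-voxel features have already been gathered from V views into a tensor of shape (V, N, C). The paper learns visibility weights from the pairwise similarity matrix with ground-truth supervision; this sketch instead uses each view's mean similarity to the other views as a stand-in score, so the function name and the scoring rule are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fuse_with_visibility(feats: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Fuse multi-view voxel features using similarity-derived visibility weights (sketch)."""
    V, N, C = feats.shape
    normed = F.normalize(feats, dim=-1)                       # unit-length features per view
    sim = torch.einsum('vnc,wnc->nvw', normed, normed)        # (N, V, V) pairwise cosine similarity
    score = sim.mean(dim=-1)                                   # (N, V): agreement with the other views
    weights = torch.softmax(score / temperature, dim=-1)       # (N, V), sums to 1 over the views
    fused = torch.einsum('nv,vnc->nc', weights, feats)         # visibility-weighted average per voxel
    return fused

# Toy usage: 5 views, 1000 voxels, 32-dim features -> (1000, 32) fused features
fused = fuse_with_visibility(torch.randn(5, 1000, 32))
```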
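The ray-based sparsification can be illustrated in the same spirit. The sketch below assumes occupancy probabilities already sampled along each visual ray (shape (R, S)) and keeps only the most probable samples inside fixed local windows along the ray, rather than applying one global threshold; the window size, the count kept per window, and the probability floor are illustrative values, not parameters from the paper.

```python
import torch

def sparsify_along_rays(ray_occ: torch.Tensor, window: int = 9,
                        keep_per_window: int = 2, min_prob: float = 0.05) -> torch.Tensor:
    """Return a boolean mask (R, S) keeping the locally most probable samples on each ray."""
    R, S = ray_occ.shape
    keep = torch.zeros_like(ray_occ, dtype=torch.bool)
    rows = torch.arange(R).unsqueeze(1)                        # (R, 1) row indices for fancy indexing
    for start in range(0, S, window):                          # non-overlapping windows along depth
        chunk = ray_occ[:, start:start + window]               # (R, w) occupancy inside the window
        k = min(keep_per_window, chunk.shape[1])
        topk = chunk.topk(k, dim=1).indices + start            # locally most probable samples
        keep[rows, topk] = True
    return keep & (ray_occ > min_prob)                         # drop picks that are still near-empty

# Toy usage: 4096 rays, 48 depth samples each
mask = sparsify_along_rays(torch.rand(4096, 48))
```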
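Finally, the residual strategy amounts to upsampling the coarse TSDF and predicting only a correction at the finer scale. The sketch below assumes a hypothetical residual_head (any small 3D network) and standard trilinear upsampling; it illustrates the idea rather than the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def refine_tsdf(coarse_tsdf: torch.Tensor, fine_feats: torch.Tensor,
                residual_head: torch.nn.Module) -> torch.Tensor:
    """Upsample the coarse TSDF and add a predicted residual instead of regressing from scratch."""
    # coarse_tsdf: (1, 1, D, H, W); fine_feats: (1, C, 2D, 2H, 2W)
    upsampled = F.interpolate(coarse_tsdf, scale_factor=2, mode='trilinear', align_corners=False)
    residual = torch.tanh(residual_head(fine_feats))           # bounded correction in [-1, 1]
    return torch.clamp(upsampled + residual, -1.0, 1.0)        # keep the truncated SDF range valid

# Toy usage: a 1x1x1 3D conv stands in for the residual prediction head
head = torch.nn.Conv3d(16, 1, kernel_size=1)
fine_tsdf = refine_tsdf(torch.rand(1, 1, 8, 8, 8) * 2 - 1,
                        torch.randn(1, 16, 16, 16, 16), head)  # -> (1, 1, 16, 16, 16)
```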
Methodology
Structured as a coarse-to-fine pipeline, VisFusion proceeds through several key stages. Image features from each view are back-projected into a 3D feature volume; each voxel is then projected into the camera views, and the resulting per-view features are fused locally using the visibility weights computed from their pairwise similarities. For sparsification, the visual ray through each image pixel is analyzed to identify its most likely occupied voxels, which keeps thin structures in the volume. Finally, the system incrementally integrates each video fragment into a global feature volume, using a Gated Recurrent Unit (GRU) to fuse information over time.
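As a rough illustration of that temporal fusion step, the sketch below keeps one hidden feature per global voxel and runs a GRU update for the voxels touched by each incoming fragment; the class name, feature size, and flat-index bookkeeping are assumptions made for the example rather than details from the paper.

```python
import torch

class GlobalFusion(torch.nn.Module):
    """Keep one hidden feature per global voxel and fuse each new fragment with a GRU step."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.gru = torch.nn.GRUCell(feat_dim, feat_dim)

    def forward(self, local_feats: torch.Tensor, indices: torch.Tensor,
                global_state: torch.Tensor) -> torch.Tensor:
        # local_feats:  (N, C) fused features of the current fragment
        # indices:      (N,) flat ids of the global voxels those features belong to
        # global_state: (M, C) hidden state of the whole scene volume
        hidden = global_state[indices]                         # history of the voxels seen now
        updated = self.gru(local_feats, hidden)                # blend new observation with history
        global_state = global_state.clone()
        global_state[indices] = updated                        # write back only the visited voxels
        return global_state

# Toy usage: a 10k-voxel scene updated by a 500-voxel fragment
fusion = GlobalFusion(32)
state = torch.zeros(10_000, 32)
state = fusion(torch.randn(500, 32), torch.randint(0, 10_000, (500,)), state)
```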
Experimental Results
The paper reports quantitative and qualitative improvements over state-of-the-art methods on popular benchmarks such as ScanNet and 7-Scenes. VisFusion achieves notable reductions in Chamfer distance compared to previous online feature-fusion techniques, indicating a better balance between accuracy and completeness, and it consistently achieves higher F-scores, underlining its ability to produce detailed and coherent reconstructions.
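For readers unfamiliar with these metrics, the sketch below shows the common definitions of Chamfer distance and F-score between predicted and ground-truth point clouds (in practice sampled from the meshes); the 5 cm threshold follows the usual ScanNet evaluation convention and is not necessarily the paper's exact protocol.

```python
import torch

def eval_points(pred: torch.Tensor, gt: torch.Tensor, tau: float = 0.05):
    """pred, gt: (P, 3) and (G, 3) point sets in metres; tau is the F-score distance threshold."""
    d = torch.cdist(pred, gt)                     # (P, G) pairwise Euclidean distances
    acc = d.min(dim=1).values                     # predicted point -> nearest GT (accuracy term)
    comp = d.min(dim=0).values                    # GT point -> nearest prediction (completeness term)
    chamfer = 0.5 * (acc.mean() + comp.mean())    # symmetric Chamfer distance
    precision = (acc < tau).float().mean()
    recall = (comp < tau).float().mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer.item(), fscore.item()

# Toy usage with random points (real evaluation samples points from the reconstructed meshes)
cd, f = eval_points(torch.rand(2000, 3), torch.rand(2500, 3))
```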
Implications and Future Work
The proposed visibility-aware approach and adaptive sparsification are practical enhancements for real-world applications such as augmented reality and robotics, where real-time performance and reconstruction quality are critical. The paper suggests continuing this line of inquiry by integrating depth estimation techniques into the end-to-end volumetric framework to further improve reconstruction sharpness and coherence. Such advancements could bridge the gap between image-level depth predictions and volumetric surface inference, leading to even richer 3D representations.
In summary, VisFusion introduces significant methodological advancements for online 3D scene reconstruction by explicitly incorporating visibility considerations and adopting a more principled approach to voxel sparsification, setting a foundation for future explorations in efficient and precise 3D modeling from video data.