Visual Geometry Grounded Deep Structure From Motion

Published 7 Dec 2023 in cs.CV and cs.RO | (2312.04563v1)

Abstract: Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.


Summary

The paper introduces a novel fully differentiable deep learning pipeline, VGGSfM, for end-to-end structure-from-motion.
The method employs deep 2D point tracking and simultaneous camera pose recovery to enhance 3D reconstruction accuracy.
The approach integrates a differentiable bundle adjustment layer and reports state-of-the-art results on CO3D, IMC Phototourism, and ETH3D.

The paper "Visual Geometry Grounded Deep Structure From Motion" addresses structure-from-motion (SfM): reconstructing the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical SfM frameworks solve the problem incrementally, through keypoint detection and matching, image registration, triangulation of 3D points, and bundle adjustment. Recent work has used deep learning to improve individual steps such as keypoint matching, but the overall pipeline remains non-differentiable, so it cannot be trained end to end.
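To make the classical matching step concrete: pairwise matches between image pairs are usually chained into multi-view tracks before triangulation, and this is the chaining that the abstract says VGGSfM avoids. The following is a minimal sketch of that chaining using a union-find structure. The function name `chain_matches` and the `(image_id, keypoint_id)` observation format are illustrative assumptions, not taken from the paper or from any particular SfM library.

```python
def chain_matches(pairwise_matches):
    """Group pairwise keypoint matches into multi-view tracks.

    Each match is a pair of observations, and each observation is a
    hypothetical (image_id, keypoint_id) tuple. Observations connected
    through any sequence of matches end up in the same track.
    """
    parent = {}

    def find(x):
        # Register unseen observations as their own root.
        parent.setdefault(x, x)
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for obs_a, obs_b in pairwise_matches:
        union(obs_a, obs_b)

    # Collect observations by their root to form tracks.
    tracks = {}
    for obs in list(parent):
        tracks.setdefault(find(obs), []).append(obs)
    return [sorted(track) for track in tracks.values()]


if __name__ == "__main__":
    matches = [
        ((0, 5), (1, 7)),  # keypoint 5 in image 0 matches keypoint 7 in image 1
        ((1, 7), (2, 3)),  # which in turn matches keypoint 3 in image 2
        ((0, 9), (2, 4)),  # an unrelated two-view track
    ]
    for track in sorted(chain_matches(matches)):
        print(track)
```

A single wrong pairwise match merges two unrelated tracks in this scheme, which is one reason a pipeline might prefer to predict tracks directly across all frames.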

The primary contribution of the paper is the introduction of VGGSfM, a novel deep learning-based pipeline for SfM where all components are fully differentiable. This end-to-end approach allows the model to be trained holistically, optimizing all parameters simultaneously rather than in isolated steps.

Key Innovations and Mechanisms Introduced:

Deep 2D Point Tracking: The authors utilize advancements in deep 2D point tracking to extract pixel-accurate tracks directly, removing the dependency on chaining pairwise keypoint matches typically required in classical methods.
Simultaneous Camera Recovery: Instead of the traditional incremental approach, VGGSfM recovers all camera poses simultaneously. This is achieved using both the image data and track features, which enhances the coherence and accuracy of the camera pose estimations.
Differentiable Bundle Adjustment: The cameras are optimised and the 3D points triangulated via a differentiable bundle adjustment layer. Because gradients can flow through this layer, the entire pipeline can be trained end to end.
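As a rough illustration of what bundle adjustment minimises, the sketch below evaluates the reprojection error of a single 3D point under a simple pinhole model and refines the point by gradient descent. Everything here is an assumption made for illustration: the camera parameterisation `(R, t, f)`, the function names, and the use of finite-difference gradients in place of automatic differentiation. It does not reproduce the paper's layer, which also optimises the cameras and is trained as part of the network.

```python
def project(point, R, t, f):
    """Project a 3D point with a pinhole camera (rotation R, translation t, focal f)."""
    # Transform the point into the camera frame: x_cam = R @ point + t.
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective division followed by scaling with the focal length.
    return (f * cam[0] / cam[2], f * cam[1] / cam[2])


def reprojection_error(point, cameras, observations):
    """Sum of squared distances between projected and observed 2D positions."""
    total = 0.0
    for (R, t, f), (u, v) in zip(cameras, observations):
        pu, pv = project(point, R, t, f)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


def refine_point(point, cameras, observations, steps=200, lr=0.5, eps=1e-6):
    """Refine a 3D point by gradient descent on the reprojection error.

    Central finite differences stand in for the analytic or automatic
    gradients that a differentiable layer would provide.
    """
    p = list(point)
    for _ in range(steps):
        grad = []
        for k in range(3):
            hi = p[:]
            lo = p[:]
            hi[k] += eps
            lo[k] -= eps
            diff = (reprojection_error(hi, cameras, observations)
                    - reprojection_error(lo, cameras, observations))
            grad.append(diff / (2 * eps))
        p = [p[k] - lr * grad[k] for k in range(3)]
    return p


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Two hypothetical cameras one unit apart along the x axis.
    cameras = [(identity, [0.0, 0.0, 0.0], 1.0), (identity, [-1.0, 0.0, 0.0], 1.0)]
    true_point = [0.5, 0.2, 4.0]
    observations = [project(true_point, *cam) for cam in cameras]

    start = [0.4, 0.1, 3.5]
    refined = refine_point(start, cameras, observations)
    print("error before:", reprojection_error(start, cameras, observations))
    print("error after: ", reprojection_error(refined, cameras, observations))
```

Full bundle adjustment optimises camera parameters and all 3D points jointly, typically with a second-order solver rather than plain gradient descent; the sketch keeps the cameras fixed to stay short.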

Performance and Evaluation:

The paper reports state-of-the-art performance for VGGSfM on three popular datasets: CO3D, IMC Phototourism, and ETH3D. The comparison covers both classical non-differentiable pipelines and recent methods that apply deep learning only to individual pipeline components.

In conclusion, VGGSfM replaces the classical incremental SfM pipeline with a fully differentiable one that can be trained end to end. Its results on the three benchmarks suggest the approach is promising for accurate and reliable 3D reconstruction from 2D images.
