- The paper presents FlowMap, an end-to-end differentiable framework that recovers camera poses, intrinsics, and dense depth maps via gradient descent, supervised only by off-the-shelf optical flow and point-track correspondences.
- It pairs a neural network that predicts per-pixel depth with a closed-form, differentiable Procrustes solve that aligns unprojected points across frames to recover camera poses.
- Experiments show FlowMap performing on par with state-of-the-art SfM pipelines like COLMAP for downstream novel view synthesis, while converging faster than prior gradient-descent baselines and opening new opportunities for self-supervised 3D reconstruction.
FlowMap: Leveraging Gradient Descent for High-Quality Camera Pose, Intrinsics, and Dense Depth Estimation from Video Sequences
Introduction
The paper introduces FlowMap, a framework for recovering camera poses, intrinsics, and dense depth maps from video by running gradient descent on per-video optical flow and point-track correspondences. This approach challenges traditional Structure-from-Motion (SfM) methods by providing an end-to-end differentiable pipeline that can be embedded in deep learning workflows, enabling new applications in self-supervised learning for 3D reconstruction and beyond.
Methodology
FlowMap's central design choice is how it parameterizes depth, camera poses, and intrinsics: each is computed as a differentiable function of the input video, so the full pipeline can be optimized with first-order methods. Its components are:
- Depth Parameterization: A neural network predicts per-pixel depth maps from RGB frames. Because depth is the output of a network rather than a free per-pixel variable, the model can exploit learned regularities across video frames, improving depth estimates under challenging conditions (a minimal sketch of this parameterization follows this list).
- Pose Estimation: Relative poses between consecutive frames are obtained in closed form by solving a differentiable Procrustes problem: points unprojected with the predicted depth are matched across frames via optical flow, and the least-squares rigid transform aligning them is computed analytically (see the Procrustes sketch after this list).
- Camera Intrinsics: The focal length is estimated by evaluating a pre-defined set of plausible candidates and softly selecting those that minimize re-projection error through a differentiable selection mechanism (see the focal-length sketch after this list).
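To make the depth parameterization concrete, here is a minimal, illustrative PyTorch sketch: a tiny convolutional network mapping an RGB frame to a positive per-pixel depth map. FlowMap's actual architecture is more sophisticated; only the parameterization idea (depth as the positive output of a network applied to each frame) is taken from the paper.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Illustrative stand-in for FlowMap's depth network (not the paper's architecture)."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W) -> depth: (B, 1, H, W); softplus keeps depth positive.
        return nn.functional.softplus(self.net(rgb))
```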
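The Procrustes solve has a closed form via the SVD, which PyTorch can differentiate through. The sketch below implements a weighted Kabsch-style alignment of the kind described above; the function signature, weighting scheme, and tensor shapes are illustrative assumptions, not the authors' code.

```python
import torch

def procrustes_align(p: torch.Tensor, q: torch.Tensor, w: torch.Tensor):
    """Find rotation R and translation t minimizing sum_i w_i * ||R p_i + t - q_i||^2.

    p, q: (N, 3) corresponding 3D points (unprojected via predicted depth,
    matched by optical flow). w: (N,) confidence weights.
    """
    w = w / w.sum()
    p_mean = (w[:, None] * p).sum(dim=0)   # weighted centroids
    q_mean = (w[:, None] * q).sum(dim=0)
    p_c, q_c = p - p_mean, q - q_mean      # centered point clouds
    cov = (w[:, None] * p_c).T @ q_c       # 3x3 weighted cross-covariance
    U, S, Vt = torch.linalg.svd(cov)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = torch.sign(torch.linalg.det(Vt.T @ U.T))
    ones = torch.ones_like(d)
    D = torch.diag(torch.stack([ones, ones, d]))
    R = Vt.T @ D @ U.T
    t = q_mean - R @ p_mean
    return R, t
```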
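For the intrinsics, one way to realize a differentiable selection is a softmax over candidate focal lengths, weighted by how well each candidate explains the observed flow. This is a hedged sketch: the temperature value and the idea of passing in precomputed per-candidate errors (e.g., from a flow-reprojection loss like the one sketched below) are illustrative choices.

```python
import torch

def select_focal_length(candidates: torch.Tensor, errors: torch.Tensor,
                        temperature: float = 100.0) -> torch.Tensor:
    """Softmax-weighted combination of candidate focal lengths.

    candidates: (K,) plausible focal lengths; errors: (K,) per-candidate
    reprojection errors. Lower error -> higher weight. The softmax keeps
    the choice differentiable, so gradients flow back through the errors.
    """
    weights = torch.softmax(-temperature * errors, dim=0)
    return (weights * candidates).sum()
```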
FlowMap combines these estimates into a single optimization over each video sequence, supervised solely by correspondences produced by off-the-shelf optical flow and point-tracking methods.
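Concretely, this supervision can be expressed as a flow-reprojection loss: the predicted depth, the Procrustes pose, and the selected intrinsics together induce a flow field between consecutive frames, which is compared against the precomputed optical flow. The sketch below assumes simple pinhole conventions and tensor shapes of our choosing; it illustrates the loss structure rather than reproducing the paper's implementation.

```python
import torch

def induced_flow_loss(depth, R, t, K, flow_obs):
    """Compare the flow induced by (depth, pose, intrinsics) with observed flow.

    depth: (H, W) predicted depth for frame i; (R, t): relative pose i -> i+1;
    K: (3, 3) pinhole intrinsics; flow_obs: (H, W, 2) optical flow i -> i+1.
    Every input is differentiable, so minimizing this loss trains the depth
    network and drives the pose and intrinsics estimates.
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (H, W, 3) homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T                        # unprojected ray directions
    pts = rays * depth[..., None]                             # 3D points in frame i
    pts2 = pts @ R.T + t                                      # transform into frame i+1
    proj = pts2 @ K.T                                         # project with intrinsics
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)       # pixel coordinates in frame i+1
    flow_pred = uv - torch.stack([xs, ys], dim=-1)            # induced flow field
    return (flow_pred - flow_obs).abs().mean()
```

Running gradient descent on this loss simultaneously trains the depth network and, through the differentiable Procrustes and intrinsics modules, refines the pose and focal-length estimates.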
Evaluation
FlowMap is evaluated experimentally on standard datasets. For downstream novel view synthesis it performs on par with COLMAP, a state-of-the-art SfM pipeline, while converging faster and more reliably than other gradient-descent-based methods. The experiments combine qualitative assessments, such as visual inspection of the recovered depth maps and 3D reconstructions, with quantitative metrics such as pose error and the Structural Similarity Index Measure (SSIM) on synthesized views.
Discussions and Future Work
The introduction of FlowMap represents a substantial shift in handling video-based 3D reconstruction by making the entire SfM pipeline differentiable and trainable. This capability opens up new avenues for integrating SfM deeper into machine learning frameworks, potentially enhancing tasks that rely on geometric understanding.
Future work could explore several potential improvements and expansions:
- Enhanced Depth Networks: Incorporating more advanced network architectures might yield better depth predictions and finer details in reconstructions.
- Robustness to Scene Dynamics: Adapting FlowMap to handle dynamic scenes where objects within the frames move independently of the camera could dramatically broaden its applicability.
- Efficiency Improvements: There are opportunities to optimize the computational efficiency of FlowMap, making it feasible for longer sequences or higher-resolution videos.
Conclusion
FlowMap sets a precedent for video-based camera and scene-geometry estimation. By leveraging differentiable programming and end-to-end optimization, it holds promise for future research and applications in 3D scene understanding and reconstruction from video.