Analysis of "AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos"
The paper "AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos" introduces an innovative approach to address a longstanding issue in computer vision—estimating camera motion and intrinsics directly from dynamic and casual video sequences. This research provides a robust alternative to traditional SfM and SLAM systems, especially when applied to videos with varying motion patterns and presence of dynamic objects.
Overview
AnyCam leverages a transformer-based model to predict camera poses and intrinsics in a feed-forward manner without requiring labeled data. It uses pre-trained monocular depth estimation (MDE) and optical flow networks as auxiliary inputs to guide the estimation process. Notably, uncertainty maps derived via a novel loss formulation help the model filter out dynamic objects that would otherwise introduce inconsistencies into motion estimation.
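The overall inference flow can be pictured as a single feed-forward pass over a clip. The sketch below is a minimal PyTorch illustration of that structure, not the authors' implementation: the convolutional frame encoder, the output parameterization (per-frame 6-DoF poses plus a shared focal length), and all names are assumptions made for illustration, and the per-pixel uncertainty head is omitted for brevity.

```python
import torch
import torch.nn as nn

class FeedForwardPoseEstimator(nn.Module):
    """Minimal sketch of a feed-forward pose/intrinsics estimator.

    Hypothetical architecture for illustration only; the real AnyCam model,
    its tokenization, and its output heads differ in detail.
    """

    def __init__(self, feat_dim=256, n_layers=4):
        super().__init__()
        # Per-frame encoder: maps an RGB frame plus auxiliary depth (1 ch)
        # and flow (2 ch) channels to a single feature token.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3 + 1 + 2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Transformer over the sequence of frame tokens.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Output heads: a 6-DoF pose per frame and one shared focal length.
        self.pose_head = nn.Linear(feat_dim, 6)    # axis-angle rotation + translation
        self.focal_head = nn.Linear(feat_dim, 1)   # shared intrinsic (focal length)

    def forward(self, frames, depths, flows):
        # frames: (B, T, 3, H, W); depths: (B, T, 1, H, W); flows: (B, T, 2, H, W)
        # (for simplicity, one flow field per frame is assumed here)
        B, T = frames.shape[:2]
        x = torch.cat([frames, depths, flows], dim=2)        # fuse auxiliary inputs
        tokens = self.frame_encoder(x.flatten(0, 1)).view(B, T, -1)
        tokens = self.temporal(tokens)                       # joint reasoning over the clip
        poses = self.pose_head(tokens)                       # (B, T, 6) camera poses
        focal = self.focal_head(tokens.mean(dim=1))          # (B, 1) shared focal length
        return poses, focal
```

With this structure the whole clip is processed jointly, which is what allows the model to learn priors over plausible trajectories rather than estimating each pose in isolation.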
Technical Contributions
- End-to-End Model Architecture: The use of transformers allows AnyCam to predict camera parameters by processing a sequence of frames concurrently, thereby learning powerful priors over plausible camera trajectories that are consistent with real-world dynamics.
- Multi-frame Processing: Unlike traditional approaches that rely on frame-by-frame optimization, the model processes a sequence of video frames jointly, exploiting temporal context to improve robustness against dynamic elements and noise.
- Uncertainty-Based Training Loss: The training regime uses an uncertainty-aware formulation that lets the model learn from unlabeled video data by dynamically down-weighting optical-flow inconsistencies caused by moving objects in a scene (a minimal sketch of such a loss follows this list).
- Test-Time Refinement: To mitigate drift in trajectory estimates over longer sequences, AnyCam integrates a lightweight bundle-adjustment-style refinement at test time, further improving the accuracy of camera pose predictions (see the refinement sketch after this list).
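A common way to realize such an uncertainty-aware objective (not necessarily the paper's exact formulation) is heteroscedastic weighting: the residual between the observed optical flow and the flow induced by predicted depth, pose, and intrinsics is scaled by a predicted per-pixel uncertainty, with a log-uncertainty term preventing the trivial solution of marking every pixel uncertain. Pixels on moving objects, where induced and observed flow disagree, are then naturally down-weighted. All names below are placeholders.

```python
import torch

def uncertainty_weighted_flow_loss(flow_observed, flow_induced, log_sigma):
    """Heteroscedastic (uncertainty-weighted) flow-consistency loss.

    flow_observed: (B, 2, H, W) flow from a pre-trained optical flow network.
    flow_induced:  (B, 2, H, W) flow implied by predicted depth, pose, and intrinsics.
    log_sigma:     (B, 1, H, W) predicted per-pixel log uncertainty.

    A generic formulation for illustration; AnyCam's actual loss may differ.
    """
    # Per-pixel flow residual between observed and camera-induced flow.
    residual = (flow_observed - flow_induced).abs().sum(dim=1, keepdim=True)
    # Down-weight pixels with high predicted uncertainty (e.g. moving objects),
    # while penalizing uncertainty itself so the model cannot ignore everything.
    loss = residual * torch.exp(-log_sigma) + log_sigma
    return loss.mean()
```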
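The test-time refinement can likewise be approximated by treating the feed-forward poses as an initialization and running a few gradient steps that minimize the same flow-consistency residual over the whole sequence. The snippet below uses plain Adam updates on pose parameters as a simplified stand-in for the paper's lightweight bundle-adjustment-style refinement; the `closure` callable is a placeholder.

```python
import torch

def refine_poses(initial_poses, closure, n_steps=50, lr=1e-3):
    """Gradient-based test-time refinement of camera poses.

    initial_poses: (T, 6) feed-forward pose estimates (axis-angle + translation).
    closure:       callable(poses) -> scalar loss, e.g. the uncertainty-weighted
                   flow-consistency loss accumulated over the sequence.

    Simplified stand-in for the lightweight refinement described in the paper.
    """
    poses = initial_poses.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([poses], lr=lr)
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = closure(poses)   # re-induce flow from refined poses and compare
        loss.backward()
        optimizer.step()
    return poses.detach()
```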
Results and Implications
The paper demonstrates the efficacy of AnyCam through comprehensive evaluations on established datasets such as Sintel and TUM-RGBD, and highlights its competitive performance against supervised visual odometry (VO) and SLAM systems that utilize ground truth intrinsics. The model achieves state-of-the-art results in these dynamic environments both qualitatively and quantitatively.
AnyCam effectively reduces trajectory errors and relative pose errors even when tested on datasets outside its training distribution, such as Waymo autonomous-driving sequences and Aria everyday-activity videos. This indicates strong generalization, which is crucial for deploying such models in real-world scenarios.
Practical and Theoretical Implications
From a practical standpoint, AnyCam opens up the possibility of using the vast amounts of unlabeled video available online to train 3D models, widening the scope of applications in areas such as AR/VR, robotics, and autonomous driving. Theoretically, the method underscores the importance of integrating uncertainty into deep learning systems for dynamic scene understanding, challenging the static-scene assumptions inherent in classical reconstruction approaches.
Future Outlook
The research sets a promising direction for future work, suggesting that context-aware priors embedded within transformer models can significantly advance the robustness and accuracy of motion prediction in diverse and challenging settings. Possible extensions include improving long-term stability and scale adaptability, broadening training data to a wider range of environments, and achieving strong results without relying on pre-trained depth or flow models.
In summary, "AnyCam" represents a substantial advance in the domain of 3D computer vision, offering insights into how modern AI frameworks can overcome limitations associated with traditional methodologies and adapt to rapidly changing and dynamic scenes in casual videos.