- The paper introduces Pow3R, a transformer-based model that enhances regression-based 3D reconstruction by incorporating diverse camera and scene priors (intrinsics, pose, depth) alongside images at test time.
- Pow3R employs a flexible architecture capable of processing multiple auxiliary data types through dedicated encoder and decoder modules, enabling capabilities beyond previous models, such as native-resolution inference and point-cloud completion.
- Experimental results demonstrate Pow3R's superior performance over state-of-the-art methods on benchmarks such as KITTI, Tanks and Temples, and NYUv2, improving multi-view depth estimation, pose estimation, and depth completion by effectively leveraging auxiliary information.
Overview of Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors
The paper introduces Pow3R, a model designed to extend the capabilities of regression-based 3D vision models by incorporating additional modalities at test time. The primary objective of Pow3R is to improve 3D predictions by integrating auxiliary camera and scene priors that previous models such as DUSt3R are unable to leverage.
Pow3R is a transformer-based architecture that accepts arbitrary combinations of auxiliary data, including camera intrinsics, relative pose, and dense or sparse depth, alongside the input images themselves. This marks a shift from the traditional approach in which models were restricted to a single input modality, typically RGB images. The ability to process these auxiliary modalities in conjunction with images lets Pow3R perform tasks with improved fidelity, and this modularity enables operations such as inference at native image resolution and point-cloud completion, beyond the capabilities of predecessors like DUSt3R.
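To make this input flexibility concrete, the sketch below shows how such a model might be called with any subset of priors. The function and argument names are hypothetical illustrations, not the released Pow3R API.

```python
import torch

# Hypothetical interface sketch: the model takes two RGB views plus an
# optional dictionary of priors; any subset of priors may be omitted.
def run_pow3r_style_model(model, img1, img2, priors=None):
    """img1, img2: (B, 3, H, W) tensors; priors: dict with optional keys
    'K1', 'K2' (3x3 intrinsics), 'rel_pose' (4x4 cam1->cam2 transform),
    and 'depth1', 'depth2' (dense or sparse depth maps, 0 = missing)."""
    priors = priors or {}
    with torch.no_grad():
        out = model(img1, img2, **priors)   # assumed call signature
    return out  # e.g. pointmaps and confidence maps for each view

# Usage: image-only inference vs. inference guided by known intrinsics.
# out_rgb_only = run_pow3r_style_model(model, img1, img2)
# out_with_K   = run_pow3r_style_model(model, img1, img2, {'K1': K1, 'K2': K2})
```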
Architecture and Methodology Insights
Pow3R extends the basic DUSt3R framework with additional pointmap-regression heads for each image pair, enabling prediction of the 3D scene in multiple camera coordinate systems. Its dual-branch decoder design injects auxiliary information through dedicated modules, folding intrinsics and depth data seamlessly into the inference pipeline.
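As a rough illustration of pointmap regression in multiple camera frames, here is a minimal linear-head sketch; the actual heads in Pow3R and DUSt3R may differ in design, and the dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class PointmapHead(nn.Module):
    """Minimal regression head sketch: maps per-patch decoder features to a
    3D point and a confidence value for every pixel of each patch."""
    def __init__(self, dim, patch=16):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(dim, patch * patch * 4)  # (x, y, z, conf)

    def forward(self, tokens, H, W):
        B, N, _ = tokens.shape
        out = self.proj(tokens)                         # (B, N, p*p*4)
        out = out.view(B, H // self.patch, W // self.patch,
                       self.patch, self.patch, 4)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, 4, H, W)
        return out[:, :3], out[:, 3]                    # pointmap, confidence

# In a DUSt3R-style two-branch decoder, branch 1 predicts view 1's pointmap in
# the camera-1 frame and branch 2 predicts view 2's pointmap in the camera-1
# frame; an extra head of this kind can additionally regress view 2's pointmap
# expressed in its own camera-2 frame.
```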
The Siamese encoder processes the two views with shared weights; auxiliary data is embedded using learned multi-layer perceptrons (MLPs) and injected into the encoder blocks. Pow3R's decoder relies on cross-attention between the two branches and integrates relative-pose information to strengthen 3D reconstruction. Design choices such as the inject-1 strategy, in which auxiliary information is injected only once rather than at every block, keep the model computationally efficient while retaining the accuracy gains.
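A minimal sketch of how a camera prior could be embedded by an MLP and injected once into the token sequence follows. This is an interpretation of the inject-1 idea; the module names and the concatenation scheme are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class PriorEmbedder(nn.Module):
    """Embed a flattened camera prior (e.g. 3x3 intrinsics or 4x4 pose)
    into a single auxiliary token with a small MLP."""
    def __init__(self, in_dim, embed_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, prior):                  # prior: (B, in_dim)
        return self.mlp(prior).unsqueeze(1)    # (B, 1, D) auxiliary token

def inject_once(image_tokens, K=None, rel_pose=None,
                embed_K=None, embed_pose=None):
    """Inject-1 style sketch: auxiliary tokens are appended to the patch
    tokens only once, before the first block, rather than at every block."""
    extra = []
    if K is not None:
        extra.append(embed_K(K.flatten(1)))            # 3x3 intrinsics -> 9-d
    if rel_pose is not None:
        extra.append(embed_pose(rel_pose.flatten(1)))  # 4x4 pose -> 16-d
    if extra:
        image_tokens = torch.cat([image_tokens] + extra, dim=1)
    return image_tokens
```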
Strong Numerical Results and Claims
Experimental results demonstrate Pow3R's superiority over DUSt3R in various tasks. For multi-view depth estimation, Pow3R, when guided by intrinsics and relative poses, achieves state-of-the-art results across diverse benchmarks, including KITTI and Tanks and Temples. Notably, even without auxiliary data, Pow3R performs comparably to DUSt3R, indicating its robustness. Multi-view pose estimation benefits from the model's ability to predict pointmaps in both camera frames, providing significant speedups for relative pose estimation compared to traditional RANSAC-based PnP.
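With the same view's pointmap available in both camera coordinate frames, the relative pose can be recovered in closed form via a Procrustes/Umeyama alignment instead of RANSAC-based PnP. Below is a minimal unweighted NumPy sketch of that alignment; the paper's actual procedure (for instance, confidence weighting) may differ.

```python
import numpy as np

def umeyama_alignment(X, Y):
    """Closed-form similarity transform (s, R, t) minimising ||s*R@x + t - y||.
    X, Y: (N, 3) corresponding 3D points, e.g. the same view's pointmap
    expressed in camera-2 and camera-1 coordinates respectively."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    cov = Yc.T @ Xc / X.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # enforce a proper rotation
        S[2, 2] = -1
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / X.shape[0]
    s = np.trace(np.diag(D) @ S) / var_x
    t = mu_y - s * R @ mu_x
    return s, R, t   # relative pose (up to scale) mapping cam-2 to cam-1 frame
```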
Depth completion experiments confirm Pow3R's ability to exploit sparse depth information: it outperforms state-of-the-art methods on NYUv2 despite not being explicitly trained on that dataset, and achieves low absolute relative depth error across varying input conditions when auxiliary depth is provided.
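One plausible way to feed sparse depth to such a model is to pair each depth value with a validity mask before tokenization, so the network can distinguish missing measurements from true zeros. The sketch below illustrates this idea; it is an assumption about the mechanism, not Pow3R's released implementation.

```python
import torch
import torch.nn as nn

class SparseDepthEmbedder(nn.Module):
    """Sketch: turn a sparse depth map into per-patch tokens by stacking
    log-depth with a validity mask and projecting each patch with an MLP."""
    def __init__(self, embed_dim, patch=16):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(2 * patch * patch, embed_dim)

    def forward(self, sparse_depth):           # (B, 1, H, W), 0 = missing
        mask = (sparse_depth > 0).float()
        d = torch.log1p(sparse_depth) * mask   # log scale, zeroed where missing
        x = torch.cat([d, mask], dim=1)        # (B, 2, H, W)
        # split into non-overlapping patches -> (B, N, 2*p*p)
        x = nn.functional.unfold(x, self.patch, stride=self.patch).transpose(1, 2)
        return self.proj(x)                    # (B, N, D), added to image tokens
```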
Theoretical and Practical Implications
From a theoretical standpoint, Pow3R's ability to unify various camera and scene priors in a single feed-forward model signifies a potential paradigm shift toward more flexible and capable 3D perception models. It presents the prospect of generalizing transformer-based architectures for broader 3D applications, where diverse input modalities need to be seamlessly integrated. Practically, this not only enhances the scope of tasks that can be addressed by such models but also opens avenues for improving real-time processing and reducing computational requirements in field applications.
Future Prospects in AI and 3D Reconstruction
Future explorations could involve expanding this framework to encompass even more diverse input configurations or applying it in more dynamic environments, potentially incorporating real-time updates for continuous depth and pose estimation in dynamic scenes. Enhancements could also target greater accuracy under challenging conditions, such as occlusion or motion blur, thus pushing the boundaries for live 3D reconstruction in augmented reality or autonomous navigation systems.
In conclusion, Pow3R represents a meaningful advance in the field of 3D vision, showcasing the advantages of adaptable architectures capable of utilizing a variety of auxiliary information for richer and more precise 3D scene understanding. By doing so, Pow3R not only offers improved accuracy but also flexibility and efficiency that could drive future directions in AI research for 3D applications.