FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views (2502.12138v4)

Published 17 Feb 2025 in cs.CV

Abstract: We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining the inference efficiency (i.e., less than 0.5 seconds). The project page and code can be found at: https://zhanghe3z.github.io/FLARE/

Summary

  • The paper introduces a novel feed-forward model that accurately estimates coarse camera poses using a transformer-based neural predictor instead of traditional feature matching.
  • It employs a two-stage geometry reconstruction strategy, first predicting local camera-centric structures and then consolidating them into a global 3D representation.
  • FLARE achieves real-time inference (<0.5s) and improved accuracy on benchmarks, enhancing applications in AR, robotics, and autonomous navigation.

An Expert Overview of "FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views"

The paper "FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views" introduces a novel feed-forward model capable of recovering high-quality camera poses, 3D geometry, and scene appearance from uncalibrated sparse-view images. This work addresses significant challenges faced by traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) methods, which typically require extensive view overlap and accurate initial camera estimations that are not always present in real-world data collection scenarios.

Technical Approach

FLARE employs a cascaded learning paradigm, using camera pose estimation as the foundation for subsequent tasks. The initial phase estimates coarse camera poses from the sparse views with a transformer-based neural pose predictor. This approach forgoes conventional feature matching, which often fails under limited view overlap, in favor of direct pose regression through learned geometric priors.
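To make the contrast with matching-based pipelines concrete, direct regression can be sketched as pooling per-view image tokens and mapping them straight to a pose. This is a minimal illustration, not the paper's architecture: the pooling, dimensions, and linear head are all hypothetical stand-ins (a real transformer predictor would use attention layers), and the pose is parameterized here as a unit quaternion plus translation.

```python
import math, random

random.seed(0)

def regress_pose(view_tokens, weights, bias):
    """Directly regress a 7-D pose (unit quaternion + translation) from
    per-view feature tokens -- no cross-view feature matching involved.

    view_tokens: list of token vectors (lists of floats) for one view.
    weights:     7 x D matrix mapping the pooled feature to the pose.
    bias:        7-D bias vector.
    """
    d = len(view_tokens[0])
    # Mean-pool the tokens into a single view descriptor (a transformer
    # would attend over them; pooling keeps the sketch minimal).
    pooled = [sum(tok[i] for tok in view_tokens) / len(view_tokens)
              for i in range(d)]
    pose = [sum(w[i] * pooled[i] for i in range(d)) + b
            for w, b in zip(weights, bias)]
    quat, trans = pose[:4], pose[4:]
    norm = math.sqrt(sum(q * q for q in quat)) or 1.0
    return [q / norm for q in quat], trans

# Toy example: 16 random 8-D tokens for one view, random linear head.
D = 8
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(16)]
W = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(7)]
b = [0.0] * 7
quat, trans = regress_pose(tokens, W, b)
```

The key property the sketch shows is structural: the pose comes out of a single forward pass over one view's features, with no correspondence search between views.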

Following pose estimation, the model adopts a two-stage strategy for geometry reconstruction. It first predicts geometry in each camera's local coordinate frame, which simplifies learning by letting view-specific information guide the prediction. A subsequent global geometry projector then consolidates these local geometries into a cohesive global representation.
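The consolidation step rests on a simple rigid transform: a camera-centric point p is mapped into the shared world frame by the predicted camera-to-world rotation R and translation t, i.e. p_world = R p + t. The sketch below illustrates that transform only; in the paper the consolidation is a learned projector, not this bare formula.

```python
def to_global(local_points, R, t):
    """Map camera-centric 3D points into the global frame: p_w = R @ p + t.

    local_points: list of [x, y, z] points in the camera's frame.
    R: 3x3 rotation (list of rows); t: 3-vector translation.
    """
    out = []
    for p in local_points:
        rotated = [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]
        out.append([rotated[i] + t[i] for i in range(3)])
    return out

# Toy example: a 90-degree rotation about z, then a shift along x.
R_z90 = [[0.0, -1.0, 0.0],
         [1.0,  0.0, 0.0],
         [0.0,  0.0, 1.0]]
t = [2.0, 0.0, 0.0]
world = to_global([[1.0, 0.0, 0.0]], R_z90, t)
# [1, 0, 0] rotates to [0, 1, 0], then shifts to [2, 1, 0].
```

Applying each view's transform places all per-view point maps in one coordinate system, which is what makes the coarse poses from the first stage the "bridge" to a global reconstruction.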

For appearance modeling, FLARE reconstructs photorealistic views with 3D Gaussians that are initialized from the learned geometry. The appearance head incorporates features from large pre-trained encoders such as VGG to ensure high fidelity in rendered novel views.
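One plausible reading of this initialization, sketched here with illustrative field names that are not taken from the paper: every pixel contributes one Gaussian whose mean is that pixel's predicted 3D point and whose color is seeded from the image, with scale and opacity left for the network to refine.

```python
def init_gaussians(points, colors, default_scale=0.01, default_opacity=0.5):
    """Initialize one 3D Gaussian per pixel from predicted geometry.

    points: list of [x, y, z] 3D points (one per pixel).
    colors: list of [r, g, b] pixel colors, same length as points.
    Returns a list of dicts with typical Gaussian-splatting attributes.
    """
    assert len(points) == len(colors)
    return [
        {
            "mean": p,                     # Gaussian center = predicted 3D point
            "color": c,                    # seeded from the source pixel
            "scale": [default_scale] * 3,  # isotropic start, refined later
            "opacity": default_opacity,
        }
        for p, c in zip(points, colors)
    ]

pts = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.2]]
rgb = [[255, 0, 0], [0, 255, 0]]
gaussians = init_gaussians(pts, rgb)
```

Seeding the Gaussians from predicted geometry, rather than from random positions, is what lets a feed-forward model skip the lengthy per-scene optimization that Gaussian splatting pipelines usually require.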

Empirical Performance

The model's efficacy is demonstrated through extensive evaluations across several challenging datasets, including MegaDepth, ARKitScenes, and DL3DV. FLARE consistently outperforms existing approaches in tasks of pose estimation, geometry reconstruction, and novel-view synthesis. Specifically, on camera pose estimation tasks using the RealEstate10K dataset, FLARE achieves an AUC@30° of 84.6, a significant improvement over benchmark methods such as DUSt3R and PoseDiffusion. Similarly, for sparse-view 3D reconstruction, FLARE demonstrates superior accuracy and completeness on datasets like ETH3D and TUM RGBD.
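For context on the AUC@30° figure: this metric is typically computed by sweeping an angular-error threshold from 0° up to 30° and averaging the fraction of pose pairs whose error falls under each threshold. A hedged sketch follows; exact protocol details (step count, strict vs. non-strict comparison) vary between papers.

```python
def pose_auc(errors_deg, max_threshold=30, steps=30):
    """Area under the accuracy-vs-threshold curve for pose errors.

    errors_deg: per-pair angular pose errors in degrees.
    Accuracy at threshold t = fraction of errors below t; the AUC
    averages that accuracy over thresholds up to max_threshold.
    """
    n = len(errors_deg)
    accs = []
    for k in range(1, steps + 1):
        t = max_threshold * k / steps
        accs.append(sum(1 for e in errors_deg if e < t) / n)
    return sum(accs) / len(accs)

perfect = pose_auc([0.0, 0.0, 0.0])   # within every threshold -> 1.0
hopeless = pose_auc([90.0, 120.0])    # never under 30 degrees -> 0.0
```

An AUC@30° of 84.6 (i.e., 0.846 on this scale) thus means most pose errors lie well below 30°, since small errors contribute to every threshold in the sweep.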

Moreover, FLARE's feed-forward architecture keeps inference times under 0.5 seconds, a substantial speedup over optimization-reliant approaches such as DUSt3R and MASt3R, which depend on slower post-hoc alignment steps.

Implications and Future Directions

FLARE's approach presents meaningful applications in fields requiring flexible and efficient reconstruction pipelines, such as augmented reality, robotics, and autonomous navigation. The model's ability to deliver high-quality reconstructions from sparse and uncalibrated data expands its utility in scenarios where comprehensive environmental surveying is impractical or impossible.

The paper opens pathways for several future research directions. The scalability of FLARE to handle thin structures or highly complex scenes remains an open challenge. Additionally, integrating robust mechanisms to handle diverse camera trajectories and out-of-distribution scenarios could further enhance its generalizability. Future work may also explore multi-scale geometry representations to refine reconstruction details and integrate more elaborate rendering techniques to improve appearance fidelity.

In summary, FLARE stands out as an efficient and scalable solution to longstanding challenges in multi-view 3D reconstruction, with robust empirical results and promising implications for numerous practical applications in AI-driven perception and graphics.
