Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space (2507.00392v1)

Published 1 Jul 2025 in cs.CV

Abstract: Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named as Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.

Summary

  • The paper introduces a novel 3D-aware approach (L2M) that lifts single-view images to 3D, enhancing dense feature matching under challenging conditions.
  • It leverages monocular depth estimation and mesh reconstruction to synthesize novel views and enforce multi-view consistency in feature learning.
  • By generating large-scale synthetic training data, L2M achieves state-of-the-art zero-shot generalization across diverse benchmarks.

Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

"Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space" (2507.00392) addresses fundamental limitations in current dense feature matching frameworks by rethinking the interplay between 2D representation learning and 3D geometry. The paper introduces Lift to Match (L2M), a two-stage pipeline in which single-view images are lifted into 3D space to enable both encoder 3D-awareness and broad synthetic data generation for robust decoder learning.

Problem Setting and Motivation

Traditional dense feature matching approaches (e.g., DKM, RoMa, GIM) have achieved impressive results, but their reliance on multi-view datasets introduces both scalability and generalization barriers. Multi-view training data is costly to collect and limited in diversity, and prevalent feature encoders, pretrained on single-view 2D images, lack exposure to 3D scene consistency, making them fragile under viewpoint changes, occlusions, and geometric distortions. This is particularly limiting for applications such as SLAM, visual localization, and 3D reconstruction.

L2M is designed to overcome these constraints by:

  1. Lifting single-view images to 3D via monocular depth prediction and mesh reconstruction, enabling both novel-view synthesis and explicit geometry supervision.
  2. Injecting 3D geometry into the feature encoder through a differentiable 3D Gaussian scene representation, fostering multi-view consistency in the learned features.
  3. Utilizing the same 3D lifting for synthetic novel-view data generation, vastly expanding the diversity and coverage of training pairs for the matching decoder (a minimal sketch of the resulting two-stage training schedule follows this list).
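As a rough illustration of how the two stages fit together, the sketch below outlines the training schedule in PyTorch-style code. The encoder, decoder, data loaders, loss weighting, and optimizer settings are placeholder assumptions for exposition, not the authors' implementation.

```python
import torch

def train_l2m(encoder, decoder, stage1_loader, stage2_loader, epochs=1):
    """Illustrative two-stage L2M schedule (sketch; interfaces are assumed)."""
    # Stage 1: teach the encoder 3D-aware, multi-view-consistent features.
    opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
    for _ in range(epochs):
        for views, gaussian_feats in stage1_loader:
            # gaussian_feats: feature maps rendered from the optimized 3D feature Gaussians
            loss = (encoder(views) - gaussian_feats).abs().mean()  # per-pixel L1
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze the encoder, train the matching decoder on synthetic view pairs.
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
    for _ in range(epochs):
        for img_a, img_b, gt_flow, valid in stage2_loader:
            # gt_flow and valid come from the known geometry of the rendered pair
            flow = decoder(encoder(img_a), encoder(img_b))
            loss = ((flow - gt_flow).abs().sum(dim=1) * valid).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```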

Method

Lifting 2D to 3D for Novel-View Synthesis

The proposed pipeline begins with single-view images and uses state-of-the-art monocular depth estimation (e.g., Depth Anything V2) to produce dense depth maps. Each pixel is projected into 3D space using the inferred depth and randomized camera intrinsics, yielding a point cloud. A 3D mesh is then reconstructed (e.g., via Poisson reconstruction), which allows novel views to be rendered from arbitrary viewpoints and lighting conditions. Holes caused by disocclusion are filled with an inpainting model (e.g., Stable Diffusion v1.5), producing photorealistic, geometrically accurate image pairs with corresponding camera parameters.
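To make the lifting step concrete, the following sketch unprojects a monocular depth map into a 3D point cloud using a randomly sampled pinhole intrinsic matrix. It is an illustrative simplification: the field-of-view range and function names are assumptions, and the subsequent mesh reconstruction (e.g., Poisson) and diffusion-based inpainting are omitted.

```python
import numpy as np

def random_intrinsics(h, w, fov_deg=(40.0, 70.0)):
    """Sample a pinhole intrinsic matrix with a random horizontal field of view."""
    fov = np.deg2rad(np.random.uniform(*fov_deg))
    f = 0.5 * w / np.tan(0.5 * fov)  # focal length in pixels
    return np.array([[f, 0.0, w / 2.0],
                     [0.0, f, h / 2.0],
                     [0.0, 0.0, 1.0]])

def lift_to_pointcloud(depth, K):
    """Unproject every pixel of a depth map into 3D camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T           # normalized camera rays
    return rays * depth.reshape(-1, 1)        # scale rays by depth -> (N, 3) point cloud

# Example with a dummy constant depth map; in the pipeline the depth would come
# from a monocular estimator such as Depth Anything V2.
depth = np.full((480, 640), 2.0)
K = random_intrinsics(480, 640)
points = lift_to_pointcloud(depth, K)         # shape (480 * 640, 3)
```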

3D-Aware Encoder Learning with Feature Gaussians

Unlike typical 2D encoders, L2M incorporates 3D scene consistency via a 3D feature Gaussian representation. For each image, several novel views are rendered, and the associated DINOv2 features are used to optimize a set of 3D Gaussians representing the scene in both RGB and feature space. The encoder is then trained to reconstruct the multi-view feature maps rendered from these Gaussians, using a differentiable rasterizer and an upsampling path supervised by a per-pixel L1 loss. This process yields an encoder whose representations are explicitly multi-view and geometry-consistent, rather than merely local 2D descriptors.
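A minimal sketch of the stage-1 supervision is given below, assuming that feature maps rendered from the optimized 3D feature Gaussians are already available as tensors (the differentiable Gaussian rasterizer and the Gaussian optimization itself are treated as black boxes). Module names, feature dimensions, and resolutions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureHead(nn.Module):
    """Toy encoder head: backbone features are upsampled to the rendered-feature resolution."""
    def __init__(self, in_dim=768, out_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_dim, out_dim, kernel_size=1)

    def forward(self, backbone_feats, out_hw):
        x = self.proj(backbone_feats)
        return F.interpolate(x, size=out_hw, mode="bilinear", align_corners=False)

def stage1_loss(encoder_feats, rendered_gaussian_feats):
    """Per-pixel L1 loss between encoder features and features rendered from the 3D Gaussians."""
    return (encoder_feats - rendered_gaussian_feats).abs().mean()

# Dummy example: 4 rendered novel views of one scene.
backbone_feats = torch.randn(4, 768, 32, 32)   # e.g., ViT patch features
rendered_feats = torch.randn(4, 64, 128, 128)  # rasterized 3D feature Gaussians
head = FeatureHead()
loss = stage1_loss(head(backbone_feats, (128, 128)), rendered_feats)
loss.backward()
```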

Robust Decoder Learning with Large-Scale Synthetic Data

The decoder is trained separately, with the encoder frozen, using the large-scale synthetic dataset generated from the 3D lifting pipeline. For each base image, many image pairs are synthesized by varying camera viewpoint and lighting. Dense correspondence labels are derived directly from the known 3D geometry and rendering parameters. Training on this data enables the decoder to generalize across a wide range of scenes, appearance changes, and geometric transformations, overcoming the locality and bias of traditional multi-view datasets.
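Because depth, intrinsics, and camera poses of the rendered pairs are known exactly, dense ground-truth correspondences can be obtained by simple reprojection. The sketch below maps every pixel of view A into view B under these assumptions; visibility and occlusion checks against the target depth map, which a full pipeline would need, are omitted.

```python
import numpy as np

def dense_correspondences(depth_a, K, T_b_from_a):
    """Map every pixel of view A to its location in view B using known geometry.

    depth_a:    (H, W) depth of view A
    K:          (3, 3) shared pinhole intrinsics
    T_b_from_a: (4, 4) rigid transform taking A-camera coordinates to B-camera coordinates
    Returns an (H, W, 2) array of target pixel coordinates in view B.
    """
    h, w = depth_a.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    pts_a = (pix @ np.linalg.inv(K).T) * depth_a.reshape(-1, 1)   # unproject in A
    pts_a_h = np.concatenate([pts_a, np.ones((pts_a.shape[0], 1))], axis=1)
    pts_b = (pts_a_h @ T_b_from_a.T)[:, :3]                       # transform into B

    proj = pts_b @ K.T                                            # project into B's image
    uv_b = proj[:, :2] / proj[:, 2:3]
    return uv_b.reshape(h, w, 2)
```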

Implementation Details

The pipeline draws base images from a range of publicly available real-world datasets (e.g., COCO, Google Landmarks, Cityscapes), from which approximately 525,000 synthetic training pairs are generated. The encoder and decoder are trained on 4×A100 GPUs, with learning rates and batch sizes tuned for stability. The depth estimation, mesh reconstruction, and inpainting modules are all off-the-shelf high-performance models, so the pipeline can focus on explicit 3D reasoning rather than on controlled multi-view capture.

Results

L2M demonstrates consistent state-of-the-art zero-shot generalization on the ZEB benchmark (a suite of 12 heterogeneous real and synthetic datasets), outperforming previous dense, semi-dense, and sparse matching frameworks in most scenarios. For instance, it achieves a mean pose-estimation AUC@5° of 51.8%, exceeding previous bests such as GIM (51.2%) and RoMa (48.8%). The gains are especially marked on datasets with large viewpoint, lighting, and weather variation, showcasing superior multi-view and appearance robustness.

On MegaDepth-1500, after in-domain fine-tuning, L2M yields the top AUC across pose thresholds (63.1%@5°), and for cross-modal (RGB-IR) dense matching on METU-VisTIR, L2M presents a significant improvement (30.13%@5°, 53.11%@10°, 71.80%@20°) over prior methods, including dedicated cross-modal designs.

Ablation studies highlight the importance of both stages: removing the 3D-aware encoder or synthetic data generation leads to measurable generalization drops, especially across challenging synthetic and real datasets.

Qualitative visualizations further support the approach, showing more precise and denser correspondences under complex, realistic conditions than other state-of-the-art dense matchers.

Implications and Future Directions

The paper's modular lifting-based architecture decouples feature learning from the constraints of multi-view data, establishing a practical paradigm shift toward scalable, open-world dense correspondences. By using monocular depth and 3D-aware feature learning, L2M breaks the dependency on difficult-to-acquire multi-view datasets and provides a path for leveraging the vast corpus of Internet-scale 2D imagery for robust geometric understanding.

Practical Implications:

  • L2M is deployable in systems where acquiring multi-view images is infeasible or expensive, e.g., consumer robotics, AR/VR, and autonomous navigation in unconstrained environments.
  • Its generalization to cross-modal matching suggests application to RGB-IR, RGB-Event, and multi-sensor fusion tasks, leveraging a single RGB pipeline.
  • The method is parallelizable and scalable for large-scale training, limited primarily by the computational intensity of 3D lifting, mesh reconstruction, and synthetic renderings.
  • Integrating 3D-aware encoders trained in this manner into downstream 3D vision and SLAM pipelines is straightforward, given existing codebases for monocular depth, inpainting, rasterization, and feature construction.

Limitations:

  • While monocular depth estimation has progressed, depth errors or artifacts may propagate into synthetic data, potentially impacting downstream matching in scenes with ambiguous geometry.
  • The computational cost for large-scale 3D scene lifting and synthetic data generation remains considerable, albeit amortized by pre-processing.

Future Developments:

  • Improved monocular depth models and inpainting for enhanced realism and geometric fidelity in training data.
  • Unification of the encoder and decoder stages into an end-to-end geometry-aware correspondence model.
  • Extension to domains beyond RGB, including multi-modality (e.g., LiDAR, Event, or SAR) matching, informed by explicit 3D structure.
  • Online domain adaptation by integrating scene-specific or task-specific geometry cues during deployment.

Conclusion

L2M constitutes a comprehensive, geometry-bridging solution for robust, generalizable dense feature matching from single-view images. By explicitly lifting single 2D images into 3D and optimizing both feature learning and data generation in this context, the work sets a foundation for future research in scalable correspondence learning, bridging gaps between 2D semantic richness and 3D geometric fidelity.
