- The paper presents a novel hybrid approach combining deep neural networks and geometric optimization for accurate single-image camera calibration.
- It couples Perspective Field predictions with a differentiable Levenberg-Marquardt optimization to recover both intrinsic and extrinsic camera parameters.
- GeoCalib demonstrates robustness across diverse datasets, significantly improving estimates of gravity direction, field of view (FoV), and lens distortion in practical 3D imaging applications.
GeoCalib: Learning Single-Image Calibration with Geometric Optimization
The paper "GeoCalib: Learning Single-image Calibration with Geometric Optimization" presents GeoCalib, a novel approach for camera calibration using a single image by integrating deep neural networks (DNNs) with geometric optimization. This approach advances the state-of-the-art in single-image calibration, addressing limitations in previous methods that either relied heavily on classical geometry or purely on data-driven deep learning models.
Overview
Camera calibration is critical for numerous 3D imaging applications, such as metrology, 3D reconstruction, and novel view synthesis. Existing approaches rely either on classical geometric methods, which are highly accurate in controlled environments but fail in less structured scenes, or on deep learning models trained end-to-end, which generalize better but are often less accurate because they must infer projective geometry purely from data. GeoCalib addresses these issues by combining the robustness of deep learning with the precision of geometric constraints.
Key Contributions
GeoCalib estimates intrinsic and extrinsic camera parameters by leveraging a DNN to predict visual cues, termed Perspective Fields, and employs an optimization process informed by 3D geometry. This combination aims to exploit the strengths of both paradigms.
- Perspective Fields: The model predicts per-pixel up-vectors and latitudes that act as visual cues. The up-vector at a pixel is the projected image direction of the world's vertical (gravity-aligned) axis, while the latitude is the angle between the viewing ray through that pixel and the horizontal plane. These cues then serve as geometric constraints in the subsequent optimization (see the first sketch after this list).
- Differentiable Optimization: GeoCalib refines the camera parameters with the Levenberg-Marquardt (LM) algorithm, implemented so that the optimization itself is differentiable. Training can therefore backpropagate through the solver and adjust the DNN's weights, letting the network learn visual cues that effectively constrain the camera model (a minimal LM step is sketched after this list).
- Flexibility and Robustness: GeoCalib handles different camera models and capture conditions by supporting arbitrary distortion models and by incorporating priors on camera parameters when they are known. The optimization can also be run jointly over multiple images to improve the estimates.
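To make the Perspective Field cues concrete, below is a minimal sketch of how per-pixel up-vectors and latitudes follow from a simple pinhole camera with focal length `f`, principal point `c`, and a unit gravity direction expressed in the camera frame. This is an illustration, not the paper's implementation; the function and argument names are assumptions, and real camera models also include distortion.

```python
import torch

def perspective_field(f, c, gravity, h, w):
    """Per-pixel up-vectors and latitudes for a pinhole camera.

    f:       focal length in pixels (float or scalar tensor)
    c:       principal point, (2,) tensor in pixels
    gravity: unit (3,) tensor pointing "down" in the camera frame
    Returns up-vectors (h, w, 2) and latitudes in radians (h, w).
    """
    up = -gravity  # world "up" direction in the camera frame

    # Pixel grid and normalized viewing rays r = K^-1 [u, v, 1]^T.
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32),
                          indexing="ij")
    rays = torch.stack([(u - c[0]) / f, (v - c[1]) / f, torch.ones_like(u)], dim=-1)
    rays = rays / rays.norm(dim=-1, keepdim=True)

    # Latitude: angle between the viewing ray and the horizontal plane.
    latitude = torch.asin((rays * up).sum(-1).clamp(-1.0, 1.0))

    # Up-vector: image direction of the projected vertical axis, obtained by
    # differentiating the pinhole projection along `up`:
    #   d_pi ∝ f * up_xy - (pixel - c) * up_z
    pix = torch.stack([u, v], dim=-1)
    up2d = f * up[:2] - (pix - c) * up[2]
    up2d = up2d / up2d.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    return up2d, latitude
```

Calibration is then the inverse problem: find the focal length, principal point, and gravity direction whose analytic field best matches the field predicted by the network.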
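The differentiable optimization can be sketched just as compactly. Assuming a residual function that compares the network's predicted fields with the fields implied by a candidate parameter vector (e.g. via the `perspective_field` sketch above), one Levenberg-Marquardt update built from ordinary autograd operations looks roughly like this; the damping schedule, robust loss, and confidence weighting of the actual method are omitted, and the names are again assumptions.

```python
import torch

def lm_step(params, residual_fn, damping=1e-3):
    """One Levenberg-Marquardt update on a flat parameter vector.

    Composed of differentiable tensor ops only, so gradients can flow
    through the update back into whatever produced the residuals
    (e.g. a network predicting perspective fields).
    """
    r = residual_fn(params)                                    # (N,)
    J = torch.autograd.functional.jacobian(residual_fn, params,
                                           create_graph=True)  # (N, P)
    JtJ, Jtr = J.T @ J, J.T @ r
    # Damped normal equations: (JtJ + damping * diag(JtJ)) delta = -Jtr
    H = JtJ + damping * torch.diag(torch.diagonal(JtJ))
    delta = torch.linalg.solve(H, -Jtr)
    return params + delta
```

Unrolling a handful of such steps and supervising the resulting camera parameters is what allows the training signal to reach the network and shape the cues it predicts.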
Experimental Results
GeoCalib demonstrates significantly improved performance on multiple benchmarks, including the Stanford2D3D, TartanAir, MegaDepth, and LaMAR datasets, outperforming both classical and deep learning-based methods. Key numerical results include:
- Gravity Estimation: GeoCalib achieves state-of-the-art gravity accuracy, with an AUC@1° of up to 83.1% on Stanford2D3D, significantly surpassing previous methods (see the metric sketch after this list).
- Field of View (FoV) Estimation: The method enhances FoV accuracy, attaining a median error as low as 3.21° on Stanford2D3D.
- Distortion Handling: Even when trained on pinhole images only, GeoCalib accurately estimates camera distortions, showing robustness across various real-world scenarios.
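For reference, AUC@1° denotes the area under the cumulative error curve up to a 1° threshold: the fraction of images whose gravity error falls below each threshold, integrated and normalized. A minimal sketch of such a metric follows; the paper's exact evaluation protocol may differ in details such as threshold sampling.

```python
import numpy as np

def auc_at_threshold(errors_deg, max_threshold_deg=1.0, num_steps=1000):
    """Normalized area under the cumulative error curve (in [0, 1])."""
    errors = np.asarray(errors_deg, dtype=np.float64)
    thresholds = np.linspace(0.0, max_threshold_deg, num_steps)
    recall = [(errors <= t).mean() for t in thresholds]  # fraction below each threshold
    return np.trapz(recall, thresholds) / max_threshold_deg
```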
Practical and Theoretical Implications
From a practical standpoint, GeoCalib's ability to accurately estimate camera parameters from a single image benefits tasks such as visual localization and augmented/virtual reality. The model can serve as a reliable pre-processing step for computer vision algorithms that require calibrated inputs, as in the usage sketch below.
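Inference is correspondingly simple to slot in front of such pipelines. The snippet below follows the usage pattern of the project's public repository (github.com/cvg/GeoCalib); the exact function names and returned keys are assumptions and may differ from the released API.

```python
import torch
from geocalib import GeoCalib  # import path assumed; see the project README

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GeoCalib().to(device)

# Estimate intrinsics (focal length / FoV) and gravity from a single image.
image = model.load_image("path/to/image.jpg").to(device)
result = model.calibrate(image)

print("estimated camera:", result["camera"])    # intrinsics
print("estimated gravity:", result["gravity"])  # roll / pitch
```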
Theoretically, the integration of differentiable optimization into a deep learning pipeline represents a meaningful advancement in geometric computer vision. It shows the potential of hybrid methods that combine learning-based approaches with classical optimization techniques to achieve robust and accurate results.
Future Directions
GeoCalib opens several avenues for future research:
- Extending to Dynamic Scenes: Investigating how GeoCalib performs in dynamic scenes with moving objects and changing lighting conditions.
- Higher-Level Semantics: Exploring the use of more sophisticated semantic information to further improve visual cue predictions.
- Broader Camera Models: Adapting and validating GeoCalib on a wider range of camera models, including those used in specialized fields such as medical imaging or astrophotography.
In conclusion, GeoCalib represents a significant advancement in single-image camera calibration, providing a robust model that effectively merges the geometric constraints of classical methods with the flexibility and generalization capabilities of deep learning. Its contributions pave the way for more accurate and reliable calibration techniques across a broad spectrum of imaging applications.