CalibAnyView: Beyond Single-View Camera Calibration in the Wild
Abstract: Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.
Explain it Like I'm 14
CalibAnyView: A simple explanation
What is this paper about?
This paper is about “calibrating” a camera from everyday photos or videos, even when they’re taken in messy, real-world situations. Calibrating a camera means figuring out its hidden settings (such as how zoomed-in it is and how its lens bends the image) so computers can correctly understand the 3D world from 2D pictures. The authors present a new method called CalibAnyView that works with one image or many images and stays accurate even when scenes are busy, blurry, or taken with different kinds of lenses.
What questions are the researchers trying to answer?
In simple terms, they ask:
- Can we accurately figure out a camera’s settings from photos or videos taken “in the wild” (not in a lab)?
- Can we do this from one picture, and also get even better results when we have several views?
- Can we also figure out which way “up” is in the real world (the direction of gravity) from the images?
- Can a single system handle different lens types, including normal lenses and bendy, wide‑angle or fisheye lenses?
How did they approach the problem?
Think of it like giving a camera a “vision checkup” without using special test charts.
- Single view vs multi view:
- One image is like a single clue—it’s easy to get confused because different camera settings can produce similar pictures.
- Several images from different moments or angles are like multiple clues: together they make the answer clearer. CalibAnyView is built to use one or many images, whichever you have.
- A helpful “map” inside each image:
- The system predicts two per‑pixel “maps” that describe perspective:
- Up Field: tiny arrows across the image showing where “up” (toward the sky/zenith) should be at each pixel. Imagine placing lots of little compasses on the picture, all pointing toward “up.”
- Latitude Field: a number at each pixel saying how far each sightline tilts above or below the horizon—like a tilt angle meter.
- It also predicts a confidence map, which tells the system which parts of the image are trustworthy (e.g., buildings with straight lines) and which are not (e.g., blurry or textureless areas).
- Sharing information across views:
- The core AI model is a transformer (a type of neural network good at spotting relationships). You can think of it as a team of readers comparing notes across frames: it looks within each image (intra‑frame attention) and across images (cross‑frame attention) to find consistent geometric clues shared by all views.
- Fine‑tuning with geometry:
- After the model predicts the Up and Latitude maps, a math step adjusts the camera settings so they line up with those maps as well as possible. This is an iterative “tweak until it fits” process (similar to trying different eyeglass prescriptions until things look sharp).
- A new, realistic dataset:
- To train and test fairly, they built a large video dataset from real 360° panoramas. They “cut out” many normal views from the spherical videos and simulated different camera motions and lenses (normal, radial distortion, and fisheye‑like). They also filtered out low‑quality clips. This creates lots of diverse, real‑world training examples where the true camera settings are known.
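As a rough illustration of the two per-pixel maps described above, the sketch below evaluates the up direction and the latitude at a single pixel of an ideal pinhole camera. It is not the paper's network or code. The function name `perspective_fields_at`, the axis convention (x right, y down, z forward), the absence of lens distortion, and the `up_cam` input (the zenith direction expressed in the camera frame) are all assumptions made for this example.

```python
import math


def perspective_fields_at(u, v, f, cx, cy, up_cam):
    """Up direction (2D, unit length) and latitude (radians) at pixel (u, v).

    Assumes an undistorted pinhole camera with focal length f in pixels,
    principal point (cx, cy), and up_cam given as a unit 3-vector pointing
    toward the zenith in the camera frame (x right, y down, z forward).
    """
    # Back-project the pixel to a viewing ray on the z = 1 plane.
    x, y = (u - cx) / f, (v - cy) / f
    norm = math.sqrt(x * x + y * y + 1.0)
    ray = (x / norm, y / norm, 1.0 / norm)

    # Latitude: angle between the viewing ray and the horizon plane.
    dot = sum(r * g for r, g in zip(ray, up_cam))
    latitude = math.asin(max(-1.0, min(1.0, dot)))

    # Up direction: image-space derivative of the projection when the 3D
    # point (x, y, 1) is nudged toward the zenith, then normalized.
    dx = up_cam[0] - x * up_cam[2]
    dy = up_cam[1] - y * up_cam[2]
    length = math.hypot(dx, dy)
    up_2d = (dx / length, dy / length) if length > 1e-12 else (0.0, 0.0)
    return up_2d, latitude
```

For a level camera under these conventions, `up_cam = (0.0, -1.0, 0.0)`: the center pixel sits on the horizon (latitude 0, up direction pointing toward the top of the image), and a pixel `f` rows above the center looks 45° above the horizon.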
What did they find?
- Better accuracy than previous methods:
- In single-image tests, CalibAnyView matches or beats leading methods at estimating roll, pitch (how the camera is rotated), and field of view (how zoomed-in/out it is).
- Improves with more views:
- When you add more frames, accuracy keeps getting better. The system takes advantage of shared information across views to resolve ambiguities that one image alone can’t.
- Robust to lens distortion:
- It handles different lens types (including wide and fisheye‑like distortions) in a single model instead of needing different models for different lenses.
- Works in challenging scenes:
- Classical multi-view pipelines can fail when scenes are dynamic or when there isn’t enough overlap to reconstruct the 3D scene. CalibAnyView still provides reliable calibration in those tougher “in-the-wild” cases.
Why this matters: Knowing the camera’s true settings and “which way is up” helps many tasks:
- 3D reconstruction: building accurate 3D models from photos.
- Robotics and drones: understanding orientation and distance for safer navigation.
- Augmented reality: placing virtual objects so they line up correctly with the real world.
Why is this important?
This research shows we can calibrate cameras directly from casual, everyday videos, not just from lab setups with checkerboards. It works if you have one photo and gets even better if you have a few. That makes it practical for smartphones, drones, and robots, and it can boost the reliability of many downstream technologies that depend on understanding the 3D world from images.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.
- Principal point and full intrinsics modeling
- The principal point is fixed at the image center and focal length is assumed isotropic (single f). Real cameras often have off-center principal points, different fx/fy, skew, and non-square pixels. How does performance change if these are learned, and can the framework be extended to estimate full intrinsics?
- Sensitivity analysis to principal-point miscentering and aspect-ratio changes is missing; many real-world pipelines crop/stabilize images off-center.
- Lens/distortion model coverage
- Distortion is restricted to a single radial coefficient k1 or UCM with a single parameter ξ. Many lenses require higher-order radial terms, tangential/prism distortion (Brown–Conrady), or other fisheye/catadioptric models. How to extend and evaluate the method under richer distortion families?
- No evaluation on non-central or multi-perspective cameras (e.g., generalized cameras, catadioptric mirrors beyond UCM’s assumptions).
- Variable intrinsics across frames
- The multi-view optimization assumes shared intrinsics across a sequence. In-the-wild video often exhibits time-varying intrinsics due to optical/digital zoom, EIS cropping, focus breathing, or multi-camera switches on phones. How to detect and model per-frame or piecewise-constant intrinsics?
- Rolling-shutter and sensor/ISP effects
- The method and intermediate perspective-field representation implicitly assume global-shutter, rigid projection. Rolling-shutter distortions, temporal readout, and stabilization-induced warps are not modeled or tested. Can the approach be adapted to rolling-shutter cameras or videos with strong stabilization?
- Other sensor/ISP artifacts (vignetting, chromatic aberration, noise, compression) are not explicitly modeled; robustness to these is not quantified.
- Dependency on vertical/“upright” cues
- Perspective fields leverage up-vectors and latitude, implicitly relying on vertical structures or horizon-like cues. How does the method perform in scenes lacking strong verticals (e.g., forests, caves, underwater, sky-only), non-Manhattan environments, or highly cluttered indoor spaces?
- Dataset realism and domain gaps
- The multi-view dataset is synthesized by reprojecting gravity-aligned panoramas with transferred/augmented motions. This lacks real lens/sensor idiosyncrasies, hardware zoom, stabilization, and rolling-shutter effects. How large is the domain gap to truly captured multi-view sequences with ground-truth intrinsics?
- No validation against physical calibration targets or factory intrinsics on real devices (phones, drones, robots) to quantify real-world accuracy.
- Evaluation breadth
- Results largely focus on roll, pitch, and (v)FoV; principal point and fx/fy are not evaluated (since not estimated). Distortion accuracy is only partially assessed (k1 and relative pixel error). Broader metrics (e.g., per-pixel reprojection error under richer models) are missing.
- No stratified evaluation by scene type (indoor/outdoor, texture level, motion blur, dynamics) to reveal failure modes and robustness boundaries.
- Use of geometric constraints across views
- Cross-view consistency is handled via attention in the transformer and shared-intrinsics LM optimization, but there are no explicit multi-view geometric losses (e.g., epipolar constraints, line/vanishing-point consistency across frames) during training. Would explicit cross-view geometric supervision further reduce ambiguity?
- The approach does not incorporate sparse geometric cues (lines/VPs) when available; hybridizing learned fields with classical cues remains unexplored.
- Sequence length, frame selection, and scalability
- While accuracy improves with N, the computational/memory scaling and latency for longer sequences are not reported. What is the practical maximum N and runtime on common hardware?
- Optimal frame selection policies (baseline/overlap trade-offs, diversity vs redundancy) are not studied; does intelligent sub-sampling outperform uniform sampling?
- Uncertainty and confidence quantification
- The model learns per-pixel confidence for fields, but the final intrinsics/gravity estimates lack calibrated parameter uncertainties or confidence intervals. Can LM’s covariance or Bayesian approaches be used to provide uncertainty estimates?
- Calibration of the confidence maps (e.g., reliability diagrams) and how they translate into parameter uncertainty is not analyzed.
- Training dynamics and ablations
- The paper initializes with DINOv2/VGGT and uses DPT, but lacks ablations isolating the effect of each design choice (backbone, alternating attention, field resolution, loss weighting). Which components are most critical?
- End-to-end training through the geometric optimizer is not explored (training uses field supervision, not parameter-level supervision via differentiable LM). Would joint training through the optimizer improve final parameter accuracy?
- Handling dynamic scenes and occlusions
- Although the dataset includes dynamics, the impact of moving objects/occlusions on perspective-field quality and optimization is not dissected. Are there failure cases where dynamics bias the up/latitude predictions?
- Aspect ratio and preprocessing dependence
- Performance varies with “resize” vs “crop” preprocessing; generalization to arbitrary aspect ratios and resolutions is not systematically addressed. How to make predictions invariant to preprocessing choices?
- Integration with sensors and downstream tasks
- IMU fusion (gravity priors) is not considered; combining learned gravity with accelerometer data could resolve ambiguities and improve robustness.
- Claimed benefits to 3D reconstruction and robotics are not quantitatively demonstrated (e.g., improved SLAM/SfM accuracy when using estimated intrinsics/gravity); downstream impact studies are missing.
- Extending outputs beyond gravity
- The method estimates gravity (pitch/roll) but not full extrinsic pose. Joint estimation of intrinsics with partial or full extrinsics, or providing a gravity-aligned camera-to-world orientation where possible, remains an open extension.
- Failure mode analysis and benchmarks
- No systematic analysis of catastrophic failures (outliers) across datasets (e.g., nighttime/low light, extreme FoV > 180°, textureless scenes). Creating targeted stress tests and reporting robust statistics would guide future improvements.
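On the uncertainty question raised in the list above, one textbook option is the Gauss-Newton approximation of the parameter covariance at a least-squares solution, $\mathrm{Cov} \approx s^2 (J^\top J)^{-1}$ with $s^2$ the residual variance. The sketch below shows that formula for a two-parameter problem. It is a generic illustration, not something the paper reports: the function name, the restriction to two parameters, and the idea that the two parameters might be, say, focal length and pitch are all assumptions.

```python
import math


def gauss_newton_covariance(jac_rows, residuals):
    """Approximate 2x2 parameter covariance at a least-squares optimum.

    jac_rows:  list of (dr_i/dp0, dr_i/dp1) pairs, one per residual.
    residuals: list of residual values r_i at the optimum.
    Returns s^2 * (J^T J)^{-1}, with s^2 = sum(r_i^2) / (n - 2).
    """
    n = len(residuals)
    if n <= 2 or len(jac_rows) != n:
        raise ValueError("need more than two residuals, one Jacobian row each")

    # Entries of the symmetric 2x2 matrix J^T J.
    a = sum(j0 * j0 for j0, _ in jac_rows)
    b = sum(j0 * j1 for j0, j1 in jac_rows)
    d = sum(j1 * j1 for _, j1 in jac_rows)
    det = a * d - b * b
    if det == 0.0:
        raise ValueError("J^T J is singular; parameters are not identifiable")

    # Residual variance with n - 2 degrees of freedom.
    s2 = sum(r * r for r in residuals) / (n - 2)

    # Closed-form inverse of a 2x2 matrix, scaled by s^2.
    return [[s2 * d / det, -s2 * b / det],
            [-s2 * b / det, s2 * a / det]]


def standard_deviations(cov):
    """Per-parameter standard deviations from the covariance diagonal."""
    return [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
```

The approximation is only meaningful when the residuals are roughly Gaussian and the weights reflect true noise levels, which is exactly the confidence-calibration question the list raises.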
Practical Applications
Overview
CalibAnyView introduces a unified, learning-and-geometry hybrid framework that calibrates cameras “in the wild” from single images or multiple views (N ≥ 1). It estimates camera intrinsics (focal length and lens distortion) and the absolute gravity direction by predicting dense “perspective fields” and refining parameters via a Levenberg–Marquardt optimization. The method is robust to varied lenses (pinhole, radial, UCM/fisheye), dynamic scenes, and sparse views, and it improves as more views are provided. The paper also contributes a large, diverse multi-view dataset for training and benchmarking.
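To make the "predict fields, then refine with Levenberg–Marquardt" idea concrete, here is a deliberately tiny sketch that fits a single parameter, the focal length, to latitude values sampled along the center column of a level pinhole camera, weighting each sample by a confidence value. It is a toy under stated assumptions, not the paper's solver, which jointly estimates intrinsics, distortion, and gravity. The function names, the level-camera setup, and the Marquardt-style damping schedule are all choices made for this example.

```python
import math


def predicted_latitude(v, f, cy):
    """Latitude (radians) at row v of the center column for a level
    pinhole camera with focal length f (pixels) and principal row cy."""
    return math.atan((cy - v) / f)


def refine_focal_lm(rows, lat_obs, conf, cy, f_init, iters=50):
    """Damped Gauss-Newton (Levenberg-Marquardt) fit of the focal length.

    Minimizes sum_i conf_i * (predicted_latitude(v_i, f) - lat_obs_i)^2.
    """
    def cost(f_val):
        return sum(c * (predicted_latitude(v, f_val, cy) - l) ** 2
                   for v, l, c in zip(rows, lat_obs, conf))

    f, lam = f_init, 1e-3
    current = cost(f)
    for _ in range(iters):
        jtj = jtr = 0.0
        for v, l, c in zip(rows, lat_obs, conf):
            a = cy - v
            r = predicted_latitude(v, f, cy) - l   # residual
            j = -a / (f * f + a * a)               # d/df of atan(a / f)
            jtj += c * j * j
            jtr += c * j * r
        if jtj == 0.0:
            break
        # Marquardt damping: scale the (1x1) normal matrix by (1 + lam).
        step = -jtr / (jtj * (1.0 + lam))
        candidate = f + step
        new_cost = cost(candidate) if candidate > 0 else float("inf")
        if new_cost < current:
            f, current, lam = candidate, new_cost, lam / 10.0  # accept
        else:
            lam *= 10.0                                        # reject
        if abs(step) < 1e-9:
            break
    return f
```

With noise-free latitudes generated at a focal length of 500 px and a starting guess of 300 px, the loop should settle very close to 500. In a real pipeline the confidence weights would come from the predicted confidence map, so unreliable pixels pull less on the estimate.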
Below are actionable, real-world applications derived from these findings, grouped by deployment horizon. Each item includes sector linkages, likely tools/products/workflows, and key assumptions/dependencies that may affect feasibility.
Immediate Applications
These can be piloted or deployed with current capabilities, subject to integration and validation.
- Automatic lens calibration for consumer and prosumer video (software, media/creative)
- What: Batch-calibrate focal length and distortion for casual videos to stabilize horizons, correct fisheye distortions, and improve de-warping.
- Tools/products/workflows:
- NLE plugins (e.g., Adobe Premiere/After Effects, DaVinci Resolve, Final Cut) that run CalibAnyView on clips to auto-generate lens profiles and horizon alignment.
- Standalone “AutoCalib” desktop app or command-line tools that export OpenCV-compatible intrinsics and distortion coefficients.
- Assumptions/dependencies:
- Principal point assumed centered (paper’s default); off-center sensors or heavy crop pipelines may reduce accuracy.
- Requires GPU or efficient CPU inference for transformer backbone; latency depends on sequence length.
- Best performance with multi-frame input; single frames work but are less accurate.
- On-the-fly calibration for drones and action cameras (robotics, mapping/GIS, consumer electronics)
- What: Calibrate cameras in the field without checkerboards to improve mapping, orthorectification, and visual odometry.
- Tools/products/workflows:
- Drone ground-control software plugin that runs multi-view calibration pre-flight or mid-flight on short video bursts.
- Export intrinsics to photogrammetry/SLAM stacks (e.g., COLMAP, OpenSfM, ORB-SLAM) to reduce reconstruction ambiguities.
- Assumptions/dependencies:
- Shared intrinsics across frames; zoom or focus changes invalidate “shared intrinsics” unless segmented by shot.
- Rolling shutter effects are not explicitly modeled; fast motion may require additional compensation.
- Robust initialization for 3D reconstruction, SLAM, and NeRF pipelines (software, robotics)
- What: Provide strong priors for intrinsics and gravity to accelerate convergence and reduce failure rates of SfM/SLAM/NeRF in dynamic or sparse-view scenes.
- Tools/products/workflows:
- Pre-processing node in COLMAP/OpenSfM/ElasticFusion/DROID-SLAM to initialize intrinsics and orientation.
- NeRF training scripts that fix intrinsics and gravity estimates up front to narrow search space.
- Assumptions/dependencies:
- Helpful where feature overlap is limited and classical self-calibration fails.
- Gravity estimation complements IMU and can detect IMU misalignment; fusion requires calibration of time and axes between sensors.
- Broadcast, sports, and CCTV analytics without calibration targets (media analytics, security/retail)
- What: Calibrate fixed or PTZ cameras using routine footage to enable metric scene understanding (player tracking, 3D trajectories, people counting).
- Tools/products/workflows:
- Edge or server-side services that periodically re-calibrate cameras from rolling footage to maintain accurate scene geometry.
- Assumptions/dependencies:
- For fixed installations with non-centered principal points or significant lens tilt, accuracy depends on the centered-PP assumption.
- Scene vertical cues improve performance; textureless or highly oblique scenes may reduce accuracy.
- AR measurement and horizon stabilization for smartphones (mobile software, AR/VR)
- What: Improve visual-only AR apps when IMU/magnetometer readings drift, and stabilize horizons for capture apps.
- Tools/products/workflows:
- Mobile SDK that combines CalibAnyView gravity with IMU in a sensor fusion module; fallback when IMU is unreliable.
- Assumptions/dependencies:
- Real-time constraints require model distillation or on-device acceleration; portrait aspect ratios may need proper preprocessing (crop/resize choice affects FoV).
- For phones with known intrinsics, treat this as validation/health-check rather than replacement.
- VFX/matchmoving initialization for lens profiles (media/creative)
- What: Quickly estimate lens distortion from plates to seed high-precision matchmove workflows, reducing manual setup.
- Tools/products/workflows:
- Pipeline step exporting intrinsics to Nuke/Maya/Houdini matchmove tools; automated lens profile libraries for recurring shoots.
- Assumptions/dependencies:
- High-end VFX still requires sub-pixel precision; use as initialization and validate with traditional solves.
- Forensic and insurance scene analysis from dashcams/bodycams (public safety, insurance)
- What: Recover approximate camera intrinsics and gravity from ad‑hoc footage to assist scene reconstruction and evidence contextualization.
- Tools/products/workflows:
- Triage tools that auto-generate camera models and orientation envelopes from submitted clips for case pre-analysis.
- Assumptions/dependencies:
- Must include uncertainty quantification and validation protocols; evidentiary use requires documented accuracy bounds and chain-of-custody compliance.
- Dataset for benchmarking and training camera-aware models (academia, AI/ML tooling)
- What: Use the provided multi-view dataset to train/evaluate camera-aware perception, video generation, and calibration research.
- Tools/products/workflows:
- Public benchmarks for intrinsics + gravity estimation; pretraining modules for geometry-aware backbones.
- Assumptions/dependencies:
- Licensing/availability of the dataset and weights; ensure domain alignment when transferring to specialized contexts.
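Several of the workflows above hand the estimated calibration to other tools as an OpenCV-style camera matrix and distortion vector. The sketch below shows one way that conversion could look, starting from a vertical field of view and a single radial coefficient. The function names are hypothetical, the principal point is placed at the image center as the paper assumes, and treating the estimated k1 as the first coefficient of OpenCV's (k1, k2, p1, p2, k3) polynomial model is an assumption that would need checking against the method's actual distortion parameterization.

```python
import math


def focal_from_vfov(vfov_deg, height_px):
    """Pinhole focal length in pixels from the vertical field of view."""
    return 0.5 * height_px / math.tan(math.radians(vfov_deg) / 2.0)


def opencv_style_intrinsics(vfov_deg, width_px, height_px, k1=0.0):
    """Build a 3x3 camera matrix and a 5-element distortion vector.

    Assumes square pixels (fx == fy), zero skew, and a principal point
    at the image center; only the first radial term is populated.
    """
    f = focal_from_vfov(vfov_deg, height_px)
    camera_matrix = [
        [f, 0.0, width_px / 2.0],
        [0.0, f, height_px / 2.0],
        [0.0, 0.0, 1.0],
    ]
    # Ordered as (k1, k2, p1, p2, k3); unestimated terms are left at zero.
    dist_coeffs = [k1, 0.0, 0.0, 0.0, 0.0]
    return camera_matrix, dist_coeffs
```

For example, a 640×480 frame with a 90° vertical field of view gives a focal length of about 240 px. Fisheye (UCM) estimates do not map onto this polynomial vector and would need a different export path.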
Long-Term Applications
These require additional research, scaling, or domain adaptation beyond the paper’s current scope.
- Fleet-scale auto-calibration and health monitoring (autonomous vehicles, robotics, logistics)
- What: Continual, targetless calibration for fleets (cars, delivery robots, warehouse AGVs) to detect drift, lens changes, or temperature-induced variations.
- Tools/products/workflows:
- Cloud service ingesting periodic video snippets to re-estimate intrinsics and alert when deviations exceed thresholds; integration with maintenance dashboards.
- Assumptions/dependencies:
- Needs handling of time-varying intrinsics (zoom/focus, temperature) and robust rolling-shutter/vehicle vibration modeling.
- Must formalize calibration uncertainty for safety certification (ISO 26262, IEC 61508).
- Multi-camera rig and cross-sensor calibration (automotive surround-view, AR glasses, multi-camera drones)
- What: Jointly calibrate multiple cameras (and possibly LiDAR/radar) by enforcing shared geometry and cross-view constraints without calibration targets.
- Tools/products/workflows:
- Rig-level optimizer that extends shared-intrinsics to per-camera shared constraints, combined with inter-sensor extrinsics estimation.
- Assumptions/dependencies:
- Paper solves per-sequence shared intrinsics for one camera; extending to multi-camera requires new cross-camera constraints and extrinsics recovery.
- Medical endoscopy and microscopy auto-calibration (healthcare)
- What: Estimate lens distortion and gravity/pose proxies for endoscopes/microscopes to improve 3D reconstruction and navigation without calibration phantoms.
- Tools/products/workflows:
- OR integration that calibrates from short clips before procedures; lab automation that self-calibrates microscopes across objectives.
- Assumptions/dependencies:
- Domain shift is substantial (textures, lighting, optics); requires specialized training data and potentially different lens models and priors.
- Compliance-grade calibration for smart infrastructure (policy, public sector, AECO)
- What: Standardize automated calibration for city cameras (traffic, safety) and construction monitoring to enable consistent metric analytics and audits.
- Tools/products/workflows:
- Policy frameworks specifying calibration recency, uncertainty ceilings, and auto-recalibration triggers; certified tooling based on CalibAnyView-like methods.
- Assumptions/dependencies:
- Governance around data privacy and retention; formal validation protocols and periodic ground-truth checks.
- Camera-aware generative video and scene synthesis (media/AI content)
- What: Integrate calibration and gravity priors into camera-aware video generation and editing for physically plausible outputs and controllable virtual cinematography.
- Tools/products/workflows:
- Training generative models with the provided dataset and perspective fields as supervisory signals; editors that maintain consistent virtual camera metadata.
- Assumptions/dependencies:
- Requires tight coupling between generative priors and geometric constraints; expanded datasets with richer motions and lenses.
- Real-time, on-device calibration for AR wearables and robotics (AR/VR, embedded systems)
- What: Run lightweight calibration continuously to correct drift and maintain metric consistency in long-running AR/robotic deployments.
- Tools/products/workflows:
- Quantized/distilled models on edge accelerators; ROS2 nodes performing rolling-window multi-view calibration fused with IMU.
- Assumptions/dependencies:
- Efficiency and latency constraints; robust operation under low light and motion blur; resilience to sensor thermal drift.
- Automated QA for camera manufacturing and after-market lens accessories (manufacturing, consumer electronics)
- What: Non-contact, targetless QA to check intrinsics consistency across units or after lens changes.
- Tools/products/workflows:
- Factory end-of-line stations that run quick multi-frame captures through the model and compare against spec tolerances.
- Assumptions/dependencies:
- Requires controlled scene diversity or synthetic fixtures that emulate “in-the-wild” cues; regulatory acceptance for QA substitution.
- Scene-scale metrology and policy planning (urban planning, insurance risk, disaster assessment)
- What: Calibrated, crowd-sourced imagery for metric measurements (e.g., curb heights, road camber, flood levels) from ad-hoc videos.
- Tools/products/workflows:
- Platforms aggregating citizen videos, calibrating them automatically, and building metric 3D overlays for planners and assessors.
- Assumptions/dependencies:
- Requires strong error quantification, de-biasing across device types, and standardized reporting for decision-making.
Cross-cutting Assumptions and Dependencies
- Shared intrinsics across views: The multi-view optimization assumes frames share the same intrinsics; zoom/focus changes or multi-camera mixing require segmentation or extended models.
- Principal point fixed at image center: The paper fixes the principal point to mid-image; deviations (sensor misalignment, non-centered crops) can reduce accuracy and may require extending the parameterization.
- Lens models covered: Pinhole, simple radial (k1), and Unified Camera Model (UCM/fisheye) are supported; other distortions (tangential, higher-order, anamorphic) would need extensions.
- Gravity direction availability: Estimation relies on visual cues; indoors or scenes with weak verticals may degrade performance. Fusion with IMU can mitigate this.
- Computational footprint: Transformer backbone (DINOv2-based) and multi-view attention can be compute-intensive; deployment may need distillation or batching strategies.
- Data and licensing: Access to the released dataset and model weights (and their licenses) affects adoption; domain adaptation is needed for specialized contexts (medical, thermal, night scenes).
- Validation for safety-critical use: For AV/robotics/forensics, validated uncertainty estimates and fallback mechanisms (e.g., checkerboards) remain necessary.
These applications leverage CalibAnyView’s core strengths (targetless calibration in unconstrained environments, cross-view consistency, lens-diverse support, and gravity estimation) and translate them into concrete products and workflows across industry, academia, policy, and daily life.
Glossary
- 6-DoF (degrees of freedom): The six independent parameters (3 for rotation, 3 for translation) describing camera pose in 3D space. "full 6-DoF camera extrinsics"
- Absolute orientation: A global orientation reference (e.g., relative to gravity) rather than just relative pose between views. "lacking a consistent notion of absolute orientation."
- Alternating attention: An attention scheme that alternates between within-frame and cross-frame reasoning to fuse information across views. "utilizes DINOv2 and an alternating attention mechanism"
- Area Under the recall Curve (AUC): A metric aggregating recall over error thresholds to evaluate calibration accuracy. "using the Area Under the recall Curve (AUC) at thresholds of 1°, 5°, and 10°."
- Bundle adjustment: Joint optimization of camera parameters and 3D structure across multiple views to minimize reprojection error. "fiducial markers for bundle adjustment"
- Catadioptric models: Camera models combining lenses and mirrors, often enabling wide fields of view. "across pinhole, radial, fisheye, and catadioptric models"
- Confidence map: A per-pixel estimate of the reliability of predicted geometric quantities used to weight optimization. "the network outputs the perspective fields (U, φ) and a confidence map σ for each view."
- Cross-frame attention: Attention across frames to inject multi-view information and enforce shared geometric constraints. "the global cross-frame attention layers inject multi-view information"
- Cross-view geometric consistency: The requirement that geometry inferred from different views agrees, used to reduce ambiguity. "explicitly modeling cross-view geometric consistency."
- DINOv2: A vision transformer foundation model providing dense representations used as features for calibration. "we first leverage DINOv2 [35] to obtain dense patch-level representations"
- Dense Prediction Transformer (DPT): A transformer-based head for producing dense per-pixel predictions like perspective fields. "based on the Dense Prediction Transformer (DPT) architecture [39]"
- Differentiable solver: An optimization routine whose operations allow gradient propagation for end-to-end learning. "Our differentiable solver minimizes the weighted reprojection residual"
- Division model: A specific radial distortion model that maps undistorted to distorted coordinates via a division formula. "or division model [31]."
- Epipolar geometry: The geometric relationship between two views of the same scene that constrains point correspondences. "targetless self-calibration [37] based on epipolar geometry."
- Equirectangular frames: 360° panoramic images represented in a latitude-longitude grid. "project the equirectangular frames into virtual cameras"
- Fiducial markers: Designed markers (e.g., checkerboards) used to obtain precise correspondences for calibration. "use checkerboards or fiducial markers for bundle adjustment"
- Field of View (FoV): The angular extent of the observable world captured by the camera. "camera's Field of View (FoV)"
- Fisheye lenses: Ultra-wide-angle lenses that introduce strong radial distortion and very large fields of view. "extreme fisheye lenses"
- Foundation models: Large pre-trained models providing general-purpose representations transferable to downstream tasks. "3D foundation models such as VGGT [52]"
- Gravity direction: A unit vector in the camera frame pointing toward the zenith, anchoring absolute orientation. "The gravity direction in the camera frame is defined as a unit vector g"
- Jacobian: The matrix of partial derivatives of a vector-valued function; here, of the projection function. "JT is the Jacobian of the projection function"
- Latitude field: A per-pixel angle between each viewing ray and the horizon used as part of the perspective representation. "Latitude Field (φ)"
- Levenberg–Marquardt (LM) algorithm: An iterative method for solving nonlinear least-squares problems, blending gradient descent and Gauss-Newton. "the Levenberg-Marquardt (LM) algorithm"
- Likelihood-based weighting objective: A loss that treats prediction confidence as inverse-variance, balancing residuals and a log-confidence term. "This likelihood-based weighting objective [24]"
- Manhattan-world: A scene assumption with three dominant, mutually orthogonal directions (e.g., aligned with building axes). "Manhattan-world solvers built on three mutually orthogonal scene directions"
- NeRF (Neural Radiance Fields): A neural representation that models scenes by optimizing radiance and density fields from images. "NeRF-based methods offer another route"
- Non-linear least-squares: An optimization formulation minimizing the sum of squared nonlinear residuals. "solving a non-linear least-squares problem."
- Non-parametric camera model: A flexible camera model not constrained to a fixed parametric form, allowing complex distortions. "incorporate a non-parametric camera model into a SfM pipeline"
- Perspective fields: Dense per-pixel geometric maps (up-vectors and latitudes) that are camera-model-agnostic intermediates for calibration. "Perspective Fields [23] predict up-vectors and latitude per pixel"
- Pinhole model: The ideal projection model mapping 3D points to a normalized image plane without lens distortion. "Under the standard pinhole model"
- Principal point: The image coordinates where the optical axis intersects the image plane. "represents the principal point."
- Radial distortion: Lens-induced displacement of image points radially from the center, typical of wide-angle lenses. "Radial distortion can additionally be recovered from curved or covariant line segments"
- Reprojection residual: The difference between predicted projections and observed/modeled quantities used as an optimization error. "minimizes the weighted reprojection residual"
- Root-mean-square error (RMSE): The square root of the mean of the squared errors, used here to match trajectories. "the lowest root-mean-square error (RMSE)"
- Self-calibration: Estimating camera parameters from image data alone, without special calibration targets. "perform self-calibration by leveraging geometric constraints across multiple images."
- Self-supervised calibration: Learning calibration from video by optimizing consistency losses without explicit labels. "performing self-supervised calibration from video"
- Shared intrinsics: A multi-view assumption that all frames share the same intrinsic parameters during joint optimization. "Shared Intrinsics constraint"
- SLAM (Simultaneous Localization and Mapping): Estimating camera trajectory and a map of the environment from sequential imagery. "visual SLAM [7, 11, 15, 17, 46]"
- Spherical manifold: A curved space with spherical geometry; used here for optimizing orientation vectors on the unit sphere. "per-view orientations on a spherical manifold."
- Structure-from-Motion (SfM): Recovering camera poses and 3D structure from multiple overlapping images. "Structure-from-Motion (SfM) [1, 4, 9, 41]"
- Unified Camera Model (UCM): A camera model unifying perspective and fisheye-like projections with a spherical parameterization. "the Unified Camera Model (UCM) [34]"
- Umeyama fitting: A method for estimating similarity transforms (scale, rotation, translation) between point sets. "using Umeyama fitting [48]"
- Up Field (U): The per-pixel 2D directions in the image pointing toward the zenith, used as an intermediate representation. "the Up Field (U) and Latitude Field (φ)"
- Up-vector field: A dense map of unit vectors on the image plane pointing toward the projected zenith. "Up-vector field $U \in \mathbb{R}^{H \times W \times 2}$"
- Vanishing points (VPs): Image points where projections of parallel 3D lines meet, revealing camera orientation and focal length cues. "Parallel lines converge at vanishing points (VPs)"
- VGGT: A 3D vision transformer model providing geometric priors used to initialize the feature extractor. "initialized with weights from a pre-trained VGGT model [52]"
- ViPE: A video pose estimator used to extract camera trajectories from video. "we first run panoramic visual SLAM [45] and ViPE [21] to extract camera trajectories"
- Vision-Language Model (VLM): A model jointly processing images and text for tasks like automatic quality filtering. "a Vision-Language Model (VLM)-based filtering pipeline"
- Visual odometry: Estimating the motion of a camera by analyzing sequential images. "such as visual odometry, Structure-from-Motion (SfM), and SLAM"
- Zenith: The upward vertical direction in the world frame; the gravity vector points toward it in the camera frame. "pointing towards the zenith."
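The AUC entry in the glossary can be made concrete with a short sketch. The version below follows a convention commonly used in pose and calibration benchmarks: sort the per-sample errors, build the cumulative recall curve, integrate it with the trapezoid rule up to each threshold, and normalize by the threshold. Whether the paper uses exactly this protocol is not stated here, so treat the function name `recall_auc` and the integration details as assumptions.

```python
from bisect import bisect_left


def recall_auc(errors, thresholds):
    """Area under the recall-vs-error curve, normalized per threshold.

    errors:     per-sample errors (e.g., degrees of roll, pitch, or FoV).
    thresholds: positive error thresholds (e.g., [1, 5, 10]).
    Returns one value in [0, 1] per threshold.
    """
    n = len(errors)
    # Prepend the origin so the curve starts at (error 0, recall 0).
    errs = [0.0] + sorted(errors)
    recall = [0.0] + [(i + 1) / n for i in range(n)]

    aucs = []
    for t in thresholds:
        # Keep curve points with error strictly below t, then extend the
        # curve horizontally to the threshold itself.
        last = bisect_left(errs, t)
        e = errs[:last] + [t]
        r = recall[:last] + [recall[last - 1]]
        # Trapezoid rule over the truncated curve.
        area = sum((e[i + 1] - e[i]) * (r[i + 1] + r[i]) / 2.0
                   for i in range(len(e) - 1))
        aucs.append(area / t)
    return aucs
```

Under this convention a method whose errors are all zero scores 1.0 at every threshold, and a method whose errors all exceed a threshold scores 0.0 there.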