MapAnything: Universal 3D Scene Reconstruction
- MapAnything is a universal transformer-based model for metric 3D reconstruction, fusing multi-view images and diverse geometric inputs.
- It employs a modular architecture with parallel image and geometry encoders, predicting per-view depth maps, ray maps, camera poses, and a global metric scale in a single pass.
- Extensive evaluations show that MapAnything matches or outperforms specialist models on tasks such as multi-view stereo, depth estimation, and camera localization.
MapAnything is a universal, transformer-based feed-forward model for metric 3D reconstruction from visual and geometric data, designed to address a broad spectrum of multi-view scene understanding tasks. The model accepts variable numbers of images along with optional geometric side-inputs (such as camera intrinsics, depth maps, partial reconstructions, or poses) and predicts per-view depth, ray maps, camera poses, and a global metric scale factor, constructing globally consistent metric 3D scenes in a single pass. This architecture standardizes training and supervision across diverse datasets and input modalities, resulting in a flexible and efficient universal backbone for structure-from-motion, multi-view stereo, monocular depth estimation, depth completion, and camera localization (Keetha et al., 16 Sep 2025).
1. Architectural Principles and Factored Representation
MapAnything is based on a modular transformer backbone with parallel image and geometry encoders. Visual inputs (N images) are encoded via a pre-trained DINOv2 vision transformer, yielding patch-wise feature maps. Geometric side-inputs—when available—are separately encoded through shallow convolutional or MLP branches, enabling flexible input combinations (e.g., intrinsics, depth, poses).
Inputs are fused as tokens, including a learnable scale token. The main transformer (24 layers, alternating attention) processes joint tokens, integrating information across all views and side-inputs. Decoding is factored into per-view dense heads (for depth and local rays) and a pose head (for quaternions and translations). A separate MLP processes the scale token to regress the global metric scale, which upgrades all up-to-scale outputs to metric reconstructions.
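The following is a minimal PyTorch sketch of this factored design under simplified assumptions: a stand-in patch embedding replaces the pre-trained DINOv2 encoder, a generic transformer encoder stands in for the 24-layer alternating-attention backbone, and all module and parameter names are illustrative rather than taken from the released MapAnything code.

```python
# Minimal, hypothetical sketch of MapAnything-style factored decoding.
import torch
import torch.nn as nn

class FactoredReconstructionModel(nn.Module):
    def __init__(self, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        # Stand-in for the pre-trained DINOv2 image encoder: a simple patch embedding.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Shallow branch for optional geometric side-inputs (e.g., ray map + depth channels).
        self.geometry_encoder = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        # Learnable global scale token shared across all views.
        self.scale_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Stand-in for the 24-layer alternating-attention transformer.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Per-view dense head: 3 ray-direction channels + 1 depth channel per pixel of a patch.
        self.dense_head = nn.Linear(dim, 4 * patch * patch)
        # Per-view pose head: quaternion (4) + up-to-scale translation (3).
        self.pose_head = nn.Linear(dim, 7)
        # Global metric scale regressed from the scale token.
        self.scale_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, images, geometry=None):
        # images: (N, 3, H, W); geometry: optional (N, 4, H, W) side-input stack.
        tokens = self.image_encoder(images).flatten(2).transpose(1, 2)  # (N, P, dim)
        if geometry is not None:
            tokens = tokens + self.geometry_encoder(geometry).flatten(2).transpose(1, 2)
        n, p, d = tokens.shape
        # Flatten all views into one joint token sequence and prepend the scale token.
        joint = torch.cat([self.scale_token, tokens.reshape(1, n * p, d)], dim=1)
        joint = self.transformer(joint)
        scale_tok, view_tokens = joint[:, 0], joint[:, 1:].reshape(n, p, d)
        rays_and_depth = self.dense_head(view_tokens)    # per-patch dense outputs (unpatchify omitted)
        pose = self.pose_head(view_tokens.mean(dim=1))   # (N, 7): quaternion + translation
        metric_scale = self.scale_head(scale_tok).exp()  # positive global scale
        return rays_and_depth, pose, metric_scale

# Example usage with four dummy views:
model = FactoredReconstructionModel()
dense, pose, scale = model(torch.rand(4, 3, 224, 224))
```

The key point illustrated here is that the scale token attends jointly with all view tokens, so the regressed metric scale can draw on evidence from every view and side-input.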
The factored representation comprises:
- Local, per-view unit ray maps $r_i$ and up-to-scale depths along rays $z_i$.
- Camera poses as unit quaternions $q_i$ and up-to-scale translations $t_i$.
- A global metric scale $m$. Up-to-scale world points are computed via $X_i = \mathrm{R}(q_i)\,(z_i \odot r_i) + t_i$, and upgraded to metric via $m \cdot X_i$; a sketch of this composition follows the list.
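As a concrete illustration, the sketch below composes per-view rays, depths, pose, and the global scale into metric world points following the equations above; the variable names and the (w, x, y, z) quaternion convention are assumptions, not taken from the released implementation.

```python
# Hedged sketch: composing the factored outputs (rays, depths, pose, scale)
# into metric 3D world points.
import torch

def quaternion_to_rotation_matrix(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = (q / q.norm()).unbind()
    row0 = torch.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)])
    row1 = torch.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)])
    row2 = torch.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)])
    return torch.stack([row0, row1, row2])

def metric_world_points(rays, depths, quat, trans, metric_scale):
    """rays: (H, W, 3) unit ray directions in the camera frame,
    depths: (H, W) up-to-scale depth along each ray,
    quat: (4,) camera-to-world rotation, trans: (3,) up-to-scale translation,
    metric_scale: scalar m that upgrades the scene to metric units."""
    local_points = depths.unsqueeze(-1) * rays      # X_local = z ⊙ r
    R = quaternion_to_rotation_matrix(quat)
    world_points = local_points @ R.T + trans       # X = R(q) X_local + t
    return metric_scale * world_points              # metric upgrade: m · X

# Example usage with dummy tensors:
H, W = 4, 5
rays = torch.nn.functional.normalize(torch.randn(H, W, 3), dim=-1)
depths = torch.rand(H, W) + 0.5
quat = torch.tensor([1.0, 0.0, 0.0, 0.0])   # identity rotation
trans = torch.tensor([0.1, 0.0, 2.0])
points = metric_world_points(rays, depths, quat, trans, metric_scale=3.2)
print(points.shape)  # torch.Size([4, 5, 3])
```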
2. Input Modalities and Output Structure
MapAnything ingests:
- Variable-length image sets.
- Optional geometric data: camera intrinsics (ray maps), partial or dense depth, camera pose (quaternion and translation). Inputs are preprocessed to decouple factors and normalize global scale.
Outputs include:
- For each view: a pixelwise unit ray map $r_i$, up-to-scale depth $z_i$, and confidence/mask maps.
- For each view: rotation (quaternion $q_i$) and up-to-scale translation $t_i$.
- A scalar metric scale $m$.
This schema permits both uncalibrated and calibrated inference, supporting heterogeneous input scenarios (image-only, image+intrinsics, image+depth, image+pose, etc.).
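A minimal sketch of such a heterogeneous per-view input record is shown below; the field names and dataclass layout are hypothetical, intended only to illustrate how optional side-inputs can be present or absent per view.

```python
# Hypothetical per-view input record covering the heterogeneous configurations above.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class ViewInput:
    image: torch.Tensor                      # (3, H, W) RGB, always required
    ray_map: Optional[torch.Tensor] = None   # (3, H, W) unit rays, encoding intrinsics
    depth: Optional[torch.Tensor] = None     # (H, W) partial or dense depth
    pose: Optional[torch.Tensor] = None      # (7,) quaternion + translation

# Image-only (uncalibrated) inference:
views = [ViewInput(image=torch.rand(3, 224, 224)) for _ in range(4)]

# Calibrated inference: same schema, with known intrinsics supplied as ray maps.
calibrated_views = [
    ViewInput(image=torch.rand(3, 224, 224),
              ray_map=torch.nn.functional.normalize(torch.rand(3, 224, 224), dim=0))
    for _ in range(4)
]
```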
3. Training Paradigm and Supervision Standardization
Training is designed to unify supervision across heterogeneous datasets and annotation conventions. Loss terms include:
- Standard or robust adaptive losses on ray direction, rotation, and translation.
- Scale-invariant losses on ray depth, local point maps, and world point maps, leveraging a custom log-space function for scale robustness.
- Confidence-weighted losses for masks and points.
- Normal losses for surface orientation.
- Multi-scale gradient losses for spatial detail.
- Quaternion-specific geodesic loss handling the two-to-one mapping of quaternions onto rotations ($q$ and $-q$ encode the same rotation); a sketch of this and the scale-invariant depth loss follows this list.
- A dedicated loss on the metric scale $m$.
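Below is a hedged sketch of two of these loss terms, a scale-invariant log-space depth loss and a sign-invariant quaternion geodesic loss; the exact formulations and weightings used by MapAnything may differ.

```python
# Illustrative formulations, not necessarily the ones used in the paper.
import torch

def scale_invariant_log_depth_loss(pred, target, eps=1e-6):
    """Compare depths in log space after removing the per-scene scale offset."""
    log_diff = torch.log(pred + eps) - torch.log(target + eps)
    # Subtracting the mean log-difference cancels any global scale factor.
    return (log_diff - log_diff.mean()).abs().mean()

def quaternion_geodesic_loss(q_pred, q_gt):
    """Angular distance between unit quaternions, treating q and -q as equal."""
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)
    q_gt = q_gt / q_gt.norm(dim=-1, keepdim=True)
    dot = (q_pred * q_gt).sum(dim=-1).abs().clamp(max=1.0)  # |<q_pred, q_gt>|
    return (2.0 * torch.acos(dot)).mean()                   # geodesic angle on SO(3)

# Example usage with dummy predictions:
pred_depth, gt_depth = torch.rand(2, 64, 64) + 0.1, torch.rand(2, 64, 64) + 0.1
print(scale_invariant_log_depth_loss(pred_depth, gt_depth))
print(quaternion_geodesic_loss(torch.randn(8, 4), torch.randn(8, 4)))
```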
Training employs covisibility-based view sampling, probabilistic input selection for side-factors, and aggressive augmentation—enabling robust handling of missing or incomplete inputs. Supervisory signals are harmonized to account for up-to-scale, depth-only, and pose-only data annotations across all datasets.
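A minimal sketch of the probabilistic side-input selection is given below; the keep probabilities are illustrative placeholders, not the values used during MapAnything training.

```python
# Hedged sketch: independently drop optional geometric factors per view so the
# model learns to handle every input combination (image-only, image+depth, etc.).
import random

KEEP_PROB = {"ray_map": 0.5, "depth": 0.5, "pose": 0.5}  # illustrative values

def sample_side_inputs(view: dict) -> dict:
    """Independently keep or drop each optional geometric side-input of one view."""
    sampled = dict(view)
    for key, p in KEEP_PROB.items():
        if sampled.get(key) is not None and random.random() > p:
            sampled[key] = None   # the model must cope with this missing factor
    return sampled

# Example: a view with intrinsics and depth available but no pose.
view = {"image": "rgb.png", "ray_map": "rays.npy", "depth": "depth.npy", "pose": None}
print(sample_side_inputs(view))
```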
4. Experimental Evaluation
Extensive experiments show that MapAnything matches or outperforms specialist feed-forward models in various tasks:
- In multi-view dense reconstruction, it achieves lower absolute relative errors than dedicated stereo systems.
- In two-view setups, addition of geometric side-inputs (intrinsics, pose, depth) further reduces error.
- On single-view camera calibration from images, MapAnything achieves state-of-the-art (lowest) angular prediction error.
- On depth benchmarks (ETH3D, ScanNet, KITTI), robust performance is demonstrated even with partial input modalities.
- Joint training across tasks and datasets improves generalization and training efficiency, yielding a practical universal 3D reconstruction backbone.
5. Applications
The model’s versatility enables application in:
- Uncalibrated structure-from-motion (SfM) and calibrated multi-view stereo (MVS).
- Monocular and multi-view depth estimation.
- Camera localization and pose refinement.
- Depth completion from sparse or partial reconstructions.
- Robotics, AR/VR scene understanding, geospatial mapping, and rapid metric 3D reconstruction from diverse sensor suites.
Its feed-forward formulation supports rapid inference, efficient memory usage, and adaptability to variable input sizes and modalities.
6. Limitations and Future Prospects
The current feed-forward approach operates with fixed input-output correspondences, limiting scalability for very large, high-resolution scenes. The metric scale estimation assumes well-behaved side-inputs (poses, depth); explicit uncertainty modeling for noisy inputs is under exploration. Potential future directions include:
- Dynamic scene parameterization (tracking motion and scene flow).
- Iterative or test-time refinement loops for large-scale scenes.
- Advanced multimodal fusion (late cross-attention, hierarchical encoding).
- Improving decoupled output efficiency (beyond one-to-one pixel/point correspondence).
This model establishes a formal framework for universal, metric 3D scene reconstruction, bridging feed-forward transformer architectures with flexible, standardized multi-modal input/output processing, and sets the foundation for future advances in general-purpose 3D vision systems.