MapAnything: Universal 3D Scene Reconstruction
- MapAnything is a universal transformer-based model for metric 3D reconstruction, fusing multi-view images and diverse geometric inputs.
- It employs a modular architecture with parallel image and geometry encoders, predicting per-view depth maps, ray maps, camera poses, and a global metric scale in a single pass.
- Extensive evaluations show that MapAnything matches or outperforms specialist models on tasks such as multi-view stereo, depth estimation, and camera localization.
MapAnything is a universal, transformer-based feed-forward model for metric 3D reconstruction from visual and geometric data, designed to address a broad spectrum of multi-view scene understanding tasks. The model accepts variable numbers of images along with optional geometric side-inputs (such as camera intrinsics, depth maps, partial reconstructions, or poses) and predicts per-view depth, ray maps, camera poses, and a global metric scale factor, constructing globally consistent metric 3D scenes in a single pass. This architecture standardizes training and supervision across diverse datasets and input modalities, resulting in a flexible and efficient universal backbone for structure-from-motion, multi-view stereo, monocular depth estimation, depth completion, and camera localization (Keetha et al., 16 Sep 2025).
1. Architectural Principles and Factored Representation
MapAnything is based on a modular transformer backbone with parallel image and geometry encoders. Visual inputs (N images) are encoded via a pre-trained DINOv2 vision transformer, yielding patch-wise feature maps. Geometric side-inputs—when available—are separately encoded through shallow convolutional or MLP branches, enabling flexible input combinations (e.g., intrinsics, depth, poses).
Inputs are fused as tokens, including a learnable scale token. The main transformer (24 layers, alternating attention) processes joint tokens, integrating information across all views and side-inputs. Decoding is factored into per-view dense heads (for depth and local rays) and a pose head (for quaternions and translations). A separate MLP processes the scale token to regress the global metric scale, which upgrades all up-to-scale outputs to metric reconstructions.
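The following is a minimal PyTorch sketch of this factored design under simplified assumptions: a stand-in patch embedding replaces the pre-trained DINOv2 encoder, a generic transformer encoder stands in for the 24-layer alternating-attention backbone, and all module and parameter names are illustrative rather than taken from the released MapAnything code.

```python
# Minimal, hypothetical sketch of MapAnything-style factored decoding.
import torch
import torch.nn as nn

class FactoredReconstructionModel(nn.Module):
    def __init__(self, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        # Stand-in for the pre-trained DINOv2 image encoder: a simple patch embedding.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Shallow branch for optional geometric side-inputs (e.g., ray map + depth channels).
        self.geometry_encoder = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        # Learnable global scale token shared across all views.
        self.scale_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Stand-in for the 24-layer alternating-attention transformer.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Per-view dense head: 3 ray-direction channels + 1 depth channel per pixel of a patch.
        self.dense_head = nn.Linear(dim, 4 * patch * patch)
        # Per-view pose head: quaternion (4) + up-to-scale translation (3).
        self.pose_head = nn.Linear(dim, 7)
        # Global metric scale regressed from the scale token.
        self.scale_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, images, geometry=None):
        # images: (N, 3, H, W); geometry: optional (N, 4, H, W) side-input stack.
        tokens = self.image_encoder(images).flatten(2).transpose(1, 2)  # (N, P, dim)
        if geometry is not None:
            tokens = tokens + self.geometry_encoder(geometry).flatten(2).transpose(1, 2)
        n, p, d = tokens.shape
        # Flatten all views into one joint token sequence and prepend the scale token.
        joint = torch.cat([self.scale_token, tokens.reshape(1, n * p, d)], dim=1)
        joint = self.transformer(joint)
        scale_tok, view_tokens = joint[:, 0], joint[:, 1:].reshape(n, p, d)
        rays_and_depth = self.dense_head(view_tokens)    # per-patch dense outputs (unpatchify omitted)
        pose = self.pose_head(view_tokens.mean(dim=1))   # (N, 7): quaternion + translation
        metric_scale = self.scale_head(scale_tok).exp()  # positive global scale
        return rays_and_depth, pose, metric_scale

# Example usage with four dummy views:
model = FactoredReconstructionModel()
dense, pose, scale = model(torch.rand(4, 3, 224, 224))
```

The key point illustrated here is that the scale token attends jointly with all view tokens, so the regressed metric scale can draw on evidence from every view and side-input.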
The factored representation comprises:
- Local, per-view unit ray maps $r_i$ and up-to-scale depths along rays $z_i$.
- Camera poses as unit quaternions $q_i$ and up-to-scale translations $t_i$.
- A global metric scale $m$. Up-to-scale world points are computed via $X_i = \mathrm{R}(q_i)\,(z_i \odot r_i) + t_i$, and upgraded to metric via $m \cdot X_i$; a sketch of this composition follows the list.
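As a concrete illustration, the sketch below composes per-view rays, depths, pose, and the global scale into metric world points following the equations above; the variable names and the (w, x, y, z) quaternion convention are assumptions, not taken from the released implementation.

```python
# Hedged sketch: composing the factored outputs (rays, depths, pose, scale)
# into metric 3D world points.
import torch

def quaternion_to_rotation_matrix(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = (q / q.norm()).unbind()
    row0 = torch.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)])
    row1 = torch.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)])
    row2 = torch.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)])
    return torch.stack([row0, row1, row2])

def metric_world_points(rays, depths, quat, trans, metric_scale):
    """rays: (H, W, 3) unit ray directions in the camera frame,
    depths: (H, W) up-to-scale depth along each ray,
    quat: (4,) camera-to-world rotation, trans: (3,) up-to-scale translation,
    metric_scale: scalar m that upgrades the scene to metric units."""
    local_points = depths.unsqueeze(-1) * rays      # X_local = z ⊙ r
    R = quaternion_to_rotation_matrix(quat)
    world_points = local_points @ R.T + trans       # X = R(q) X_local + t
    return metric_scale * world_points              # metric upgrade: m · X

# Example usage with dummy tensors:
H, W = 4, 5
rays = torch.nn.functional.normalize(torch.randn(H, W, 3), dim=-1)
depths = torch.rand(H, W) + 0.5
quat = torch.tensor([1.0, 0.0, 0.0, 0.0])   # identity rotation
trans = torch.tensor([0.1, 0.0, 2.0])
points = metric_world_points(rays, depths, quat, trans, metric_scale=3.2)
print(points.shape)  # torch.Size([4, 5, 3])
```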
2. Input Modalities and Output Structure
MapAnything ingests:
- Variable-length image sets.
- Optional geometric data: camera intrinsics (ray maps), partial or dense depth, camera pose (quaternion and translation). Inputs are preprocessed to decouple factors and normalize global scale.
Outputs include:
- For each view: a pixelwise unit ray map $r_i$, up-to-scale depth $z_i$, and confidence/mask maps.
- For each view: rotation (quaternion $q_i$) and up-to-scale translation $t_i$.
- A scalar metric scale $m$.
This schema permits both uncalibrated and calibrated inference, supporting heterogeneous input scenarios (image-only, image+intrinsics, image+depth, image+pose, etc.).
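A minimal sketch of such a heterogeneous per-view input record is shown below; the field names and dataclass layout are hypothetical, intended only to illustrate how optional side-inputs can be present or absent per view.

```python
# Hypothetical per-view input record covering the heterogeneous configurations above.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class ViewInput:
    image: torch.Tensor                      # (3, H, W) RGB, always required
    ray_map: Optional[torch.Tensor] = None   # (3, H, W) unit rays, encoding intrinsics
    depth: Optional[torch.Tensor] = None     # (H, W) partial or dense depth
    pose: Optional[torch.Tensor] = None      # (7,) quaternion + translation

# Image-only (uncalibrated) inference:
views = [ViewInput(image=torch.rand(3, 224, 224)) for _ in range(4)]

# Calibrated inference: same schema, with known intrinsics supplied as ray maps.
calibrated_views = [
    ViewInput(image=torch.rand(3, 224, 224),
              ray_map=torch.nn.functional.normalize(torch.rand(3, 224, 224), dim=0))
    for _ in range(4)
]
```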
3. Training Paradigm and Supervision Standardization
Training is designed to unify supervision across heterogeneous datasets and annotation conventions. Loss terms include:
- Standard or robust adaptive losses on ray direction, rotation, and translation.
- Scale-invariant losses on ray depth, local point maps, and world point maps, leveraging a custom log-space function for scale robustness.
- Confidence-weighted losses for masks and points.
- Normal losses for surface orientation.
- Multi-scale gradient losses for spatial detail.
- Quaternion-specific geodesic loss handling the two-to-one mapping of quaternions onto rotations ($q$ and $-q$ encode the same rotation); a sketch of this and the scale-invariant depth loss follows this list.
- A dedicated loss on the metric scale $m$.
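Below is a hedged sketch of two of these loss terms, a scale-invariant log-space depth loss and a sign-invariant quaternion geodesic loss; the exact formulations and weightings used by MapAnything may differ.

```python
# Illustrative formulations, not necessarily the ones used in the paper.
import torch

def scale_invariant_log_depth_loss(pred, target, eps=1e-6):
    """Compare depths in log space after removing the per-scene scale offset."""
    log_diff = torch.log(pred + eps) - torch.log(target + eps)
    # Subtracting the mean log-difference cancels any global scale factor.
    return (log_diff - log_diff.mean()).abs().mean()

def quaternion_geodesic_loss(q_pred, q_gt):
    """Angular distance between unit quaternions, treating q and -q as equal."""
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)
    q_gt = q_gt / q_gt.norm(dim=-1, keepdim=True)
    dot = (q_pred * q_gt).sum(dim=-1).abs().clamp(max=1.0)  # |<q_pred, q_gt>|
    return (2.0 * torch.acos(dot)).mean()                   # geodesic angle on SO(3)

# Example usage with dummy predictions:
pred_depth, gt_depth = torch.rand(2, 64, 64) + 0.1, torch.rand(2, 64, 64) + 0.1
print(scale_invariant_log_depth_loss(pred_depth, gt_depth))
print(quaternion_geodesic_loss(torch.randn(8, 4), torch.randn(8, 4)))
```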
Training employs covisibility-based view sampling, probabilistic input selection for side-factors, and aggressive augmentation—enabling robust handling of missing or incomplete inputs. Supervisory signals are harmonized to account for up-to-scale, depth-only, and pose-only data annotations across all datasets.
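A minimal sketch of the probabilistic side-input selection is given below; the keep probabilities are illustrative placeholders, not the values used during MapAnything training.

```python
# Hedged sketch: independently drop optional geometric factors per view so the
# model learns to handle every input combination (image-only, image+depth, etc.).
import random

KEEP_PROB = {"ray_map": 0.5, "depth": 0.5, "pose": 0.5}  # illustrative values

def sample_side_inputs(view: dict) -> dict:
    """Independently keep or drop each optional geometric side-input of one view."""
    sampled = dict(view)
    for key, p in KEEP_PROB.items():
        if sampled.get(key) is not None and random.random() > p:
            sampled[key] = None   # the model must cope with this missing factor
    return sampled

# Example: a view with intrinsics and depth available but no pose.
view = {"image": "rgb.png", "ray_map": "rays.npy", "depth": "depth.npy", "pose": None}
print(sample_side_inputs(view))
```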
4. Experimental Evaluation
Extensive experiments show that MapAnything matches or outperforms specialist feed-forward models in various tasks:
- In multi-view dense reconstruction, it achieves lower absolute relative errors than dedicated stereo systems.
- In two-view setups, addition of geometric side-inputs (intrinsics, pose, depth) further reduces error.
- On single-view camera calibration from images, MapAnything achieves state-of-the-art (lowest) angular prediction error.
- On depth benchmarks (ETH3D, ScanNet, KITTI), robust performance is demonstrated even with partial input modalities.
- Joint training across tasks and datasets improves generalization and training efficiency, yielding a practical universal 3D reconstruction backbone.
5. Applications
The model’s versatility enables application in:
- Uncalibrated structure-from-motion (SfM) and calibrated multi-view stereo (MVS).
- Monocular and multi-view depth estimation.
- Camera localization and pose refinement.
- Depth completion from sparse or partial reconstructions.
- Robotics, AR/VR scene understanding, geospatial mapping, and rapid metric 3D reconstruction from diverse sensor suites.
Its feed-forward formulation supports rapid inference, efficient memory usage, and adaptability to variable input sizes and modalities.
6. Limitations and Future Prospects
The current feed-forward approach operates with fixed input-output correspondences, limiting scalability for very large, high-resolution scenes. The metric scale estimation assumes well-behaved side-inputs (poses, depth); explicit uncertainty modeling for noisy inputs is under exploration. Potential future directions include:
- Dynamic scene parameterization (tracking motion and scene flow).
- Iterative or test-time refinement loops for large-scale scenes.
- Advanced multimodal fusion (late cross-attention, hierarchical encoding).
- Improving decoupled output efficiency (beyond one-to-one pixel/point correspondence).
This model establishes a formal framework for universal, metric 3D scene reconstruction, bridging feed-forward transformer architectures with flexible, standardized multi-modal input/output processing, and sets the foundation for future advances in general-purpose 3D vision systems.