
MapAnything: Universal 3D Scene Reconstruction

Updated 18 September 2025
  • MapAnything is a universal transformer-based model for metric 3D reconstruction, fusing multi-view images and diverse geometric inputs.
  • It employs a modular architecture with parallel image and geometry encoders, predicting per-view depth maps, ray maps, camera poses, and a global metric scale in a single pass.
  • Extensive evaluations show that MapAnything matches or outperforms specialist models in tasks like multi-view stereo, depth estimation, and camera localization.

MapAnything denotes a universal, transformer-based feed-forward model for metric 3D reconstruction from visual and geometric data, designed to address a broad spectrum of multi-view scene understanding tasks. The model accepts variable numbers of images along with optional geometric side-inputs (such as camera intrinsics, depth maps, partial reconstructions, or poses) and predicts per-view depth, ray maps, camera poses, and a global metric scale factor, thus constructing globally consistent metric 3D scenes in a single pass. This architecture standardizes training and supervision across diverse datasets and input modalities, resulting in a flexible and efficient universal backbone for structure-from-motion, multi-view stereo, monocular depth estimation, depth completion, and camera localization (Keetha et al., 16 Sep 2025).

1. Architectural Principles and Factored Representation

MapAnything is based on a modular transformer backbone with parallel image and geometry encoders. Visual inputs (N images) are encoded via a pre-trained DINOv2 vision transformer, yielding patch-wise feature maps. Geometric side-inputs—when available—are separately encoded through shallow convolutional or MLP branches, enabling flexible input combinations (e.g., intrinsics, depth, poses).

Inputs are fused as tokens, including a learnable scale token. The main transformer (24 layers, alternating attention) processes joint tokens, integrating information across all views and side-inputs. Decoding is factored into per-view dense heads (for depth and local rays) and a pose head (for quaternions and translations). A separate MLP processes the scale token to regress the global metric scale, which upgrades all up-to-scale outputs to metric reconstructions.
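
A compact PyTorch-style sketch of this layout is given below. Module names, dimensions, and the plain encoder stack standing in for the alternating-attention trunk are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MapAnythingSketch(nn.Module):
    """Schematic of the factored design described above (names and dims are illustrative)."""

    def __init__(self, dim=1024, depth=24, heads=16, patch=14):
        super().__init__()
        # Image branch: a stand-in for the pre-trained DINOv2 patch encoder.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Geometry branch: shallow MLP for optional side-inputs (rays, depth, pose), zero-filled when absent.
        self.geom_encoder = nn.Sequential(nn.Linear(8, dim), nn.GELU(), nn.Linear(dim, dim))
        # Learnable scale token (a single global token in the paper; one per stacked view here for simplicity).
        self.scale_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Main transformer trunk; the paper alternates within-view and cross-view attention,
        # a plain encoder stack is used here purely as a placeholder.
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=depth)
        # Factored decoding heads.
        self.dense_head = nn.Linear(dim, 5)   # per-patch ray (3) + up-to-scale depth (1) + confidence (1)
        self.pose_head = nn.Linear(dim, 7)    # quaternion (4) + up-to-scale translation (3) per view
        self.scale_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, images, geom=None):
        # images: (B*N, 3, H, W) stacked views; geom: optional (B*N, P, 8) per-patch side-features.
        tok = self.image_encoder(images).flatten(2).transpose(1, 2)       # (B*N, P, dim)
        if geom is not None:
            tok = tok + self.geom_encoder(geom)                           # fuse geometric side-inputs
        tok = torch.cat([self.scale_token.expand(tok.shape[0], -1, -1), tok], dim=1)
        tok = self.trunk(tok)
        scale = self.scale_head(tok[:, 0]).exp()                          # positive metric scale from the scale token
        dense = self.dense_head(tok[:, 1:])                               # rays, depth, confidence per patch
        pose = self.pose_head(tok[:, 1:].mean(dim=1))                     # pooled per-view pose
        return dense, pose, scale
```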

The factored representation comprises:

  • Local, per-view depth maps $\tilde{D}_i$ and unit ray maps $R_i$.
  • Camera poses as quaternions $Q_i$ and up-to-scale translations $T_i$.
  • A global scale $m$. World points are computed via $X_i = O_i \cdot (R_i \cdot \tilde{D}_i) + T_i$, where $O_i$ is the rotation corresponding to $Q_i$, and upgraded to metric via $m \cdot X_i$ (see the sketch below).
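
A minimal sketch of how these factors compose into metric world points, assuming the tensor shapes noted in the comments (the helper names are hypothetical):

```python
import torch

def quat_to_rotmat(q):
    """Convert unit quaternions (..., 4) in (w, x, y, z) order to rotation matrices (..., 3, 3)."""
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(*q.shape[:-1], 3, 3)

def compose_metric_points(rays, depth, quat, trans, scale):
    """
    rays:  (H, W, 3) unit ray map R_i          depth: (H, W) up-to-scale depth D~_i
    quat:  (4,) camera rotation Q_i            trans: (3,) up-to-scale translation T_i
    scale: () global metric scale m
    Returns metric world points m * X_i with X_i = O_i (R_i * D~_i) + T_i.
    """
    local_pts = rays * depth[..., None]                   # back-project: each pixel ray scaled by its depth
    O = quat_to_rotmat(quat)                              # O_i derived from the quaternion Q_i
    world_pts = local_pts @ O.transpose(-1, -2) + trans   # rigid transform into the world frame
    return scale * world_pts                              # upgrade to metric scale
```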

2. Input Modalities and Output Structure

MapAnything ingests:

  • Variable-length image sets.
  • Optional geometric data: camera intrinsics (ray maps), partial or dense depth, camera pose (quaternion and translation). Inputs are preprocessed to decouple factors and normalize global scale.

Outputs include:

  • For each view: pixelwise ray map $R_i$, up-to-scale depth $\tilde{D}_i$, and confidence/mask maps.
  • For each view: rotation (quaternion $Q_i$) and translation $T_i$.
  • A scalar metric scale $m$.

This schema permits both uncalibrated and calibrated inference, supporting heterogeneous input scenarios (image-only, image+intrinsics, image+depth, image+pose, etc.).
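
To make the schema concrete, the sketch below shows a hypothetical per-view input structure and two query configurations; the field names and their arrangement are assumptions for illustration, not the published API.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class ViewInput:
    """One view's inputs; every geometric field is optional and may be omitted independently."""
    image: np.ndarray                      # (H, W, 3) RGB
    ray_map: Optional[np.ndarray] = None   # (H, W, 3) unit rays derived from intrinsics
    depth: Optional[np.ndarray] = None     # (H, W) partial or dense depth, possibly up-to-scale
    quat: Optional[np.ndarray] = None      # (4,) camera rotation
    trans: Optional[np.ndarray] = None     # (3,) camera translation

# Dummy data for illustration.
images = [np.zeros((480, 640, 3), dtype=np.float32) for _ in range(2)]
rays = [np.zeros((480, 640, 3), dtype=np.float32) for _ in range(2)]
q0, t0 = np.array([1.0, 0.0, 0.0, 0.0]), np.zeros(3)

# Image-only (uncalibrated) query:
views_uncalibrated = [ViewInput(image=img) for img in images]

# Calibrated query with intrinsics for both views and a known pose for the first view only:
views_calibrated = [
    ViewInput(image=images[0], ray_map=rays[0], quat=q0, trans=t0),
    ViewInput(image=images[1], ray_map=rays[1]),
]
# A single forward pass would then return, per view, a ray map, up-to-scale depth,
# confidence mask, and pose (quaternion + translation), plus one global metric scale m.
```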

3. Training Paradigm and Supervision Standardization

Training is designed to unify supervision across heterogeneous datasets and annotation conventions. Loss terms include:

  • $L_1$ or robust adaptive losses on ray direction, rotation, and translation.
  • Scale-invariant losses on ray depth, local point maps, and world point maps, leveraging a custom log-space function $f_{\log}(x) = (x/\|x\|)\log(1+\|x\|)$ for scale robustness (see the sketch after this list).
  • Confidence-weighted losses for masks and points.
  • Normal losses for surface orientation.
  • Multi-scale gradient losses for spatial detail.
  • A quaternion-specific geodesic loss that accounts for the two-to-one mapping of quaternions onto rotations ($q$ and $-q$ represent the same rotation).
  • A dedicated loss on the metric scale $m$.
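
As a concrete illustration of the log-space mapping and a scale-invariant point loss built on it, here is a minimal sketch; confidence weighting and the remaining loss terms are omitted.

```python
import torch

def f_log(x, eps=1e-8):
    """Scale-robust mapping f_log(x) = (x / ||x||) * log(1 + ||x||), applied per 3D point."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return (x / norm) * torch.log1p(norm)

def scale_invariant_point_loss(pred_pts, gt_pts, valid):
    """L1 between log-mapped predicted and ground-truth point maps over a boolean validity mask."""
    diff = (f_log(pred_pts) - f_log(gt_pts)).abs().sum(dim=-1)   # (H, W) per-pixel error
    return (diff * valid).sum() / valid.sum().clamp_min(1)
```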

Training employs covisibility-based view sampling, probabilistic input selection for side-factors, and aggressive augmentation—enabling robust handling of missing or incomplete inputs. Supervisory signals are harmonized to account for up-to-scale, depth-only, and pose-only data annotations across all datasets.
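
A minimal sketch of the probabilistic side-input selection, reusing the hypothetical ViewInput structure from Section 2; the drop probabilities are illustrative, not the published schedule.

```python
import random

def sample_side_inputs(view, p_rays=0.5, p_depth=0.5, p_pose=0.5):
    """Randomly drop optional geometric inputs so the model learns to handle any input subset."""
    keep_pose = random.random() < p_pose   # rotation and translation are kept or dropped together
    return ViewInput(
        image=view.image,
        ray_map=view.ray_map if random.random() < p_rays else None,
        depth=view.depth if random.random() < p_depth else None,
        quat=view.quat if keep_pose else None,
        trans=view.trans if keep_pose else None,
    )
```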

4. Experimental Evaluation

Extensive experiments show that MapAnything matches or outperforms specialist feed-forward models in various tasks:

  • In multi-view dense reconstruction, it achieves lower absolute relative errors than dedicated stereo systems.
  • In two-view setups, addition of geometric side-inputs (intrinsics, pose, depth) further reduces error.
  • On single-view camera calibration from images, MapAnything achieves state-of-the-art angular error.
  • On depth benchmarks (ETH3D, ScanNet, KITTI), robust performance is demonstrated even with partial input modalities.
  • Joint training across tasks and datasets improves generalization and training efficiency, yielding a practical universal 3D reconstruction backbone.

5. Applications

The model’s versatility enables application in:

  • Uncalibrated structure-from-motion (SfM) and calibrated multi-view stereo (MVS).
  • Monocular and multi-view depth estimation.
  • Camera localization and pose refinement.
  • Depth completion from sparse or partial reconstructions.
  • Robotics, AR/VR scene understanding, geospatial mapping, and rapid metric 3D reconstruction from diverse sensor suites.

Its feed-forward formulation supports rapid inference, efficient memory usage, and adaptability to variable input sizes and modalities.

6. Limitations and Future Prospects

The current feed-forward approach operates with fixed input-output correspondences, limiting scalability for very large, high-resolution scenes. The metric scale estimation assumes well-behaved side-inputs (poses, depth); explicit uncertainty modeling for noisy inputs is under exploration. Potential future directions include:

  • Dynamic scene parameterization (tracking motion and scene flow).
  • Iterative or test-time refinement loops for large-scale scenes.
  • Advanced multimodal fusion (late cross-attention, hierarchical encoding).
  • Improving decoupled output efficiency (beyond one-to-one pixel/point correspondence).

This model establishes a formal framework for universal, metric 3D scene reconstruction, bridging feed-forward transformer architectures with flexible, standardized multi-modal input/output processing, and sets the foundation for future advances in general-purpose 3D vision systems.
