
MapAnything: Universal 3D Scene Reconstruction

Updated 18 September 2025
  • MapAnything is a universal transformer-based model for metric 3D reconstruction, fusing multi-view images and diverse geometric inputs.
  • It employs a modular architecture with parallel image and geometry encoders, processing per-view depth maps, camera poses, and a global scale in a single pass.
  • Extensive evaluations show that MapAnything matches or outperforms specialist models on tasks such as multi-view stereo, depth estimation, and camera localization.

MapAnything is a universal, transformer-based feed-forward model for metric 3D reconstruction from visual and geometric data, designed to address a broad spectrum of multi-view scene understanding tasks. The model accepts a variable number of images along with optional geometric side-inputs (such as camera intrinsics, depth maps, partial reconstructions, or poses) and predicts per-view depth, ray maps, camera poses, and a global metric scale factor, constructing globally consistent metric 3D scenes in a single pass. This design standardizes training and supervision across diverse datasets and input modalities, yielding a flexible and efficient universal backbone for structure-from-motion, multi-view stereo, monocular depth estimation, depth completion, and camera localization (Keetha et al., 16 Sep 2025).

1. Architectural Principles and Factored Representation

MapAnything is based on a modular transformer backbone with parallel image and geometry encoders. Visual inputs (N images) are encoded via a pre-trained DINOv2 vision transformer, yielding patch-wise feature maps. Geometric side-inputs—when available—are separately encoded through shallow convolutional or MLP branches, enabling flexible input combinations (e.g., intrinsics, depth, poses).

Inputs are fused as tokens, including a learnable scale token. The main transformer (24 layers, alternating attention) processes joint tokens, integrating information across all views and side-inputs. Decoding is factored into per-view dense heads (for depth and local rays) and a pose head (for quaternions and translations). A separate MLP processes the scale token to regress the global metric scale, which upgrades all up-to-scale outputs to metric reconstructions.
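The description above can be made concrete with a minimal PyTorch-style skeleton. This is a sketch under stated assumptions: the module names, token dimensions, and head layouts are illustrative rather than taken from the released implementation, and a plain patch embedding stands in for the pre-trained DINOv2 encoder.

```python
import torch
import torch.nn as nn

class MapAnythingSketch(nn.Module):
    """Illustrative skeleton of the modular design: parallel image/geometry
    encoders, a joint multi-view transformer, and factored output heads.
    All names and sizes are assumptions for exposition."""

    def __init__(self, dim=1024, depth=24, heads=16, patch=14):
        super().__init__()
        # Image encoder (DINOv2 ViT in the paper; a plain patch embedding stands in here).
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Shallow encoders for optional geometric side-inputs.
        self.ray_encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # intrinsics as ray maps
        self.depth_encoder = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # (partial) depth maps
        self.pose_encoder = nn.Sequential(nn.Linear(7, dim), nn.GELU(), nn.Linear(dim, dim))
        # Learnable scale token appended to the fused token sequence.
        self.scale_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Main transformer: 24 blocks (the paper alternates per-view and global attention).
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True) for _ in range(depth)]
        )
        # Factored decoding: per-view dense head, pose head, and scale MLP.
        self.dense_head = nn.Linear(dim, patch * patch * (1 + 3 + 1))  # per-patch depth, ray dirs, confidence
        self.pose_head = nn.Linear(dim, 4 + 3)                          # quaternion + up-to-scale translation
        self.scale_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))  # global metric scale
```

In this reading, tokens from whichever encoders have inputs are concatenated per view together with the scale token, passed through the transformer blocks, and routed to the three heads.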

The factored representation comprises:

  • Local, per-view depth maps $\tilde{D}_i$ and unit ray maps $R_i$.
  • Camera poses as quaternions $Q_i$ and up-to-scale translations $T_i$.
  • A global metric scale $m$. World points are computed as $X_i = O_i \cdot (R_i \cdot \tilde{D}_i) + T_i$, where $O_i$ is the rotation matrix corresponding to $Q_i$, and upgraded to metric units via $m \cdot X_i$ (see the sketch after this list).
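The sketch below shows how a single view's factored outputs can be assembled into metric world points, assuming $O_i$ is the rotation matrix of the camera-to-world quaternion $Q_i$; the function names and tensor layouts are illustrative assumptions.

```python
import torch

def quat_to_rotmat(q: torch.Tensor) -> torch.Tensor:
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / q.norm()
    return torch.stack([
        torch.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)]),
        torch.stack([2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)]),
        torch.stack([2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)]),
    ])

def factored_to_world_points(depth, rays, quat, trans, scale):
    """Assemble X_i = O_i (R_i * D_i) + T_i from one view's factored outputs,
    then upgrade to metric units with the global scale m.

    depth: (H, W) up-to-scale depth; rays: (H, W, 3) unit ray map;
    quat: (4,) camera-to-world rotation; trans: (3,) up-to-scale translation;
    scale: scalar metric scale m.
    """
    local_points = rays * depth.unsqueeze(-1)              # per-pixel 3D points in the camera frame
    O = quat_to_rotmat(quat)                               # rotation matrix from the quaternion
    world = local_points.reshape(-1, 3) @ O.T + trans      # rotate into the world frame, then translate
    return scale * world.reshape(*depth.shape, 3)          # metric world points m * X_i
```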

2. Input Modalities and Output Structure

MapAnything ingests:

  • Variable-length image sets.
  • Optional geometric data: camera intrinsics (ray maps), partial or dense depth, camera pose (quaternion and translation). Inputs are preprocessed to decouple factors and normalize global scale.

Outputs include:

  • For each view: a pixelwise ray map $R_i$, up-to-scale depth $\tilde{D}_i$, and confidence/mask maps.
  • For each view: a rotation (quaternion $Q_i$) and translation $T_i$.
  • A scalar metric scale $m$.

This schema permits both uncalibrated and calibrated inference, supporting heterogeneous input scenarios (image-only, image+intrinsics, image+depth, image+pose, etc.).
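A minimal sketch of this input/output schema as Python dataclasses; the field names and tensor shapes are assumptions chosen to mirror the description above, not the released API.

```python
from dataclasses import dataclass
from typing import List, Optional

import torch

@dataclass
class ViewInput:
    """Per-view inputs; everything except the image is optional."""
    image: torch.Tensor                       # (3, H, W) RGB
    ray_map: Optional[torch.Tensor] = None    # (3, H, W) ray directions from camera intrinsics
    depth: Optional[torch.Tensor] = None      # (H, W), possibly sparse or partial
    pose: Optional[torch.Tensor] = None       # (7,) quaternion + translation

@dataclass
class ViewOutput:
    """Per-view factored predictions."""
    ray_map: torch.Tensor       # (3, H, W) unit ray directions
    depth: torch.Tensor         # (H, W) up-to-scale depth
    confidence: torch.Tensor    # (H, W) confidence / validity mask
    quaternion: torch.Tensor    # (4,) rotation
    translation: torch.Tensor   # (3,) up-to-scale translation

@dataclass
class SceneOutput:
    views: List[ViewOutput]
    metric_scale: float         # global scale m shared by all views
```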

3. Training Paradigm and Supervision Standardization

Training is designed to unify supervision across heterogeneous datasets and annotation conventions. Loss terms include:

  • $L_1$ or robust adaptive losses on ray directions, rotations, and translations.
  • Scale-invariant losses on ray depth, local point maps, and world point maps, using the log-space mapping $f_{\log}(x) = \frac{x}{\|x\|}\log(1 + \|x\|)$ for scale robustness (sketched in code after this list).
  • Confidence-weighted losses for masks and points.
  • Normal losses for surface orientation.
  • Multi-scale gradient losses for spatial detail.
  • A quaternion geodesic loss that accounts for the two-to-one mapping of quaternions onto rotations.
  • A dedicated loss on the metric scale $m$.
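The sketch below illustrates two of these terms: the scale-robust log-space mapping $f_{\log}$ applied to point maps, and a sign-invariant quaternion rotation loss. The function names and reductions are assumptions; the full training objective combines many more terms and weights.

```python
import torch

def f_log(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """f_log(x) = (x / ||x||) * log(1 + ||x||), applied per 3D point."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return (x / norm) * torch.log1p(norm)

def pointmap_loss(pred_pts: torch.Tensor, gt_pts: torch.Tensor) -> torch.Tensor:
    """Scale-robust L1 loss on (local or world) point maps via the log-space mapping."""
    return (f_log(pred_pts) - f_log(gt_pts)).abs().mean()

def quat_geodesic_loss(q_pred: torch.Tensor, q_gt: torch.Tensor) -> torch.Tensor:
    """Rotation loss that is invariant to quaternion sign, handling the
    two-to-one mapping of quaternions onto rotations (q and -q are the same rotation)."""
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)
    q_gt = q_gt / q_gt.norm(dim=-1, keepdim=True)
    dot = (q_pred * q_gt).sum(dim=-1).abs().clamp(max=1.0 - 1e-7)  # |<q_pred, q_gt>| removes the sign ambiguity
    return (2.0 * torch.acos(dot)).mean()                          # angular distance in radians
```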

Training employs covisibility-based view sampling, probabilistic selection of geometric side-inputs, and aggressive augmentation, enabling robust handling of missing or incomplete inputs. Supervisory signals are harmonized to account for up-to-scale, depth-only, and pose-only annotations across all datasets.
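As a rough illustration of the probabilistic side-input selection, the helper below randomly drops optional geometric inputs per view during training; the drop probabilities and dictionary keys are assumptions, not values from the paper.

```python
import random

def sample_side_inputs(view: dict, p_intrinsics=0.5, p_depth=0.5, p_pose=0.5) -> dict:
    """Randomly keep or drop each optional geometric input so the model sees
    every input combination during training (image-only, image+depth, ...)."""
    return {
        "image": view["image"],
        "ray_map": view.get("ray_map") if random.random() < p_intrinsics else None,
        "depth": view.get("depth") if random.random() < p_depth else None,
        "pose": view.get("pose") if random.random() < p_pose else None,
    }
```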

4. Experimental Evaluation

Extensive experiments show that MapAnything matches or outperforms specialist feed-forward models in various tasks:

  • In multi-view dense reconstruction, it achieves lower absolute relative errors than dedicated stereo systems.
  • In two-view setups, addition of geometric side-inputs (intrinsics, pose, depth) further reduces error.
  • On single-view camera calibration from images, MapAnything yields state-of-the-art angular prediction error.
  • On depth benchmarks (ETH3D, ScanNet, KITTI), robust performance is demonstrated even with partial input modalities.
  • Joint training across tasks and datasets improves generalization and training efficiency, yielding a practical universal 3D reconstruction backbone.

5. Applications

The model’s versatility enables application in:

  • Uncalibrated structure-from-motion (SfM) and calibrated multi-view stereo (MVS).
  • Monocular and multi-view depth estimation.
  • Camera localization and pose refinement.
  • Depth completion from sparse or partial reconstructions.
  • Robotics, AR/VR scene understanding, geospatial mapping, and rapid metric 3D reconstruction from diverse sensor suites.

Its feed-forward formulation supports rapid inference, efficient memory usage, and adaptability to variable input sizes and modalities.

6. Limitations and Future Prospects

The current feed-forward approach operates with fixed input-output correspondences, limiting scalability for very large, high-resolution scenes. The metric scale estimation assumes well-behaved side-inputs (poses, depth); explicit uncertainty modeling for noisy inputs is under exploration. Potential future directions include:

  • Dynamic scene parameterization (tracking motion and scene flow).
  • Iterative or test-time refinement loops for large-scale scenes.
  • Advanced multimodal fusion (late cross-attention, hierarchical encoding).
  • Improving decoupled output efficiency (beyond one-to-one pixel/point correspondence).

This model establishes a formal framework for universal, metric 3D scene reconstruction, bridging feed-forward transformer architectures with flexible, standardized multi-modal input/output processing, and sets the foundation for future advances in general-purpose 3D vision systems.
