
MapAnything: Universal Metric 3D Reconstruction

This presentation introduces MapAnything, a unified feed-forward transformer model capable of metric 3D reconstruction from any number of images. We will explore how its factored scene representation handles heterogeneous geometric inputs—like rays, poses, and depth—to solve over 12 different 3D vision tasks in a single forward pass without the need for complex modular pipelines.
Script
What if a single neural network could look at any number of photos and instantly understand the exact physical scale, camera angles, and depth of the entire scene? MapAnything achieves this by replacing complex 3D pipelines with one universal feed-forward model.
Moving beyond the limitations of traditional modular pipelines, the authors identify that current models are often too rigid, failing to exploit extra geometric data when it is available.
By factoring the scene into separate components like rays, depths, and a global metric scale, this approach avoids the redundancies of multiple overlapping pointmaps.
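The factored idea can be sketched in a few lines: each view predicts unit ray directions and up-to-scale depths, and a single global scale factor plus the camera pose turns them into metric world-space points. This is a toy illustration of the composition, not the paper's actual code; the function name and argument layout are assumptions.

```python
import numpy as np

def compose_metric_points(rays, depth, pose, scale):
    """Combine factored predictions into a metric 3D pointmap (illustrative sketch).

    rays:  (H, W, 3) unit ray directions in the camera frame
    depth: (H, W) up-to-scale depth along each ray
    pose:  (4, 4) camera-to-world transform
    scale: scalar global metric scale factor
    """
    # Scale the depths to metric units, then push points out along their rays.
    pts_cam = rays * (scale * depth)[..., None]
    # Homogenize and transform into the shared world frame.
    pts_h = np.concatenate([pts_cam, np.ones_like(depth)[..., None]], axis=-1)
    return (pts_h @ pose.T)[..., :3]
```

Because every view reuses the same global scale and pose factors, the model avoids predicting redundant overlapping pointmaps per view.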
Let us dive into the flexible architecture that allows this model to process anything from a single photo to a massive multi-view dataset.
Following the encoding of images and geometry, a 24-layer transformer fuses information across views using a specialized learnable scale token to predict the final metric reconstruction.
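The fusion step amounts to flattening per-view patch tokens into one sequence and prepending a shared learnable scale token, so cross-view attention can aggregate evidence into a single metric scale estimate. A minimal sketch of that token assembly, with illustrative names and shapes (not the paper's API):

```python
import numpy as np

def build_token_sequence(view_tokens, scale_token):
    """Assemble the transformer input for multi-view fusion (illustrative sketch).

    view_tokens: list of (H_p, W_p, D) patch-token grids, one per view
    scale_token: (D,) learnable token whose output embedding decodes the metric scale
    """
    # Flatten each view's patch grid to (H_p * W_p, D) and stack all views.
    flat = np.concatenate([t.reshape(-1, t.shape[-1]) for t in view_tokens], axis=0)
    # Prepend the single scale token so it can attend to every view.
    return np.concatenate([scale_token[None, :], flat], axis=0)
```

After the transformer layers, the output at the scale token's position is read off to predict the global metric scale, while the remaining tokens decode per-view geometry.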
To ensure versatility, the model is trained on varying combinations of inputs, so its results improve whenever extra information such as known poses or partial depth maps is provided.
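One common way to train for arbitrary input combinations is to randomly drop the optional geometric inputs during training, so the network learns to cope with any subset at inference time. A hypothetical sketch of such input sampling (the key names and keep probability are assumptions, not taken from the paper):

```python
import random

def sample_inputs(batch, p_keep=0.5):
    """Randomly keep or drop optional geometric inputs (illustrative sketch).

    Images are always passed through; auxiliary inputs like poses, depth
    maps, or rays are each kept independently with probability p_keep.
    """
    inputs = {"images": batch["images"]}
    for key in ("poses", "depth", "rays"):
        if key in batch and random.random() < p_keep:
            inputs[key] = batch[key]
    return inputs
```

At test time, whatever subset the user supplies simply maps onto one of the combinations the model has already seen in training.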
This result demonstrates the model's ability to generate incredibly detailed 3D scenes from images alone, even when dealing with artistic renderings or monocular shots it wasn't specifically trained for.
Because of its robust design, the system matches or exceeds the performance of specialist models on major benchmarks while maintaining zero-shot capabilities.
Despite these successes, challenges remain in handling dynamic scenes and memory scaling, marking the clear path for future iterations of this architecture.
By unifying geometry and scale into a single feed-forward step, MapAnything unlocks a truly universal approach to seeing the world in 3D. Go to EmergentMind.com to learn more.