MoGe-2: Advanced Monocular 3D Geometry Estimation

Updated 8 July 2025
  • MoGe-2 is an advanced monocular 3D geometry estimation framework that separates relative geometry from metric scale to deliver precise reconstructions from a single RGB image.
  • The architecture uses a DINOv2-based image encoder alongside dedicated convolutional and MLP decoders to produce detailed affine-invariant geometry and accurate metric scaling.
  • A refined data curation protocol leveraging synthetic and improved real datasets enhances boundary sharpness and overall 3D reconstruction accuracy across diverse scenes.

MoGe-2 refers to an advanced monocular geometry estimation framework that extends prior monocular 3D reconstruction approaches—most directly, the MoGe system—by enabling accurate 3D metric geometry reconstruction with sharp detail from a single RGB image (2507.02546). The method carefully disentangles relative geometry (shape) prediction from metric scale recovery and introduces a refined data curation protocol, establishing new performance baselines in open-domain geometry estimation.

1. Architectural Framework and Decoupling of Geometry and Scale

MoGe-2’s architecture builds upon the original MoGe model (2410.19115). Its foundation is an image encoder based on DINOv2 with interpolatable positional embeddings, accommodating arbitrary image resolutions (e.g., multiples of 14 pixels). Feature tokens and a classification (CLS) token are produced for each image.

Three major components define the architecture:

  • Relative Geometry Head: A lightweight convolutional decoder receives the token grid and progressively upsamples to yield a dense affine-invariant 3D point map, focusing on the relative scene structure.
  • Metric Scale Head: To achieve metric reconstruction, MoGe-2 introduces a dedicated scalar prediction branch—an MLP conditioned solely on the global CLS token. This head passes its output through an exponential function to ensure positive scale.
  • Task-Specific Decoders: Additional heads branch off from the shared neck for auxiliary tasks such as mask prediction and surface normal estimation.

By separating the scale prediction pathway (global context only) from the local geometry decoder (fine, high-frequency features), MoGe-2 preserves relative geometry accuracy while providing robust, unbiased metric scaling across varied domains.
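The decoupling can be made concrete with a minimal PyTorch sketch. The layer widths, decoder depth, and the encoder's output convention are illustrative assumptions here, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MetricScaleHead(nn.Module):
    """MLP on the global CLS token; exp() guarantees a positive scale."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        return torch.exp(self.mlp(cls_token))          # (B, 1), always > 0

class PointMapDecoder(nn.Module):
    """Lightweight convolutional decoder: token grid -> dense 3D point map."""
    def __init__(self, dim: int = 1024, up_steps: int = 4):
        super().__init__()
        layers, ch = [], dim
        for _ in range(up_steps):                      # progressive 2x upsampling
            layers += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       nn.Conv2d(ch, ch // 2, 3, padding=1), nn.GELU()]
            ch //= 2
        layers += [nn.Conv2d(ch, 3, 3, padding=1)]     # 3 channels: (x, y, z)
        self.net = nn.Sequential(*layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.net(tokens)                        # (B, 3, H, W) point map

class MoGe2Sketch(nn.Module):
    """Decoupled heads: local features -> shape, global CLS -> metric scale."""
    def __init__(self, encoder: nn.Module, dim: int = 1024):
        super().__init__()
        self.encoder = encoder                         # e.g. a DINOv2 backbone
        self.geometry = PointMapDecoder(dim)
        self.scale = MetricScaleHead(dim)

    def forward(self, image: torch.Tensor):
        tokens, cls = self.encoder(image)              # assumed: (B,C,h,w) grid and (B,C) CLS
        rel_points = self.geometry(tokens)             # affine-invariant geometry
        s = self.scale(cls)                            # global metric scale
        return rel_points * s[..., None, None]         # metric point map
```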

2. Mathematical Underpinnings and Alignment Protocols

MoGe-2 employs formal alignment procedures both in training (incorporated into loss functions) and in evaluation. Given that monocular depth estimation inherently involves global scale and shift ambiguities, multiple alignment variants are leveraged:

  1. Scale-only alignment:

$$a^* = \arg\min_a \sum_{i \in \mathcal{N}} \frac{1}{z_i} \left\| a \, \hat{p}_i - p_i \right\|_1$$

where $\hat{p}_i$ is the model prediction, $p_i$ the ground truth, $z_i$ a depth weight, and $\mathcal{N}$ the valid mask.

  2. Affine alignment (scale and shift):

$$(a^*, b^*) = \arg\min_{a, b} \sum_{i \in \mathcal{N}} \frac{1}{z_i} \left\| a \, \hat{p}_i + b - p_i \right\|_1$$

  3. Depth map alignment:

$$(a^*, b^*) = \arg\min_{a, b} \sum_{i \in \mathcal{N}} \frac{1}{z_i} \left| a \, \hat{z}_i + b - z_i \right|$$

  4. Disparity alignment (least squares, where $\hat{d}_i = 1/\hat{z}_i$):

$$(a^*, b^*) = \arg\min_{a, b} \sum_{i \in \mathcal{N}} \left( a \, \hat{d}_i + b - d_i \right)^2$$

with the aligned depth recovered as $\hat{z}_i^* = 1 / \max\!\left(a^* \hat{d}_i + b^*,\ 1/z_{\max}\right)$.

These alignment protocols ensure robust and fair evaluation of both relative and metric geometry. The metric scale is extracted directly from the separate MLP head, not imposed via these alignments during inference.
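For concreteness, here is a minimal numpy sketch of two of the solvers above: the weighted-L1 scale-only alignment (whose minimizer is a weighted median of per-coordinate ratios) and the closed-form least-squares disparity alignment. Function names and the default clamp $z_{\max}$ are assumptions:

```python
import numpy as np

def scale_only_align(pred, gt, z, eps=1e-6):
    """L1 scale alignment: argmin_a sum_i (1/z_i) * ||a*pred_i - gt_i||_1.

    pred, gt: (N, 3) point maps over the valid mask; z: (N,) depth weights.
    The weighted-L1 minimizer is a weighted median of the ratios gt/pred.
    """
    x = pred.reshape(-1)
    y = gt.reshape(-1)
    w = np.repeat(1.0 / z, 3) * np.abs(x)       # per-coordinate weights
    keep = np.abs(x) > eps                      # avoid division by ~0
    r, w = y[keep] / x[keep], w[keep]
    order = np.argsort(r)
    cdf = np.cumsum(w[order])
    return r[order][np.searchsorted(cdf, 0.5 * cdf[-1])]

def disparity_align(pred_z, gt_z, z_max=100.0):
    """Least-squares disparity alignment, then recover clamped depth."""
    d_hat, d = 1.0 / pred_z, 1.0 / gt_z
    A = np.stack([d_hat, np.ones_like(d_hat)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    return 1.0 / np.maximum(a * d_hat + b, 1.0 / z_max)
```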

3. Training Data Refinement and Detail Enhancement

A central innovation of MoGe-2 is its unified data refinement pipeline to address the smoothing and loss of detail commonly encountered with real-world scan data:

  • Synthetic Data as Quality Guide: High-fidelity, sharp-boundary synthetic datasets are leveraged as reference labels to "refine" noisy or coarsely labeled real scans. The process involves consistency checks or filtering of real-world depth values using synthetic-model predictions, reducing outlier impact and boundary blurring (see the sketch after this list).
  • Fusion of Improved Real Data: Training is conducted on these "improved real" labels. MoGe-2 thus learns both global metric correctness and fine local geometric structure.
  • Ablation Results: Three training regimes were compared—synthetic-only (sharp, but less accurate globally), raw real-only (globally accurate, but smoothed), and improved real data (sharp and accurate). The improved real protocol yielded highest accuracy and sharpest details.
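The consistency-filtering idea can be illustrated with a short sketch: predictions from a model trained on sharp synthetic data are aligned to the raw real depth, and real labels that disagree strongly are masked out. The median-ratio alignment and the rejection threshold are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

def filter_real_depth(real_z, synth_pred_z, valid, rel_thresh=0.25):
    """Keep only real depth values consistent with a synthetic-trained model.

    real_z, synth_pred_z: (H, W) depth maps; valid: (H, W) bool mask of
    measured pixels. Returns a refined validity mask.
    """
    # Scale-align the prediction to the real scan (median of depth ratios).
    s = np.median(real_z[valid] / synth_pred_z[valid])
    aligned = s * synth_pred_z
    # Reject real labels whose relative disagreement exceeds the threshold;
    # this trims outliers and depth bleeding across object boundaries.
    rel_err = np.abs(aligned - real_z) / np.maximum(real_z, 1e-6)
    return valid & (rel_err < rel_thresh)
```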

4. Fine-Grained Boundary Recovery and Evaluation

MoGe-2 achieves superior granularity in detail due to its combination of data refinement and the dense decoder design. High-resolution boundary detail and shape accuracy are quantified using standard boundary F1 scores and error measurements on 3D point maps.
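A tolerance-based boundary F1 of the kind used for such evaluations can be sketched as follows; the boolean edge maps and the pixel tolerance are assumptions, as the paper's exact matching protocol is not reproduced here:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_f1(pred_edges, gt_edges, tol=2.0):
    """F1 between boolean edge maps, matching within `tol` pixels."""
    if not pred_edges.any() or not gt_edges.any():
        return 0.0
    # Distance from every pixel to the nearest edge pixel in the other map.
    dist_to_gt = distance_transform_edt(~gt_edges)
    dist_to_pred = distance_transform_edt(~pred_edges)
    precision = (dist_to_gt[pred_edges] <= tol).mean()
    recall = (dist_to_pred[gt_edges] <= tol).mean()
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```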

Robustness to input resolution is shown by varying the spatial density of encoder tokens, with MoGe-2 maintaining sharp geometric features at a range of computational latencies.

5. Quantitative Performance and Benchmarking

MoGe-2 is comprehensively evaluated on benchmark datasets including indoor, outdoor, and open-domain scenes. Metrics encompass:

  • Relative geometry error (after affine alignment): MoGe-2 outperforms MoGe (2410.19115), UniDepth, Depth Pro, and Metric3D V2, with lower mean relative error and higher inlier rates (both computed as in the sketch after this list).
  • Metric accuracy (without alignment): By virtue of the explicit scale head, MoGe-2 achieves significantly lower absolute error in metric depth compared to prior methods.
  • Boundary sharpness: F1 scores and qualitative inspection confirm sharper, more accurate edge recovery.
  • Efficiency: The decoupled design and lightweight decoder ensure runtime efficiency for a given accuracy level.
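The relative-error and inlier-rate criteria above correspond to standard depth metrics; a minimal numpy sketch, assuming the conventional $\delta < 1.25$ inlier threshold, is:

```python
import numpy as np

def depth_metrics(pred_z, gt_z):
    """Mean absolute relative error and delta < 1.25 inlier rate."""
    abs_rel = np.mean(np.abs(pred_z - gt_z) / gt_z)
    ratio = np.maximum(pred_z / gt_z, gt_z / pred_z)
    inlier = np.mean(ratio < 1.25)
    return abs_rel, inlier
```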

These results establish MoGe-2 as the first approach to excel simultaneously at accurate relative and metric 3D geometry prediction and at fine-detail reconstruction across diverse image domains.

6. Implications and Context

MoGe-2 extends the practical frontiers of single-image 3D scene reconstruction. The explicit separation of metric scale prediction (via a global-token MLP) from relative geometry recovery addresses longstanding limitations of monocular pipelines in open-domain settings. The data refinement strategy demonstrates, for the first time, that synthetic data can systematically improve fine-detail recovery in real-world 3D scans without sacrificing global accuracy.

This framework has direct relevance to downstream tasks such as AR/VR scene understanding, autonomous navigation, object insertion, and large-scale mapping.

7. Limitations and Future Directions

While MoGe-2 demonstrates significant advances, extension to domains with fundamentally novel scene structures or extremely sparse ground-truth geometry may still pose challenges. Further research is warranted to generalize the refinement protocol across sensing modalities and to couple geometry estimation with semantic reasoning for holistic scene understanding.


In summary, MoGe-2 marks a substantial step forward in monocular geometry estimation by achieving unified, metrically accurate, and sharp-detail 3D reconstruction from a single image. The approach combines a decoupled, efficient architecture with refined data supervision, attaining results that previous methods have not simultaneously realized (2507.02546).