Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding

Published 29 Apr 2026 in cs.CV | (2604.26261v1)

Abstract: Zero-shot 3D Visual Grounding (3DVG) is a critical capability for open-world embodied AI. However, existing methods are fundamentally bottlenecked by the poor quality of open-vocabulary 3D proposals, suffering from inaccurate categories and imprecise geometries, as well as the spatial redundancy of exhaustive multi-view reasoning. To address these challenges, we propose MCM-VG, a novel framework that achieves robust zero-shot 3DVG by explicitly establishing Multiple Consistent 2D-3D Mappings. Instead of passively relying on noisy 3D segments, MCM-VG enforces 2D-3D consistency across three fundamental dimensions to achieve precise target localization and reliable reasoning. First, a Semantic Alignment module corrects category mismatches via LLM-driven query parsing and coarse-to-fine 2D-3D matching. Second, an Instance Rectification module leverages VLM-guided 2D segmentations to reconstruct missing targets, back-projecting these reliable visual priors to establish accurate 3D geometries. Finally, to eliminate spatial redundancy, a Viewpoint Distillation module clusters 3D camera directions to extract optimal frames. By pairing these optimal RGB frames with Bird's Eye View maps into concise visual prompt sets, we formulate the final target disambiguation as a multiple-choice reasoning task for Vision-LLMs. Extensive evaluations on ScanRefer and Nr3D benchmarks demonstrate that MCM-VG sets a new state-of-the-art for zero-shot 3D visual grounding. Remarkably, it achieves 62.0% and 53.6% in Acc@0.25 and Acc@0.5 on ScanRefer, outperforming previous baselines by substantial margins of 6.4% and 4.0%.

Authors (10)

Summary

The paper introduces MCM-VG, a novel framework that enforces multi-level consistency between 2D visual cues and 3D geometries for robust zero-shot visual grounding.
It employs three integrated modules – Semantic Alignment, Instance Rectification, and Viewpoint Distillation – that synergistically overcome limitations of noisy 3D proposals.
Empirical results on ScanRefer and Nr3D datasets show significant accuracy gains, validating MCM-VG's practical impact for embodied AI in complex scenes.

Robust Zero-Shot 3D Visual Grounding with Multiple Consistent 2D-3D Mappings

Introduction

The paper "Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding" (2604.26261) proposes MCM-VG, a framework addressing core challenges in open-world zero-shot 3D Visual Grounding (3DVG). Zero-shot 3DVG enables embodied agents to localize arbitrary objects in cluttered 3D environments based on free-form natural language queries, a critical capability for robotics and scene understanding. Existing zero-shot methods fundamentally depend on noisy open-vocabulary 3D proposals, rendering them susceptible to category mispredictions, geometric imprecision, and spatial redundancy during exhaustive multi-view reasoning. MCM-VG introduces explicit consistency enforcement between 2D observations and 3D geometries across semantic, instance, and viewpoint dimensions using a cascaded architecture comprising Semantic Alignment, Instance Rectification, and Viewpoint Distillation modules.

Figure 1: Comparative illustration of MCM-VG versus SeeGround and SeqVLM, highlighting multi-level 2D-3D consistency enforcement for improved 3D visual grounding.

Methodology

MCM-VG is systematically structured into three tightly integrated modules. Each module is designed to rectify specific deficiencies of prior zero-shot pipelines and collectively synthesize robust 2D-3D mappings for precise spatial grounding without task-specific 3D-text annotations.

Figure 2: Schematization of the MCM-VG pipeline, outlining semantic alignment, instance rectification, and viewpoint distillation for zero-shot 3D visual grounding.

Semantic Alignment Module

Open-vocabulary 3D segmentation models suffer from brittle category prediction due to limited taxonomy and lack of full scene context. The Semantic Alignment module ameliorates this by parsing the language query with an LLM, extracting target and anchor categories, and obtaining top-10 candidate categories. Initial 3D proposals are projected onto multi-view 2D planes, followed by coarse-to-fine matching with CLIP text-visual embeddings to filter high-confidence candidates. Matched proposals undergo further validation using advanced text-driven 2D segmentation models. A spatial alignment score, based on multi-view IoU and confidence aggregation, is employed to refine the proposal set, propagate reliable semantics, and rectify mismatches, forming a semantically consistent object pool.
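The spatial alignment score is described only as combining multi-view IoU with confidence aggregation, so the exact formula is not given here. As a minimal sketch, assuming the score is a confidence-weighted mean of per-view box IoU between a projected 3D proposal and the 2D segmentation result, it might look like the following. The names `box_iou` and `spatial_alignment_score`, the box format, and the weighting scheme are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_alignment_score(views: List[Tuple[Box, Box, float]]) -> float:
    """Confidence-weighted mean IoU across views (assumed aggregation).

    Each entry is (projected_proposal_box, segmentation_box, confidence).
    Returns 0.0 when no view carries positive confidence.
    """
    total_conf = sum(conf for _, _, conf in views)
    if total_conf <= 0:
        return 0.0
    weighted = sum(conf * box_iou(p, s) for p, s, conf in views)
    return weighted / total_conf
```

A proposal whose projected boxes agree with the 2D segmentation in the confident views scores close to 1, while a proposal that overlaps poorly, or only in low-confidence views, scores close to 0 and could be filtered or relabeled.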

Instance Rectification Module

When initial 3D proposals fail because of geometric inaccuracies or missed detections, the Instance Rectification module falls back on 2D visual priors. Multi-modal VLMs, augmented with anchor-based frame selection, generate spatial point prompts (two positive and one negative per frame). These prompts feed into 2D segmentation models (e.g., SAM3), yielding pixel-level masks, which are then back-projected into 3D space via depth-aware reconstruction and spatial denoising. This recovers missing target geometries even when standard 3D detectors fail on category or bounding box prediction.
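The back-projection step can be illustrated with the standard pinhole camera model. The sketch below assumes a metric depth map aligned with the mask, known intrinsics (`fx`, `fy`, `cx`, `cy`), and a 4x4 camera-to-world pose. The function name `backproject_mask` and its argument layout are hypothetical, and the spatial denoising mentioned above is deliberately left out.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject_mask(
    mask: Sequence[Sequence[int]],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point3]:
    """Lift masked pixels with valid depth to world-space 3D points.

    mask and depth are row-major 2D grids of equal size; cam_to_world is a
    4x4 row-major homogeneous transform (assumed layout).
    """
    points: List[Point3] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, d) in enumerate(zip(mask_row, depth_row)):
            if not inside or d <= 0:
                continue  # skip background pixels and missing depth
            # Pinhole model: pixel (u, v) at depth d -> camera coordinates.
            xc = (u - cx) * d / fx
            yc = (v - cy) * d / fy
            zc = d
            # Apply the rigid camera-to-world transform (first three rows).
            xw, yw, zw = (
                row[0] * xc + row[1] * yc + row[2] * zc + row[3]
                for row in cam_to_world[:3]
            )
            points.append((xw, yw, zw))
    return points
```

Points gathered this way from several frames would then be merged and filtered (for example by outlier removal) before a 3D bounding box is fitted; that filtering is the denoising step the sketch omits.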

Viewpoint Distillation Module

To counter spatial redundancy and inefficient VLM token consumption, the Viewpoint Distillation module clusters 3D camera directions to select optimal frames. High-quality multi-view frames are clustered based on angular similarity, retaining only representative perspectives with maximal visibility. Each selected RGB frame is paired with a Bird's Eye View (BEV) map, annotated with spatial markers, directional arrows, and object IDs. The visual prompt pairs are concatenated into concise input sets for the final VLM spatial reasoning step, which is formulated as a multiple-choice task and performs global spatial disambiguation.
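The clustering algorithm itself is not specified in this summary. One simple way to realize "cluster by angular similarity, keep the most visible view" is a greedy selection in angle space, sketched below. The function names, the degree-valued threshold `eps_deg` (standing in for the angular parameter $\epsilon$), and the per-frame `visibility` scores are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def angle_between(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in degrees between two non-zero 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clamp rounding error
    return math.degrees(math.acos(cos))


def select_representative_frames(
    directions: Sequence[Sequence[float]],
    visibility: Sequence[float],
    eps_deg: float,
) -> List[int]:
    """Greedy angular clustering (assumed scheme, not the paper's).

    Frames are visited from most to least visible. A frame is kept only if
    its camera direction differs by more than eps_deg from every frame
    already kept, so each kept frame is the most visible in its cluster.
    """
    order = sorted(range(len(directions)), key=lambda i: visibility[i], reverse=True)
    selected: List[int] = []
    for i in order:
        if all(angle_between(directions[i], directions[j]) > eps_deg for j in selected):
            selected.append(i)
    return selected
```

With four frames looking roughly along two axes, a threshold of 30 degrees keeps one frame per viewing direction, which is the kind of reduction that shortens the visual prompt passed to the VLM.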

Empirical Results

Extensive experiments on ScanRefer and Nr3D datasets demonstrate the efficacy of MCM-VG over both zero-shot and fully-supervised baselines. On ScanRefer, MCM-VG achieves 62.0% Acc@0.25 and 53.6% Acc@0.5, outperforming the previous best zero-shot method SeqVLM by absolute margins of 6.4% and 4.0%, respectively, and improving accuracy on the "Multiple" subset by up to 7.1%. On Nr3D, MCM-VG achieves 57.2% overall accuracy, exceeding SeqVLM by 4.0%, with notable gains in the hardest splits.

Figure 3: Qualitative visualization of MCM-VG 3D grounding outputs on ScanRefer, exemplifying robust localization in cluttered, ambiguous scenarios.

Ablation analysis confirms that each module (Semantic Alignment, Instance Rectification, and Viewpoint Distillation) contributes non-trivially to final accuracy. Removing any one drops performance by 4–5%, with the largest penalty incurred by omitting spatial clustering in viewpoint distillation. Hyperparameter studies show robustness to parameter variation, with optimal values for the semantic matching threshold ($\gamma$), the angular clustering threshold ($\epsilon$), and the multi-view frame count yielding stable gains.

Figure 4: Performance sensitivity plot with respect to $\gamma$, documenting the effect of semantic alignment thresholding on grounding accuracy.

Comparative visualization against SeeGround and SeqVLM shows MCM-VG successfully addressing scenarios involving occluded or diminutive objects, where baseline approaches are frequently misled by category ambiguity or incomplete geometry.

Figure 5: Additional visualization results on ScanRefer, demonstrating consistency and resilience in diverse multi-instance, occluded settings.

Limitations

Despite strong accuracy, MCM-VG currently incurs high computational overhead due to sequential invocation of multiple foundation models (LLM, VLM, 2D/3D segmentation). This restricts deployment to offline or latency-flexible environments. The framework also depends on sufficient 2D view coverage—severe occlusion or truncation impedes 3D reconstruction by the rectification module. VLM spatial precision remains limited in highly cluttered scenes, potentially yielding ambiguous point prompts for segmentation. Robustness to adversarial viewpoints and model-distilled acceleration are cited as future directions.

Figure 6: Failure cases illustrating degraded performance under extreme occlusion and marginal view truncation.

Implications and Future Directions

The explicit 2D-3D consistency enforced by MCM-VG represents a step toward harnessing self-supervised generalization in embodied AI agents without closed-set annotation dependence. Practically, this enables robust localization and interaction in open-world settings with unseen object categories, crucial for robotics and AR/VR. Theoretically, the modular framework validates the efficacy of joint semantic, geometric, and viewpoint alignment in overcoming proposal noise and redundancy, supporting future research into unified end-to-end architectures and tighter foundation model integration. Advances in spatially aware VLMs, real-time proposal rectification, and adversarial occlusion handling are promising avenues to further reduce the gap with full supervision.

Conclusion

MCM-VG offers a comprehensive solution to zero-shot 3D visual grounding by leveraging multiple consistent mappings between 2D visual cues and 3D geometries. Three cascaded modules synergistically rectify semantic mismatches, instance mislocalization, and viewpoint redundancy. Significant empirical gains over previous baselines validate the approach. While computationally intensive, MCM-VG successfully bridges modality gaps and unlocks the potential of foundation models for open-vocabulary spatial reasoning, setting the stage for practical, annotation-free embodied AI deployments.
