3D Visual Grounding (3DVG)
- 3DVG is the task of localizing objects in 3D scenes using free-form language descriptions, outputting oriented boxes or instance masks.
- Recent methods employ two-stage and transformer-based architectures with multimodal fusion to enhance localization accuracy and spatial reasoning.
- Challenges include handling dynamic environments, complex language phenomena, and sensor variations, prompting research toward robust, real-time, and zero-shot systems.
Three-Dimensional Visual Grounding (3DVG) is the task of localizing objects in 3D scenes based on natural-language descriptions, and is fundamental for robotics, embodied AI, AR/VR, and multimodal interaction. The field has advanced rapidly in datasets, architectures, and a spectrum of supervised and zero-shot methods, while confronting challenges from dynamic environments, multimodal perception, and rich language phenomena.
1. Core Problem Definition and Task Variants
At its core, 3DVG seeks a function $f: (\mathcal{S}, q) \mapsto \mathcal{B}$, where $\mathcal{S}$ is the 3D scene (e.g., point cloud, mesh, LiDAR sweep), $q$ is a free-form referring expression, and $\mathcal{B}$ is the predicted 3D region. The output is typically an oriented or axis-aligned 3D bounding box (center, size, optionally heading) or an instance mask selecting a set of points/voxels.
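As a minimal illustration of this interface, the sketch below frames the task as a function from a colored point cloud and a query string to a 3D box; the `Box3D` fields and the `ground` signature are illustrative assumptions rather than any specific system's API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Box3D:
    center: np.ndarray    # (3,) box center in scene coordinates
    size: np.ndarray      # (3,) extents along each axis
    heading: float = 0.0  # yaw angle; 0.0 for axis-aligned boxes

def ground(scene_points: np.ndarray, query: str) -> Box3D:
    """Map a 3D scene (N x 6 array of XYZRGB points) and a free-form
    referring expression to the referred object's bounding box.
    A real model would run detection plus cross-modal fusion here."""
    raise NotImplementedError  # placeholder for a concrete 3DVG model
```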
Key task variants:
- Scene Input Modality: Indoor RGB-D point clouds (ScanNet/ScanRefer), LiDAR sweeps for outdoor scenes (Talk2Car-3D), or implicit 3D representations (e.g., Gaussians in 3DGS).
- Supervision regime: Fully supervised with 3D box/mask ground truth, weak/no supervision (zero-shot), cross-dataset generalization, or multi-task setups (joint 3DVG and dense captioning).
- Query richness: From synthetic spatial templates (“the chair farthest from the cabinet” (Hu et al., 16 Oct 2025)) to complex, linguistically diverse prompts covering negation, counting, and coreference (Wang et al., 2 Jan 2025).
Formally, the generalized 3DVG task (as argued for real-world deployment) takes as input $(\mathcal{S}_{\text{past}}, \mathcal{S}_{\text{now}}, \mathcal{M}, q)$, where $\mathcal{S}_{\text{past}}$ is the prior scan, $\mathcal{S}_{\text{now}}$ the current (possibly unexplored) scene, $\mathcal{M}$ a memory bank of RGB-D views with poses, and $q$ the natural-language query (Hu et al., 16 Oct 2025). System objectives include maximizing localization accuracy while minimizing the exploration action count and motion cost.
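A hedged sketch of how these generalized inputs and the exploration-cost objective might be bundled; the field names and cost weights are assumptions for illustration, not the ChangingGrounding specification.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GroundingEpisode:
    prior_scan: np.ndarray   # S_past: reconstructed point cloud from the earlier visit
    current_scene: object    # S_now: live, possibly partially explored scene handle
    memory_views: list = field(default_factory=list)  # posed RGB-D frames (image, depth, pose)
    query: str = ""          # natural-language referring expression

def episode_cost(num_actions: int, motion_cost: float,
                 w_act: float = 1.0, w_mot: float = 1.0) -> float:
    """Scalarized exploration cost an agent tries to keep low while
    maximizing localization accuracy (weights are assumptions)."""
    return w_act * num_actions + w_mot * motion_cost
```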
2. Datasets, Benchmarks, and Evaluation Methodologies
Indoor datasets:
- ScanRefer/ScanNet: >50K queries over 800 scenes, providing point clouds, object proposals, and free-form queries (Wang et al., 16 Jan 2025).
- ReferIt3D (Nr3D/Sr3D): Focus on single-object reference games with less ambiguity, diverse anchor/target phenomena (Huang et al., 2022).
- 3RScan/ChangingGrounding: Dynamic scenes (1,482 scans, 478 environments) aligned over time, emphasizing memory-aware grounding under scene changes (Hu et al., 16 Oct 2025).
- ViGiL3D: Diagnostic benchmark of 350 prompts across 30 linguistic metrics, highlighting phenomena previously under-studied, such as negation and ordinal reference (Wang et al., 2 Jan 2025).
Outdoor datasets:
- Talk2Car-3D: LiDAR-based 3DVG for autonomous driving (7,115/998/2,056 splits), supporting fine-to-coarse category inference and large-scale point clouds (Baek et al., 7 Nov 2024).
Metrics:
- Accuracy @ IoU=0.25, 0.5: fraction of predicted boxes/masks whose IoU with the ground truth exceeds the threshold (a minimal sketch follows this list).
- Recall@1 or selection accuracy: For datasets with a finite candidate set (e.g., Sr3D, Nr3D).
- Action/motion cost: For active/embodied setups (Hu et al., 16 Oct 2025).
- Dense mask-based F1 vs. box-based accuracy: For fine-grained geometric localization (Unal et al., 2023).
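As noted in the accuracy bullet above, a minimal sketch of Acc@IoU for axis-aligned boxes follows; oriented boxes would require a rotated-intersection routine, which is omitted here.

```python
import numpy as np

def aabb_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as a (center, size) pair of 3-vectors."""
    (ca, sa), (cb, sb) = box_a, box_b
    min_a, max_a = ca - sa / 2, ca + sa / 2
    min_b, max_b = cb - sb / 2, cb + sb / 2
    inter = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0, None)
    inter_vol = inter.prod()
    union = sa.prod() + sb.prod() - inter_vol
    return inter_vol / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = [aabb_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits)) if hits else 0.0
```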
Cross-dataset evaluation is emphasized as critical to test robustness to sensor, reconstruction, and annotation variations (Miyanishi et al., 2023).
3. Model Architectures and Algorithmic Advances
Canonical supervised pipelines:
- Two-stage “detect-then-ground”: a 3D detector or instance segmentation network (e.g., VoteNet, PointGroup, Mask3D) produces candidate proposals; a cross-modal transformer aligns the proposals with the language to select the target, as sketched after this list (Huang et al., 2022, Wang et al., 16 Jan 2025).
- Single-stage DETR-style: End-to-end detection and grounding with transformer decoders directly outputting box predictions conditioned on fused vision-language features, incorporating prompt-based object localization (Luo et al., 17 Apr 2024).
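The two-stage recipe referenced above can be sketched as below; `detector`, `text_encoder`, and `fusion_head` are placeholders standing in for models such as PointGroup/Mask3D and a cross-modal transformer, not any particular implementation.

```python
import numpy as np

def detect_then_ground(scene_points, query, detector, text_encoder, fusion_head):
    """Generic two-stage pipeline sketch: the detector proposes candidate
    objects, a cross-modal head scores each proposal against the query,
    and the highest-scoring proposal's box is returned."""
    proposals = detector(scene_points)            # list of (object_features, box) candidates
    text_feat = text_encoder(query)               # (d,) sentence embedding
    scores = [fusion_head(obj_feat, text_feat) for obj_feat, _ in proposals]
    best = int(np.argmax(scores))
    return proposals[best][1]                     # the selected 3D box
```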
Advancements in cross-modal fusion and efficiency:
- Sparse convolutional backbones with text-guided pruning: TSP3D iteratively prunes irrelevant voxels based on textual cues and restores fine detail with completion-based addition, achieving real-time inference at 12.4 FPS with SOTA accuracy (a simplified sketch follows this list) (Guo et al., 14 Feb 2025).
- Multi-view modeling: MVT encodes the same scene under multiple yaw rotations, aggregates cross-modal features, and achieves robust view-invariant grounding (+11.2% accuracy improvement on Nr3D) (Huang et al., 2022).
- Dense mask prediction (instance segmentation): ConcreteNet introduces bottom-up attentive fusion, learned camera tokens for view-dependent queries, and multi-view ensembling, producing fine-grained referred masks rather than boxes (Unal et al., 2023).
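As referenced in the pruning bullet above, the core idea of text-guided pruning can be sketched as scoring voxel features against a text embedding and keeping only the most relevant fraction; the cosine-similarity scoring and keep ratio here are assumptions, not TSP3D's exact procedure.

```python
import numpy as np

def text_guided_prune(voxel_feats, voxel_coords, text_feat, keep_ratio=0.25):
    """Score each voxel feature against the text embedding and keep the top
    fraction, so later fusion layers operate on a sparser voxel set."""
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    scores = v @ t                                  # (N,) text-relevance scores
    k = max(1, int(keep_ratio * len(scores)))
    keep = np.argsort(-scores)[:k]                  # indices of the most relevant voxels
    return voxel_feats[keep], voxel_coords[keep]
```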
Language and spatial reasoning:
- Phrase-level alignment: 3DPAG jointly localizes every phrase-object pair via a phrase-object alignment map and phrase-specific pre-training, improving both fine-grained grounding and interpretability (Yuan et al., 2022).
- Spatial relation modeling: the Language-Spatial Adaptive Decoder (LSAD) in AugRefer fuses language, pairwise, and global 3D spatial relations within its attention blocks, yielding large gains especially on “multiple” and spatially ambiguous queries; see the sketch after this list (Wang et al., 16 Jan 2025).
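As mentioned above, one simple way to expose pairwise 3D spatial relations to attention layers is to build per-pair geometric features; the five-dimensional encoding below is an illustrative assumption, not the exact LSAD formulation.

```python
import numpy as np

def pairwise_spatial_features(centers):
    """For M object centers (M x 3), build an M x M x 5 tensor of
    [dx, dy, dz, horizontal distance, vertical offset], usable as a
    spatial bias or extra key/value input in cross-attention."""
    diff = centers[None, :, :] - centers[:, None, :]           # (M, M, 3) center offsets
    horiz = np.linalg.norm(diff[..., :2], axis=-1, keepdims=True)  # (M, M, 1) planar distance
    vert = diff[..., 2:3]                                      # (M, M, 1) height difference
    return np.concatenate([diff, horiz, vert], axis=-1)
```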
Modality extensions:
- Audio-based grounding: Audio-3DVG models speech input with an Object Mention Detection head and audio-guided attention, outperforming prior audio baselines and narrowing the text–audio performance gap (Cao-Dinh et al., 1 Jul 2025).
- Robustness to noisy/ambiguous speech: SpeechRefer combines phonetic-aware refinement and speech–text–vision contrastive alignment, maintaining ∼2–3.6 pp accuracy gains under transcription errors and real-world noise (Qi et al., 17 Jun 2025).
Memory-driven and active agents:
- Changing scenes: Mem-ChangingGrounder integrates cross-modal retrieval, memory reuse, targeted multi-view scanning, and fallback exploration to maximize accuracy while minimizing exploration costs, outperforming static baselines in dynamic environments (Hu et al., 16 Oct 2025).
4. Zero-Shot and Foundation Model-Based 3DVG
Recent work has explored leveraging foundation models for “open-vocabulary” or zero-shot 3DVG:
- 2D vision-language model (VLM) transfer: SeeGround and ZSVG3D render query-aligned 2D views of the 3D scene, enrich them with spatial text prompts, and use VLMs (Qwen2-VL-72B, CLIP, GPT-4V) to identify the target without 3D-specific training. SeeGround achieves 44.1%/39.4% (Acc@0.25/0.5) overall on ScanRefer, +7.7% over prior zero-shot methods (Li et al., 28 May 2025, Li et al., 5 Dec 2024).
- Visual programming for reasoning: LLMs generate and execute modular grounder programs (encoding geometric, spatial, view-dependent, and open-vocabulary modules), enabling compositional, explainable zero-shot 3DVG that exceeds several supervised baselines under the strict Acc@0.5 metric (Yuan et al., 2023).
- Constraint satisfaction as reasoning: CSVG converts language-guided 3DVG into constraint satisfaction problems (CSPs), benefiting from global spatial consistency and explicit min/max and negation handling (e.g., “without any trash can beside it”), substantially outperforming earlier zero-shot strategies (+7.0% Acc@0.25 on ScanRefer, +11.2% on Nr3D); a toy sketch follows this list (Yuan et al., 21 Nov 2024).
- Hybrid representations for foundation model adaptation: S²-MLLM introduces 3D structure-aware training objectives (feed-forward 3D reconstruction) and structure-enhanced modules with intra/inter-view attention and multi-level position encoding to enable efficient, robust 3D spatial reasoning without 3D reconstruction at inference, achieving 59.2%/52.7% on ScanRefer (Xu et al., 1 Dec 2025).
- Multi-modal cross-domain transfer: Zero-shot pipelines on 3D Gaussian Splatting (3DGS), such as GVR, recast 3DVG as multi-view retrieval/segmentation via CLIP/SAM, then fuse spatial information back into 3D (Liao et al., 19 Sep 2025).
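The constraint-satisfaction view referenced above can be illustrated with a toy filter over detected objects: candidates of the target class act as variables, and relations parsed from the query become constraints that prune them. CSVG generates and solves a much richer program via an LLM; the predicate and the 1 m threshold below are assumptions for illustration only.

```python
import numpy as np

def solve_grounding_csp(objects, target_label, constraints):
    """objects: list of dicts with at least 'label' and 'center' keys.
    constraints: callables taking (candidate, all_objects) and returning True
    if the candidate satisfies the relation. Returns surviving candidates."""
    candidates = [o for o in objects if o["label"] == target_label]
    for constraint in constraints:
        candidates = [c for c in candidates if constraint(c, objects)]
    return candidates

# Example constraint for "the chair without any trash can beside it":
no_trash_nearby = lambda c, objs: all(
    np.linalg.norm(np.array(c["center"]) - np.array(o["center"])) > 1.0
    for o in objs if o["label"] == "trash can"
)
# chairs = solve_grounding_csp(scene_objects, "chair", [no_trash_nearby])
```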
5. Training Protocols, Augmentation, and Limitations
Training methodologies:
- Cross-modal data augmentation: AugRefer systematically generates new object insertions and diverse captions via hybrid rendering and LLM-based generation, synthetically expanding the training distribution and boosting SOTA Acc@0.25 on ScanRefer/Nr3D/Sr3D by several points; a simplified sketch follows this list (Wang et al., 16 Jan 2025).
- Joint multi-task learning: 3DGCTR unifies 3DVG and dense captioning with prompt-based localization, achieving SOTA on ScanRefer (Luo et al., 17 Apr 2024).
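The augmentation bullet above can be sketched as inserting an external object's points at a free location and pairing the new scene with a templated caption; AugRefer's actual pipeline uses hybrid rendering, collision-aware placement, and LLM-generated captions, so this is only a simplified illustration with assumed data formats.

```python
import numpy as np

def insert_object(scene_points, object_points, free_location):
    """Translate an external object's point cloud (same column layout as the
    scene, XYZ in the first three columns) to a free spot and append it,
    producing a new synthetic training scene. Collision checking is omitted."""
    shifted = object_points.copy()
    shifted[:, :3] += free_location - object_points[:, :3].mean(axis=0)
    return np.concatenate([scene_points, shifted], axis=0)

def caption_template(obj_name, anchor_name, relation):
    """Simple relational caption; an LLM prompt would be used for diversity."""
    return f"the {obj_name} {relation} the {anchor_name}"
```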
Ablation and efficiency analysis:
- Memory and multi-view fusion: ablations of Mem-ChangingGrounder on ChangingGrounding show that careful memory utilization and minimal targeted exploration suffice for high accuracy at a fraction of the exploration cost (Hu et al., 16 Oct 2025).
- Contrastive and phrase-level supervision: POA and phrase-specific pre-training systematically raise performance and explainability by encouraging explicit reasoning about every mentioned entity (Yuan et al., 2022).
Challenges in grounding:
- Data scarcity and a tendency to overfit to single datasets (Miyanishi et al., 2023).
- Language phenomena such as negation, coreference, numeracy, and compositionality cause marked degradation in current open-vocabulary and LLM-based models (Wang et al., 2 Jan 2025).
6. Limitations, Open Challenges, and Future Directions
Current limitations:
- Dataset dependence: Robustness drops precipitously when transferring between data sources, sensors, and annotation styles (Miyanishi et al., 2023).
- Language complexity: ViGiL3D demonstrates that state-of-the-art methods—both CLIP-based and LLM-based—lose 20–30 points of accuracy on prompts featuring negation, ordinal references, and multi-anchor context (Wang et al., 2 Jan 2025).
- Real-time, streaming, and memory: Most current 3DVG methods presume pre-reconstructed, static scenes, whereas application scenarios demand streaming, memory-centric, and exploration-aware agents (Hu et al., 16 Oct 2025).
- Ambiguity and instance similarity: Outdoor LiDAR-based 3DVG and indoor “multiple” splits remain challenging due to dense clutter and category ambiguity; architectural advances like foreground feature selection and ambiguity-aware supervision offer partial relief (Baek et al., 7 Nov 2024).
- Modality gap: Even with unified representation spaces, residual alignment gaps between vision and language features impair fine-grained reasoning (Zheng et al., 17 Jun 2025).
Future research directions:
- Scalable and linguistically rich datasets: Building 3DVG corpora covering more linguistic phenomena, negative language, and diverse scene types is critical (Wang et al., 2 Jan 2025).
- Self-supervised and continual learning: Leveraging masked segmentation, contrastive losses, and self-supervised objectives to make best use of unlabeled, multi-modal 3D corpora (Wang et al., 16 Jan 2025, Unal et al., 2023).
- Foundation model adaptation and explicit reasoning: Integrating explicit constraints and spatial priors into foundation model pipelines, with dynamic constraint synthesis and appearance reasoning (Yuan et al., 21 Nov 2024).
- Memory, activeness, and embodied reasoning: Further development of agents that coordinate stored memory, targeted exploration, and online interaction for changing or partially observed scenes (Hu et al., 16 Oct 2025).
- Multi-modal (audio, vision, robotics): Audio-3DVG and speech-robust techniques are a nascent but growing frontier (Cao-Dinh et al., 1 Jul 2025, Qi et al., 17 Jun 2025).
7. Summary Table: Major Recent Advances
| Aspect | Method/Work | Key Contribution(s) |
|---|---|---|
| Dataset & Benchmarks | ChangingGrounding (Hu et al., 16 Oct 2025) | Dynamic, memory-driven grounding in changing scenes |
| Data augmentation | AugRefer (Wang et al., 16 Jan 2025) | Cross-modal synthetic augmentation, language-spatial decoder |
| Dense mask grounding | ConcreteNet (Unal et al., 2023) | Bottom-up fusion, learned camera tokens, view ensembling |
| Zero-shot/LLM approaches | SeeGround (Li et al., 5 Dec 2024, Li et al., 28 May 2025), CSVG (Yuan et al., 21 Nov 2024) | 2D VLM rendering, constraint-satisfaction reasoning |
| Multi-view robustness | MVT (Huang et al., 2022), SPAZER (Jin et al., 27 Jun 2025) | Multi-view fusion, agent-based spatial-semantic reasoning |
| Audio/speech extension | Audio-3DVG (Cao-Dinh et al., 1 Jul 2025), SpeechRefer (Qi et al., 17 Jun 2025) | Explicit speech feature modeling, contrastive alignment |
| Cross-domain, open-vocab | LidaRefer (Baek et al., 7 Nov 2024), Cross3DVG (Miyanishi et al., 2023) | Outdoor LiDAR, cross-dataset generalization |
Developments in 3DVG now span from robust, real-time, end-to-end grounding with deep cross-modal fusion, to zero-shot, modular, and programmatic reasoning exploiting foundation models. The research trajectory increasingly emphasizes multimodal understanding, generalization, interpretability, and practical deployment under real-world constraints.