GOV-3D: Open-Vocabulary 3D Scene Understanding
- GOV-3D methods integrate 2D vision-language features into 3D representations, enabling robust open-vocabulary segmentation and precise attribute alignment.
- GOV-3D is a framework that maps 3D point clouds and RGB images into a shared embedding space, supporting arbitrary linguistic queries including attributes, affordances, and relations.
- This approach leverages multi-view fusion, codebook quantization, and contrastive losses to overcome domain shifts and improve cross-domain generalization.
Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) is a class of methods and benchmarks that target the comprehensive interpretation, segmentation, and retrieval of 3D scene elements through open-ended natural language queries beyond closed-set object classes. GOV-3D extends traditional 3D scene understanding by supporting arbitrary linguistic cues, encompassing attribute, affordance, component, abstract, and fine-grained semantic queries, with an emphasis on generality, cross-domain transfer, and zero-shot capability.
1. Formalization and Problem Definition
GOV-3D is defined as the task of mapping a 3D scene—most commonly specified as a point cloud and associated RGB images—to open-vocabulary semantic masks, segmentations, or retrieval outputs based on arbitrary attribute queries expressed as free-form language. The ground truth provides object or region masks and their attributes (e.g., affordance, material, element, synonym) (Zhao et al., 2024).
Mathematically, given a scene S (a point cloud P with posed RGB images {I_v}) and a free-form query q, a GOV-3D model f outputs a prediction set

Ŷ = f(S, q) = {(M_i, a_i)}, i = 1, …, K,

where each prediction (M_i, a_i) associates a subset M_i of the 3D scene with an open-vocabulary attribute a_i, with evaluation via mIoU, mAcc, AP, or task-specific criteria, depending on the benchmark and query aspect.
The distinguishing characteristic is that the query space is not constrained to closed-set categories but spans arbitrary linguistic aspects: affordance (“sit-able”), property (“soft”), type (“communication device”), material (“wood”), component (“two wheels”), or relation (“to the left of the table”) (Zhao et al., 2024, Yu et al., 8 Nov 2025).
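At query time, most GOV-3D systems reduce this formalization to a similarity test in a shared embedding space: per-point features (already aligned to a CLIP-style text space) are compared against an encoded query and thresholded into a mask. A minimal sketch, with illustrative toy features (the `query_mask` helper and the 0.5 threshold are assumptions, not any paper's exact protocol):

```python
import numpy as np

def query_mask(point_feats, text_feat, threshold=0.5):
    """Zero-shot open-vocabulary query: cosine similarity between
    per-point embeddings (assumed pre-aligned to the text space) and
    an encoded query, thresholded into a binary mask."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sim = p @ t                      # (N,) cosine similarities
    return sim > threshold           # boolean mask over the point cloud

# Toy scene: 4 points in a 3-D embedding space; the query direction
# stands in for e.g. the text embedding of "sit-able".
feats = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])
mask = query_mask(feats, query)
```

The same scoring serves retrieval (rank regions by similarity) and segmentation (threshold per point).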
2. Core Methodological Paradigms
GOV-3D systems employ multi-modal fusion, vision-language models (VLMs), and geometric reasoning to transfer open-vocabulary competence from 2D domains into 3D. Prominent approaches include:
- Knowledge Distillation from 2D VLMs: Distill CLIP-aligned features or 2D segmentation network outputs into 3D point features (e.g., OpenSeg/LSeg into sparse convnets or NeRF backbones) (Peng et al., 2022, Guo et al., 2024, Jiang et al., 2024, Li et al., 2024).
- 3D Gaussian Splatting & Quantization: Model the scene with anisotropic Gaussians, using compact quantization schemes or codebooks to inject language alignment without excessive memory cost (Shi et al., 2023, Guo et al., 2024, Chen et al., 2024, Wang et al., 26 Jul 2025, Alegret et al., 19 Aug 2025).
- Geometric Priors & Multi-View Fusion: Exploit 3D superpoints, cost volumes, or superpoint-based voting to impose geometric smoothing and resolve inconsistencies in 2D-to-3D semantic transfer (Wang et al., 2024, Yin et al., 28 Jun 2025, Li et al., 2024).
- Cross-modal Consistency: Employ cross-modal contrastive or feature matching losses, often at multiple levels (pixel, region, instance, scene), to align 2D, 3D, and language features in a unified embedding space (Ding et al., 2022, Wang et al., 28 Apr 2025, Chen et al., 2024, Alegret et al., 19 Aug 2025).
- Explicit Attribute and Multi-Aspect Querying: Make use of attribute-enriched datasets and benchmarks (OpenScan), supporting evaluation and supervision over abstract queries (Zhao et al., 2024).
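The cross-modal consistency objectives above typically take a contrastive form: paired 3D and 2D features should agree, unpaired ones should not. A hedged sketch of a symmetric InfoNCE loss over paired 3D-point and 2D-pixel features (function names, temperature, and the toy data are illustrative, not a specific paper's recipe):

```python
import numpy as np

def infonce(feats_3d, feats_2d, temperature=0.07):
    """Symmetric InfoNCE over paired 3D-point / 2D-pixel features.
    Row i of each matrix is assumed to be a matching pair; the diagonal
    of the similarity matrix holds the positives."""
    a = feats_3d / np.linalg.norm(feats_3d, axis=1, keepdims=True)
    b = feats_2d / np.linalg.norm(feats_2d, axis=1, keepdims=True)
    logits = a @ b.T / temperature            # (N, N) similarity matrix
    labels = np.arange(len(a))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy with the diagonal as positives, in both directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
f2d = rng.normal(size=(8, 16))
loss_aligned = infonce(f2d, f2d)              # identical features: low loss
loss_random = infonce(rng.normal(size=(8, 16)), f2d)
```

Minimizing this pulls matched 2D/3D features together in the shared space while pushing mismatched pairs apart.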
The following table summarizes representative GOV-3D paradigms:
| Method/Framework | Representation | VLM Integration | Geometric Prior |
|---|---|---|---|
| OpenScene (Peng et al., 2022) | Sparse 3D ConvNet | CLIP / OpenSeg distillation | 3D superpoints, multi-view |
| Semantic Gaussians (Guo et al., 2024) | 3D Gaussian Splatting | 2D VLM feature projection | SparseConv 3D backbone |
| OpenOcc (Jiang et al., 2024) | NeRF with occupancy grid | OpenSeg/CLIP-to-3D language field | Occupancy surface + SCP |
| DMA (Li et al., 2024) | Dense 3D–2D Alignment | Dual-path CLIP/mask fusion | Multi-view pixel-point |
| GALA (Alegret et al., 19 Aug 2025) | 3DGS + codebooks | CLIP, SAM cross-attn | MLP-guided codebooks |
| OpenUrban3D (Wang et al., 13 Sep 2025) | 3D point cloud | CLIP via 2D proxies | Multi-granularity rendering & fusion |
These paradigms are unified by (1) decoupling the 3D network from fixed label sets, (2) mapping both points and language into a shared metric space, and (3) leveraging advanced rendering, fusion, or smoothing to address multi-view inconsistency and geometric noise.
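The multi-view fusion step shared by several of these paradigms can be sketched concretely: project each 3D point into every view, sample the 2D feature map at that pixel, and average over the views where the point passes a visibility check. This is a simplified illustration under assumed array shapes, not any framework's exact implementation:

```python
import numpy as np

def fuse_multiview(feat_maps, uvs, visible):
    """Average per-point 2D features over the views where the point is
    visible -- the basic multi-view fusion step behind 2D-to-3D transfer.
    feat_maps: (V, H, W, C) per-view feature images
    uvs:       (V, N, 2) integer pixel coords (u, v) of each point
    visible:   (V, N) boolean visibility (depth-test / frustum check)
    """
    V, N = visible.shape
    C = feat_maps.shape[-1]
    acc = np.zeros((N, C))
    cnt = np.zeros((N, 1))
    for v in range(V):
        u, w = uvs[v, :, 0], uvs[v, :, 1]
        f = feat_maps[v, w, u]             # sample features at pixel coords
        m = visible[v][:, None]
        acc += f * m
        cnt += m
    return acc / np.maximum(cnt, 1)        # points seen nowhere stay zero

# Two 1x2 "feature images", two points; point 1 is hidden in view 1.
maps = np.array([[[[1.0], [2.0]]],
                 [[[3.0], [4.0]]]])        # (V=2, H=1, W=2, C=1)
uvs = np.array([[[0, 0], [1, 0]],
                [[0, 0], [1, 0]]])
vis = np.array([[True, True],
                [True, False]])
fused = fuse_multiview(maps, uvs, vis)
```

Real systems add confidence weighting, occlusion-aware sampling, or superpoint smoothing on top of this plain average.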
3. Key Datasets and Benchmarks
GOV-3D research is driven by both legacy and purpose-built datasets. Notable resources include:
- OpenScan (Zhao et al., 2024): Extends ScanNet200 with 341 fine-grained object attributes across eight linguistic aspects, providing 153,644 attribute-object annotations in 1,513 scenes. Metrics include mIoU, mAcc, and AP for both semantic and instance segmentation, evaluated on attribute queries (not just classes).
- PT-OVS (Wang et al., 26 Jul 2025): Designed for open-vocabulary segmentation on unconstrained Internet photo collections, supports in-the-wild evaluation of long-tail structural queries.
- SegGaussian (Chen et al., 2024): 3DGS-based large-scale annotated dataset for cross-scene and cross-domain generalization.
- SensatUrban, SUM (Wang et al., 13 Sep 2025): Annotation-free large-scale urban benchmarks, supporting zero-shot labeling and cross-city transfer.
- ScanNet-200, Matterport3D, Replica, Mip-NeRF360: Widely adopted indoor/augmented synthetic datasets for segmentation, reconstruction, and novel-view synthesis tasks.
These benchmarks are characterized by the diversity of vocabulary, the prominence of attribute-level ground truth, and an emphasis on open-world, zero-shot evaluation. On OpenScan, state-of-the-art OV-3D methods achieve mIoU values <2% and AP values <16% on attributes—demonstrating the hardness of GOV-3D compared to class-based tasks, where mIoU routinely exceeds 45% (Zhao et al., 2024).
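The mIoU figures quoted throughout are computed per class over per-point labels and then averaged. A minimal reference implementation (toy labels; benchmark evaluators additionally handle ignore labels and instance matching):

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean intersection-over-union over per-point class labels,
    averaging the per-class IoU and skipping classes absent from
    both prediction and ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
score = miou(pred, gt, num_classes=2)      # IoU(0) = 1/2, IoU(1) = 2/3
```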
4. Semantic Transfer and Architectural Innovations
Distillation of 2D VLMs into 3D
- OpenOcc (Jiang et al., 2024): Distills frozen 2D language features into an implicit 3D language field parameterized by a semantic grid and decoder. A volume rendering process aligns per-ray semantic signals via a matching loss; refinement is achieved by semantic-aware confidence propagation (SCP) based on Bayesian log-odds fusion to address view inconsistency.
- GOV-NeSF (Wang et al., 2024): Aggregates multi-view 2D features with geometry-aware cost volumes and fuses them through a cross-view attention mechanism for color and semantic blending, yielding a generalizable neural field for zero-shot segmentation without semantic or depth labels.
- Semantic Gaussians (Guo et al., 2024): Projects dense 2D VLM features into each Gaussian in a 3DGS model, then trains a sparse 3D network to predict the corresponding CLIP-aligned semantics, improving efficiency and supporting object part, material, and affordance queries.
Quantization and Codebook Approaches
- Language Embedded Gaussians (Shi et al., 2023): Applies codebook-based quantization of hybrid CLIP-DINO features per Gaussian, with adaptive smoothing to combat high-frequency bias and noise, achieving high mIoU and fast querying.
- OVGaussian (Chen et al., 2024): Generalizes the codebook approach across scenes via Generalizable Semantic Rasterization (GSR) and fused cross-modal consistency losses.
Alignment and Fusion Strategies
- Dense Multimodal Alignment (DMA) (Li et al., 2024): Aligns 3D, 2D, and text via four inclusive losses—point–text, point–2D, pixel–text, and point–caption—anchoring features in CLIP space and supporting multi-label segmentation.
- MVOV3D (Yin et al., 28 Jun 2025): Enables geometry-aware multi-view fusion by correcting VLM noise with per-view region refinement and leveraging region- and caption-level CLIP encodings, making it better suited to open-world categories.
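The "mutually inclusive" multi-label objective referenced for DMA contrasts with softmax cross-entropy: a single point may legitimately match several open-vocabulary labels at once (e.g. "chair", "wooden", and "sit-able"), so each label gets an independent sigmoid BCE term. A hedged sketch with illustrative logits:

```python
import numpy as np

def multilabel_bce(sims, targets):
    """Sigmoid binary cross-entropy per label: each label is scored
    independently, so a point can match several labels at once,
    unlike softmax cross-entropy, which forces exactly one winner."""
    p = 1 / (1 + np.exp(-sims))            # per-label probabilities
    eps = 1e-12
    return -np.mean(targets * np.log(p + eps)
                    + (1 - targets) * np.log(1 - p + eps))

# One point scored against three labels; it genuinely matches two.
sims = np.array([3.0, 2.5, -3.0])          # logits vs. "chair","wooden","metal"
targets = np.array([1.0, 1.0, 0.0])
loss = multilabel_bce(sims, targets)
```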
Cross-Modal Consistency and Self-Distillation
- Geometry Guided Self-Distillation (GGSD) (Wang et al., 2024): Enforces 2D-3D alignment first at point and geometric superpoint level, then improves robustness via 3D self-distillation using EMA teachers and superpoint voting, surpassing the 2D teacher's accuracy.
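The superpoint voting step that GGSD (and other geometry-prior methods) use for smoothing is simple to state: every point adopts the most frequent predicted label within its geometric superpoint, suppressing stray per-point errors from the 2D transfer. A minimal sketch (toy labels; real superpoints come from geometric oversegmentation):

```python
import numpy as np

def superpoint_vote(labels, sp_ids):
    """Majority vote inside each geometric superpoint: every point
    adopts its superpoint's most frequent predicted label, smoothing
    noisy per-point predictions transferred from 2D."""
    out = labels.copy()
    for sp in np.unique(sp_ids):
        members = sp_ids == sp
        vals, counts = np.unique(labels[members], return_counts=True)
        out[members] = vals[counts.argmax()]
    return out

labels = np.array([1, 1, 2, 2, 2, 0])      # one stray prediction per group
sp_ids = np.array([0, 0, 0, 1, 1, 1])
smoothed = superpoint_vote(labels, sp_ids)
```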
Instance-Level, Attribute, and Graph Reasoning
- Bag-of-Embeddings (Arafa et al., 16 Sep 2025): Groups Gaussians at the object level, aggregating multi-view CLIP features per group for attribute and semantic queries via bag similarity, which avoids blending artifacts inherent in alpha compositing.
- Scene Graph Generation (Yu et al., 8 Nov 2025): Constructs open-vocabulary 3D scene graphs with relationships inferred via VLMs, supporting reasoning tasks such as retrieval-augmented question answering and planning via LLMs.
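The bag-of-embeddings idea can be sketched directly: each object keeps its multi-view CLIP features as a set, and its score against a text query aggregates over that set instead of alpha-compositing features into blurred pixels. Max-pooling is one plausible aggregator here (mean is another); the helper and toy vectors are illustrative:

```python
import numpy as np

def bag_similarity(bags, text_feat):
    """Object-level "bag" scoring: an object's score against a query
    is the max cosine similarity over its multi-view feature bag,
    avoiding the blending artifacts of alpha compositing."""
    t = text_feat / np.linalg.norm(text_feat)
    scores = []
    for bag in bags:
        b = bag / np.linalg.norm(bag, axis=1, keepdims=True)
        scores.append(float((b @ t).max()))
    return scores

chair = np.array([[1.0, 0.1], [0.9, 0.0]])   # views of object A
lamp  = np.array([[0.0, 1.0], [0.1, 0.9]])   # views of object B
query = np.array([1.0, 0.0])                  # stand-in text feature, "chair"
scores = bag_similarity([chair, lamp], query)
best = int(np.argmax(scores))                 # index of best-matching object
```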
5. Quantitative Results and Generalization
GOV-3D models demonstrate resilience in open-world and zero-shot settings, outperforming closed-set or narrow-vocabulary baselines and exhibiting improved long-tail and attribute coverage relative to prior art.
- OpenOcc (Jiang et al., 2024): On ScanNet-200, achieves 17.5% mIoU vs. 13.3% (OpenScene), with large gains in small/long-tail classes; on Replica, mIoU=50.5%.
- OVGaussian (Chen et al., 2024): Cross-scene mIoU: 43.84% (3D CSA), open-vocab accuracy: 15.24% (3D OVA), cross-domain accuracy: 18.93% (3D CDA), all outperforming baselines.
- MVOV3D (Yin et al., 28 Jun 2025): On ScanNet-200, attains 14.7% mIoU (compared to best trained 3D baseline ∼7.9%).
- DMA (Li et al., 2024): On ScanNet, moves mIoU from 47.9% (OpenScene-2D3D) to 53.3%, with especially large improvements when using mutually-inclusive BCE over CE, and outperforms on long-tail categories.
- GOI (Qu et al., 2024): On Mip-NeRF360, achieves mIoU ≈ 0.865 vs. the prior best of 0.555, a +31 pp improvement; on Replica, 0.617 vs. 0.470.
- OpenUrban3D (Wang et al., 13 Sep 2025): Annotation-free, zero-shot, achieves mIoU=39.6% on SensatUrban (vs. prior max 12.6% for OpenScene), mIoU=75.4% on SUM (vs. 41.1%).
- OpenScan (Zhao et al., 2024): Leading open-vocabulary methods obtain AP ≈ 9.9–15.8, mIoU ≈ 0.45 on attributes (vs. >47 on classes)—demonstrating the substantial unresolved gap for fine-grained attribute queries.
Ablation studies consistently show that geometry-guided fusion, cross-modal alignment, uncertainty correction, and codebook quantization each significantly enhance robustness and generalization. GOV-3D frameworks routinely outperform closed-set or naively distilled baselines on rare, fine-grained, and attribute-based splits.
6. Limitations, Challenges, and Future Directions
Despite their advances, current GOV-3D techniques exhibit several limitations:
- Attribute and Abstract Query Limitations: Existing vision-language models, including CLIP, perform poorly on abstract, relation-based, or commonsense queries (e.g., “used for cutting”), with low image-text similarity attenuating zero-shot accuracy (Zhao et al., 2024, Li et al., 2024).
- Dependency on 2D Priors: Distillation performance is bottlenecked by the representational power and biases of 2D VLMs; misalignments and inadequate coverage persist for complex 3D environments and outdoor, large-scale, or LiDAR-only data (Guo et al., 2024, Chen et al., 2024, Wang et al., 13 Sep 2025).
- Scalability and Efficiency: Methods employing explicit per-point or per-Gaussian semantic fields face memory challenges (>100k–1M primitives) and require careful quantization, codebook, or instance grouping to remain tractable (Shi et al., 2023, Alegret et al., 19 Aug 2025, Chen et al., 2024).
- Generalization and Domain Shift: While cross-scene and cross-domain transfer substantially surpasses prior art, large drop-offs persist in cross-city or outdoor-vs-indoor transfer when scene statistics diverge or photometric cues are insufficient (Wang et al., 13 Sep 2025, Wang et al., 28 Apr 2025).
Promising future directions include:
- Integration of Enhanced Vision-Language Backbones: Incorporating stronger per-pixel VLMs, attribute-aware or graph-structured language priors, and multimodal LLMs for richer context integration (Li et al., 2024, Chen et al., 2024, Zhao et al., 2024).
- Hierarchical, Adaptive, or Dynamic Representations: Use of hierarchical/adaptive grids, scene graph vectors, or dynamic codebooks to improve spatial resolution and reasoning (Jiang et al., 2024, Chen et al., 2024, Yu et al., 8 Nov 2025).
- Self-Supervised and Weakly Supervised Strategies: Reduce dependency on pre-trained 2D models or dense annotations via weak supervision, multi-task learning, or extensive text-image mining (Wang et al., 13 Sep 2025).
- Extended Modalities and Tasks: Extending GOV-3D to interactive editing, panoptic/part segmentation, captioning, reasoning, and robotics via real-time retrieval-augmented LLMs (Yu et al., 8 Nov 2025, Jiang et al., 2024, Guo et al., 2024).
- Compositional and Adaptive Query Handling: Leverage LLMs to rewrite or adapt attribute queries for improved alignment and coverage (Zhao et al., 2024).
7. Impact and Outlook
GOV-3D establishes a new paradigm for scene understanding—one that removes the constraints of fixed label sets and manual annotation, instead leveraging vision-language transfer, geometric learning, and explicit or implicit 3D representations to answer arbitrary, open-world linguistic queries about real and synthetic scenes. Benchmarks such as OpenScan highlight the profound gap between closed-set and genuine open-set 3D understanding, especially for attributes, relations, and affordances.
Current methods, such as OpenOcc (Jiang et al., 2024), OVGaussian (Chen et al., 2024), DMA (Li et al., 2024), OpenUrban3D (Wang et al., 13 Sep 2025), and others, demonstrate robust gains in generalization, cross-domain transfer, and scene-level reasoning, but attribute-centric GOV-3D—especially for abstract, functional, or compositional queries—remains unsolved. Continued progress hinges on joint advances in scene representation, vision-language alignment, and scalable, adaptive architectures that can bridge the semantic richness of human language with the geometric complexity of the physical world.