Zoo3D₀: Zero-Shot 3D Object Detection
- The paper introduces a training-free, open-vocabulary 3D detection pipeline that uses off-the-shelf 2D models and geometric graph clustering for instance discovery.
- It processes multi-view images and point clouds to back-project 2D masks into 3D, constructing axis-aligned bounding boxes without annotated labels.
- Experimental evaluations on benchmarks like ScanNet200 and ARKitScenes demonstrate state-of-the-art zero-shot 3D scene understanding across diverse modalities.
Zoo3D is a training-free, open-vocabulary 3D object detection framework that identifies and semantically classifies previously unseen object instances in 3D scenes. Unlike prior methods, Zoo3D does not require annotated 3D boxes, semantic labels, or any supervised adaptation at the object or scene level. It leverages off-the-shelf 2D foundation models for mask proposals and open-vocabulary labeling, and performs instance grouping and object discovery with geometric graph clustering. Zoo3D operates on point clouds, posed images, or even unposed image streams, achieving state-of-the-art results in zero-shot 3D scene understanding across established benchmarks (Lemeshko et al., 25 Nov 2025).
1. Pipeline Overview and Workflow
Zoo3D processes multi-view image data or a point cloud to produce 3D bounding boxes with semantic labels entirely at inference time. The key workflow comprises:
- Input Modalities: a point cloud $P \in \mathbb{R}^{N \times 3}$, RGB images $\{I_k\}$, and optionally depth maps $\{D_k\}$, camera extrinsics $\{T_k\}$, and intrinsics $\{K_k\}$. For posed or unposed image streams, 3D geometry is reconstructed using DUSt3R (zero-shot TSDF and pose estimation).
- 2D Mask Proposal: A class-agnostic 2D mask generator (e.g., SAM 2.1 Hiera-L or MaskClustering's 2D stage) predicts per-frame object masks $\{m_{k,i}\}$.
- 3D Back-Projection: Each mask $m_{k,i}$ is lifted to 3D, gathering the points $P_{k,i} = \{\, p \in P : \pi_k(p) \in m_{k,i} \,\}$, where $\pi_k$ denotes projection into view $k$ (a code sketch follows this list).
- Mask-Graph Construction: Masks are nodes in a graph. Edges are drawn between masks whose 3D projections "agree" according to the view-consensus rate $\rho$ exceeding a threshold $\tau$.
- Graph Clustering: Connected components yield instance hypotheses; each cluster aggregates agreeing 2D masks.
- 3D Bounding Box Extraction: For each mask cluster $C$, points are aggregated as $P_C = \bigcup_{(k,i) \in C} P_{k,i}$. Axis-aligned box parameters (center and size) are determined from the coordinate-wise minima and maxima of $P_C$.
- Open-Vocabulary Label Assignment: For each box, best-view selection, view-consensus mask refinement, and CLIP-based embedding assign a class label via cosine similarity to candidate class names.
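The back-projection step can be made concrete with a short sketch. The helper below is a minimal sketch assuming a world-frame point cloud, pinhole intrinsics, a world-to-camera extrinsic matrix, and a per-frame binary mask; all function and argument names are illustrative rather than taken from the paper. It returns the indices of scene points whose projections land inside the mask, with an optional depth-based occlusion test of the kind used later for best-view selection.

```python
import numpy as np

def backproject_mask(points, K, T_wc, mask, depth=None, depth_tol=0.02):
    """Indices of scene points whose projection falls inside a 2D mask.

    points : (N, 3) point cloud in world coordinates.
    K      : (3, 3) camera intrinsics.
    T_wc   : (4, 4) world-to-camera extrinsics.
    mask   : (H, W) boolean instance mask for this view.
    depth  : optional (H, W) depth map (meters) for occlusion filtering.
    depth_tol : assumed depth tolerance (here 2 cm).
    """
    # Transform world points into the camera frame.
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (T_wc @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6

    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    u = uv[:, 0] / np.maximum(z, 1e-6)
    v = uv[:, 1] / np.maximum(z, 1e-6)

    H, W = mask.shape
    in_image = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ui = np.clip(u.astype(int), 0, W - 1)
    vi = np.clip(v.astype(int), 0, H - 1)
    inside = in_image & mask[vi, ui]

    # Optional occlusion test: keep points whose camera-frame depth agrees
    # with the sensor depth map at the projected pixel.
    if depth is not None:
        inside &= np.abs(z - depth[vi, ui]) < depth_tol

    return np.flatnonzero(inside)
```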
This staged, training-free pipeline establishes a new standard for open-vocabulary, zero-shot 3D scene-level detection (Lemeshko et al., 25 Nov 2025).
2. Graph Clustering for Instance Grouping in 3D
Object instance discovery across image views is achieved by constructing and clustering a mask agreement graph. Nodes represent 2D masks projected to 3D. Edges are present only if masks agree, measured by the view-consensus rate

$$\rho(i,j) = \frac{\left|\mathcal{V}^{\mathrm{agree}}_{ij}\right|}{\left|\mathcal{V}^{\mathrm{obs}}_{ij}\right|},$$

the fraction of views observing both masks' 3D points in which the two masks are seen as a single object. An affinity matrix encodes these pairwise agreements:

$$A_{ij} = \mathbb{1}\!\left[\rho(i,j) \ge \tau\right],$$

with $\tau = 0.9$.

Spectral clustering is formulated with the degree matrix $D = \operatorname{diag}\!\big(\textstyle\sum_j A_{ij}\big)$ and graph Laplacian $L = D - A$. Although eigen-decomposition is suggested, in practice thresholded connected components suffice at high $\tau$.
Clusters correspond to multi-view-consistent instance hypotheses (Lemeshko et al., 25 Nov 2025).
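A minimal sketch of this grouping step, assuming the pairwise view-consensus rates have already been collected into a symmetric matrix (the helper name and interface are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_masks(consensus, tau=0.9):
    """Group 2D masks into instance hypotheses by thresholded view consensus.

    consensus : (M, M) symmetric matrix of pairwise view-consensus rates.
    tau       : agreement threshold (0.9 in the reported configuration).
    Returns an integer cluster label per mask.
    """
    # Binary affinity: an edge exists only where consensus is high enough.
    A = (consensus >= tau).astype(np.int8)
    np.fill_diagonal(A, 0)

    # Connected components of the thresholded graph stand in for the
    # spectral formulation (L = D - A) described above.
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels
```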
3. 3D Bounding Box Construction from Clustered Masks
For every cluster $C$, representing an object hypothesis, the pipeline aggregates all supporting 3D points:

$$P_C = \bigcup_{(k,i) \in C} P_{k,i}.$$

Axis-aligned boxes are computed from the coordinate-wise extrema of $P_C$:

$$c = \tfrac{1}{2}\Big(\min_{p \in P_C} p + \max_{p \in P_C} p\Big), \qquad s = \max_{p \in P_C} p - \min_{p \in P_C} p,$$

where the minima and maxima are taken per axis and $(c, s)$ are the box center and size.
No pose regression beyond axis alignment is performed; thus, orientational subtleties are not captured. This box extraction tightly encloses the detected instance according to the observed spatial extent (Lemeshko et al., 25 Nov 2025).
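Because the box is fully determined by the coordinate-wise extrema of the cluster's points, the extraction step reduces to a few lines; the helper below is a minimal sketch with illustrative names:

```python
import numpy as np

def axis_aligned_box(cluster_points):
    """Tightest axis-aligned box around the aggregated cluster points.

    cluster_points : (M, 3) union of back-projected points for one cluster.
    Returns (center, size), each a length-3 array.
    """
    lo = cluster_points.min(axis=0)
    hi = cluster_points.max(axis=0)
    return 0.5 * (lo + hi), hi - lo
```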
4. Open-Vocabulary Semantic Labeling Mechanism
Semantic labeling is performed using vision-language foundation models. The process includes:
- Best-View Selection: For each 3D cluster, points within the box are projected into all images. Occlusion filtering retains only points visible in a view: a point with projected pixel coordinate $(u, v)$ and camera-frame depth $z$ is accepted if $|z - D_k(u, v)| < \delta$ (e.g., $\delta = 2$ cm). The views maximizing the number of visible points are kept (five in the reported configuration); this step is sketched in code after the list.
- Mask Refinement and Multi-Scale Cropping: Refined masks are produced for each selected view using SAM and cropped at three scales.
- CLIP Embedding and Classification: For each crop, an image embedding is computed with CLIP ViT-H/14, and the crop embeddings are averaged into a single descriptor $\bar{e}$. Text embeddings $t_c$ of the candidate class names are computed likewise. Similarity is $s_c = \cos(\bar{e}, t_c) = \dfrac{\bar{e} \cdot t_c}{\lVert \bar{e} \rVert \, \lVert t_c \rVert}$, and the class with maximal $s_c$ is assigned to the 3D box (Lemeshko et al., 25 Nov 2025).
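The view selection and label assignment can be sketched as follows, assuming the per-view visibility counts (from the occlusion test above) and the CLIP image and text embeddings have already been computed; helper names and signatures are illustrative, not taken from the paper:

```python
import numpy as np

def select_best_views(visible_counts, k=5):
    """Indices of the k views with the most visible (unoccluded) box points."""
    return np.argsort(visible_counts)[::-1][:k]

def classify_box(crop_embeddings, class_embeddings, class_names):
    """Assign an open-vocabulary label to one detected box.

    crop_embeddings  : (n_crops, d) CLIP image embeddings of the multi-scale
                       crops taken from the selected best views.
    class_embeddings : (n_classes, d) CLIP text embeddings of candidate names.
    class_names      : list of n_classes candidate label strings.
    """
    # Average the crop embeddings into a single box descriptor.
    e = crop_embeddings.mean(axis=0)
    e = e / np.linalg.norm(e)

    # Cosine similarity against every candidate class embedding.
    t = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = t @ e
    best = int(np.argmax(sims))
    return class_names[best], float(sims[best])
```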
5. Implementation Details and Hyperparameters
Relevant implementation aspects and parameters include:
| Component | Value / Model Used | Notes |
|---|---|---|
| 2D mask proposal | SAM 2.1 (Hiera-L) / MaskClustering 2D stage | Class-agnostic foundation models |
| 3D reconstruction (unposed input) | DUSt3R | Zero-shot 2D-to-3D front-end |
| Vision-language model | CLIP ViT-H/14 | Embeddings for open-vocabulary transfer |
| Mask-graph consensus threshold $\tau$ | 0.9 | High threshold for view consensus |
| Occlusion depth tolerance $\delta$ | e.g., 2 cm | Filters distant back-projections |
| Best views per box | 5 | Used in semantic labeling |
| Cropping scales | 3 | Multi-scale crops for CLIP embedding |
| Voxel size | 2 cm | Used in DUSt3R reconstruction / box generation |
| Non-maximum suppression | IoU = 0.5 | Applied post-detection |
The pipeline leverages off-the-shelf models without additional 3D training and inherits key parameters from its constituent modules (Lemeshko et al., 25 Nov 2025).
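For reference, these settings can be bundled into a single configuration object; the dataclass below is an illustrative collection of the values listed in the table above, not an interface defined by the paper:

```python
from dataclasses import dataclass

@dataclass
class Zoo3DConfig:
    """Assumed bundling of the hyperparameters reported above."""
    mask_model: str = "SAM 2.1 Hiera-L"   # class-agnostic 2D mask proposals
    recon_model: str = "DUSt3R"           # zero-shot 2D-to-3D front-end (unposed input)
    clip_model: str = "CLIP ViT-H/14"     # open-vocabulary embeddings
    consensus_tau: float = 0.9            # mask-graph agreement threshold
    occlusion_tol_m: float = 0.02         # 2 cm depth tolerance
    num_best_views: int = 5               # views used for semantic labeling
    num_crop_scales: int = 3              # multi-scale crops per view
    voxel_size_m: float = 0.02            # 2 cm voxel size
    nms_iou: float = 0.5                  # post-detection NMS threshold
```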
6. Experimental Evaluation
Zoo3D sets a new benchmark in zero-shot 3D detection:
- ScanNet200 (unseen 200 classes): evaluated under three input settings, point cloud with posed images, posed images only, and unposed images only, with accuracy highest when point-cloud input is available and decreasing as geometric information is reduced.
- ARKitScenes (17 classes): evaluated with point-cloud input.
Comparisons demonstrate that Zoo3D exceeds all prior self-supervised and many supervised methods on common open-vocabulary 3D detection tasks. The improvement when moving from unposed images to posed/image+point cloud settings quantifies the importance of geometric alignment in this zero-shot context (Lemeshko et al., 25 Nov 2025).
7. Strengths, Limitations, and Prospective Extensions
Zoo3D achieves true zero-shot 3D object detection—requiring no annotated boxes or scene-level training and functioning across point clouds, posed, and unposed image streams. It utilizes 2D and vision-language foundation models as black-box components, offering strong cross-modality generalization.
Key limitations include high inference cost due to graph clustering and dense 3D reconstruction, restriction to axis-aligned boxes (no orientation regression), and dependence on the quality of upstream mask and vision-language encoders. Failure modes are most evident under heavy occlusion or for objects outside the training distributions of the underlying 2D/CLIP models.
Potential research directions include development of end-to-end, lightweight clustering modules; integration of orientation regression techniques, possibly guided by scene priors or LLMs; acceleration via advanced SLAM or NeRF-based pipelines; and new architectures fusing 2D-3D cues earlier in the detection stack (Lemeshko et al., 25 Nov 2025).
In summary, Zoo3D demonstrates that robust open-vocabulary 3D object detection at the scene level is achievable by composing existing foundation models with principled mask-graph clustering and semantic fusion, all without explicit 3D training.