Zoo3D₀: Zero-Shot 3D Object Detection

Updated 2 December 2025
  • The paper introduces a training-free, open-vocabulary 3D detection pipeline that uses off-the-shelf 2D models and geometric graph clustering for instance discovery.
  • It processes multi-view images and point clouds to back-project 2D masks into 3D, constructing axis-aligned bounding boxes without annotated labels.
  • Experimental evaluations on benchmarks like ScanNet200 and ARKitScenes demonstrate state-of-the-art zero-shot 3D scene understanding across diverse modalities.

Zero-Shot Zoo3D₀ is a training-free, open-vocabulary 3D object detection framework that identifies and semantically classifies previously unseen object instances in 3D scenes. Unlike prior methods, Zoo3D₀ does not require annotated 3D boxes, semantic labels, or any supervised adaptation at the object or scene level. It leverages off-the-shelf 2D foundation models for mask proposal and open-vocabulary labeling, performing instance grouping and object discovery using geometric graph clustering. Zoo3D₀ operates on point clouds, posed images, or even unposed image streams, achieving state-of-the-art results in zero-shot 3D scene understanding across established benchmarks (Lemeshko et al., 25 Nov 2025).

1. Pipeline Overview and Workflow

Zoo3D₀ processes multi-view image data or a point cloud to produce 3D bounding boxes with semantic labels entirely at inference time. The key workflow comprises:

  • Input Modalities: Point clouds $P = \{p_i\}_{i=1}^n \subset \mathbb{R}^3$, RGB images $\{I_t\}_{t=1}^T$, optionally depth maps $\{D_t\}$ and camera extrinsics $\{R_t\}$, intrinsics $K$. For posed or unposed image streams, 3D geometry is reconstructed using DUSt3R (zero-shot TSDF and pose estimation).
  • 2D Mask Proposal: A class-agnostic 2D mask generator (e.g., SAM 2.1 Hiera-L or MaskClustering’s 2D stage) predicts per-frame object masks $m_{t,i} \subset I_t$.
  • 3D Back-Projection: Each mask $m_{t,i}$ is lifted to 3D, gathering points $P_{t,i} = \{p \in P \mid \pi(K R_t [p;1]) \in m_{t,i}\}$, where $\pi([x,y,w]^\top) = (x/w, y/w)$; see the sketch at the end of this section.
  • Mask-Graph Construction: Masks are nodes in a graph. Edges are drawn between masks whose 3D projections “agree” according to the view-consensus rate $cr(m, m') \geq \tau_\text{rate}$.
  • Graph Clustering: Connected components yield instance hypotheses; each cluster aggregates agreeing 2D masks.
  • 3D Bounding Box Extraction: For each mask cluster $C_g$, points are aggregated as $S_g = \bigcup_{m \in C_g} P_m$. Axis-aligned box parameters are determined:

$$c_g = \frac{1}{2}\left(\min_{p \in S_g} p + \max_{p \in S_g} p\right), \qquad s_g = \max_{p \in S_g} p - \min_{p \in S_g} p$$

  • Open-Vocabulary Label Assignment: For each box $b_g = (c_g, s_g)$, best-view selection, view-consensus mask refinement, and CLIP-based embedding assign a class label via cosine similarity to candidate class names.

This staged, training-free pipeline establishes a new standard for open-vocabulary, zero-shot 3D scene-level detection (Lemeshko et al., 25 Nov 2025).
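
The back-projection step above (gathering $P_{t,i}$ from a 2D mask) reduces to projecting every scene point into the frame and keeping those that land inside the mask. The snippet below is a minimal NumPy sketch under the stated pinhole model, assuming world-to-camera extrinsics $R_t$ given as a 3×4 matrix; function and variable names are illustrative, not from the released code.

```python
import numpy as np

def back_project_mask(points, mask, K, R_t, eps=1e-8):
    """Collect the 3D points whose projection falls inside a 2D mask.

    points : (n, 3) scene point cloud P
    mask   : (H, W) boolean per-frame object mask m_{t,i}
    K      : (3, 3) camera intrinsics
    R_t    : (3, 4) world-to-camera extrinsics for frame t
    """
    # Homogeneous world coordinates [p; 1] and projection K R_t [p; 1].
    homog = np.hstack([points, np.ones((points.shape[0], 1))])   # (n, 4)
    cam = (K @ R_t @ homog.T).T                                   # (n, 3) = [x, y, w]
    in_front = cam[:, 2] > eps                                    # discard points behind the camera

    # Perspective divide pi([x, y, w]) = (x / w, y / w).
    uv = cam[:, :2] / np.maximum(cam[:, 2:3], eps)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    H, W = mask.shape
    in_image = (u >= 0) & (u < W) & (v >= 0) & (v < H) & in_front
    inside = np.zeros(points.shape[0], dtype=bool)
    inside[in_image] = mask[v[in_image], u[in_image]]
    return points[inside]                                         # P_{t,i}
```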

2. Graph Clustering for Instance Grouping in 3D

Object instance discovery across image views is achieved by constructing and clustering a mask agreement graph. Nodes represent 2D masks projected to 3D. Edges are present only if masks agree, measured by the view-consensus rate:

$$cr(m, m') = \frac{|\{\text{supporter frames: both masks' 3D votes are included in a single mask}\}|}{|\{\text{observer frames: both masks visible}\}|} = \frac{|M(m) \cap M(m')|}{|F(m) \cap F(m')|}$$

An affinity matrix $A \in \mathbb{R}^{M \times M}$ encodes these pairwise agreements:

$$A_{ij} = \begin{cases} cr(m_i, m_j) & \text{if } cr(m_i, m_j) \geq \tau_\text{rate} \\ 0 & \text{otherwise} \end{cases}$$

with $\tau_\text{rate} = 0.9$.

Spectral clustering is formulated with degree matrix $D_{ii} = \sum_j A_{ij}$ and Laplacian $L = D - A$. Although eigen-decomposition is suggested, in practice thresholded connected components suffice at high $\tau_\text{rate}$.
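
A hedged sketch of this clustering step, assuming the per-mask supporter and observer frame sets $M(m_i)$ and $F(m_i)$ have already been computed from the back-projections; the helper name `cluster_masks` is ours, not from the paper's code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_masks(supporter_frames, observer_frames, tau_rate=0.9):
    """supporter_frames[i], observer_frames[i]: sets of frame ids M(m_i), F(m_i)."""
    M = len(supporter_frames)
    A = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            observers = observer_frames[i] & observer_frames[j]
            if not observers:
                continue
            supporters = supporter_frames[i] & supporter_frames[j]
            cr = len(supporters) / len(observers)        # view-consensus rate
            if cr >= tau_rate:                           # keep only high-consensus edges
                A[i, j] = A[j, i] = cr

    # At a high threshold, connected components of the thresholded graph
    # recover the instance clusters without an explicit eigen-decomposition.
    n_clusters, labels = connected_components(csr_matrix(A > 0), directed=False)
    return [np.flatnonzero(labels == g) for g in range(n_clusters)]
```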

Clusters $C_g$ correspond to multi-view-consistent instance hypotheses (Lemeshko et al., 25 Nov 2025).

3. 3D Bounding Box Construction from Clustered Masks

For every cluster $C_g$, representing an object hypothesis, the pipeline aggregates all supporting 3D points:

$$S_g = \bigcup_{m \in C_g} P_m$$

Axis-aligned boxes are computed as:

$$c_g = \frac{1}{2}\left(\min_{p \in S_g} p + \max_{p \in S_g} p\right), \qquad s_g = \max_{p \in S_g} p - \min_{p \in S_g} p$$

No orientation is regressed; boxes are axis-aligned, so object rotation is not captured. The extracted box tightly encloses the detected instance according to its observed spatial extent (Lemeshko et al., 25 Nov 2025).
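
The box computation itself reduces to coordinate-wise minima and maxima over the aggregated points. A minimal NumPy sketch, with the aggregation of $S_g$ assumed done upstream:

```python
import numpy as np

def aabb_from_points(S_g):
    """S_g : (m, 3) union of back-projected points supporting one cluster."""
    lo = S_g.min(axis=0)
    hi = S_g.max(axis=0)
    c_g = 0.5 * (lo + hi)   # box center
    s_g = hi - lo           # box size along x, y, z (axis-aligned, no rotation)
    return c_g, s_g
```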

4. Open-Vocabulary Semantic Labeling Mechanism

Semantic labeling is performed using foundation vision-language models. The process includes:

  • Best-View Selection: For each 3D cluster, points $P_g$ within the box are projected into all images. Occlusion filtering retains only points visible in each view, where for projected coordinate $u_{t,i}$:

$$a_{t,i} = K^{-1}[u_{t,i};1] \cdot D_t(u_{t,i})$$

Points are accepted when $\|p_i - (R_t^{-1}[a_{t,i};1])_{xyz}\| < \tau_\text{occ}$. The $k = 5$ views maximizing the number of visible points $|U_g^t|$ are kept.

  • Mask Refinement and Multi-Scale Cropping: Refined masks $\hat{m}^t$ are produced for each selected view using SAM, cropped at three scales $\alpha \in \{0.8, 1.0, 1.2\}$.
  • CLIP Embedding and Classification: For each crop, the image embedding $f_{t,\alpha}$ is computed by CLIP ViT-H/14 and averaged to $f_g$. Candidate label embeddings $e_j$ are computed from the class names. Similarity is

$$s_j = \cos(f_g, e_j) = \frac{f_g \cdot e_j}{\|f_g\| \|e_j\|}$$

The class with maximal $s_j$ is assigned to the 3D box (Lemeshko et al., 25 Nov 2025).
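
A sketch of this final assignment, assuming the CLIP image embeddings of the multi-scale crops from the $k$ best views and the text embeddings of the candidate class names have already been computed (e.g., with CLIP ViT-H/14); only the averaging and cosine-similarity argmax from the formula above are shown, and the function name is illustrative.

```python
import numpy as np

def assign_label(crop_embeddings, class_embeddings, class_names):
    """
    crop_embeddings  : (k * num_scales, d) image embeddings f_{t,alpha}
    class_embeddings : (C, d) text embeddings e_j for the candidate class names
    """
    f_g = crop_embeddings.mean(axis=0)                  # average over views and scales
    f_g = f_g / np.linalg.norm(f_g)
    e = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    s = e @ f_g                                         # cosine similarities s_j
    return class_names[int(np.argmax(s))], s            # argmax class and full score vector
```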

5. Implementation Details and Hyperparameters

Relevant implementation aspects and parameters include:

| Component | Value / Model Used | Notes |
| --- | --- | --- |
| 2D masking | SAM 2.1 (Hiera-L) / MaskClustering | Class-agnostic foundation models |
| 3D reconstruction (unposed) | DUSt3R | Zero-shot 2D$\to$3D front-end |
| Vision-language model | CLIP ViT-H/14 | Embeddings for open-vocabulary transfer |
| Mask-graph threshold $\tau_\text{rate}$ | 0.9 | High threshold for view consensus |
| Occlusion threshold $\tau_\text{occ}$ | e.g., 2 cm | Filters distant back-projections |
| Best views $k$ | 5 | Used in semantic labeling |
| Cropping scales | $\alpha \in \{0.8, 1.0, 1.2\}$ | Multi-scale crops for embedding |
| Voxel size | 2 cm | Used in DUSt3R reconstruction / box generation |
| Non-maximum suppression | IoU = 0.5 | Applied post-detection |

The pipeline leverages off-the-shelf models without additional 3D training and inherits key parameters from its constituent modules (Lemeshko et al., 25 Nov 2025).
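
For reference, the hyperparameters in the table could be collected into a single configuration object along these lines; this is a hypothetical illustration, and the field names do not come from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Zoo3DConfig:
    mask_model: str = "SAM 2.1 Hiera-L"      # class-agnostic 2D mask proposals
    recon_model: str = "DUSt3R"              # zero-shot reconstruction for unposed inputs
    clip_model: str = "ViT-H/14"             # open-vocabulary embeddings
    tau_rate: float = 0.9                    # view-consensus threshold for graph edges
    tau_occ: float = 0.02                    # occlusion threshold in metres (2 cm)
    k_best_views: int = 5                    # views kept for semantic labeling
    crop_scales: tuple = (0.8, 1.0, 1.2)     # multi-scale crops around refined masks
    voxel_size: float = 0.02                 # 2 cm voxelization
    nms_iou: float = 0.5                     # post-detection non-maximum suppression
```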

6. Experimental Evaluation

Zoo3D₀ sets a new benchmark in zero-shot 3D detection:

  • ScanNet200 (Unseen 200 Classes)
    • Point cloud + posed images: $\text{mAP}_{25} = 21.1$, $\text{mAP}_{50} = 14.1$
    • Posed images only: $\text{mAP}_{25} = 14.3$, $\text{mAP}_{50} = 6.2$
    • Unposed images only: $\text{mAP}_{25} = 8.3$, $\text{mAP}_{50} = 2.9$
  • ARKitScenes (17 Classes)
    • Zoo3D₀: $\text{mAP}_{25} = 34.7$, $\text{mAP}_{50} = 23.9$ (point-cloud input)

Comparisons demonstrate that Zoo3D₀ exceeds all prior self-supervised and many supervised methods on common open-vocabulary 3D detection tasks. The improvement when moving from unposed images to posed-image and image-plus-point-cloud settings quantifies the importance of geometric alignment in this zero-shot context (Lemeshko et al., 25 Nov 2025).

7. Strengths, Limitations, and Prospective Extensions

Zoo3D₀ achieves true zero-shot 3D object detection—requiring no annotated boxes or scene-level training and functioning across point clouds, posed, and unposed image streams. It utilizes 2D and vision-language foundation models as black-box components, offering strong cross-modality generalization.

Key limitations include high inference cost due to graph clustering and dense 3D reconstruction, restriction to axis-aligned boxes (no orientation regression), and dependence on the quality of upstream mask and vision-language encoders. Failure modes are most evident under heavy occlusion or for objects outside the training distributions of the underlying 2D/CLIP models.

Potential research directions include development of end-to-end, lightweight clustering modules; integration of orientation regression techniques, possibly guided by scene priors or LLMs; acceleration via advanced SLAM or NeRF-based pipelines; and new architectures fusing 2D-3D cues earlier in the detection stack (Lemeshko et al., 25 Nov 2025).

In summary, Zoo3D₀ demonstrates that robust open-vocabulary 3D object detection at the scene level is achievable by compositing existing foundation models with principled mask-graph clustering and semantic fusion, all without explicit 3D training.
