Zoo3D₀: Zero-Shot 3D Object Detection
- The paper introduces a training-free, open-vocabulary 3D detection pipeline that uses off-the-shelf 2D models and geometric graph clustering for instance discovery.
- It processes multi-view images and point clouds to back-project 2D masks into 3D, constructing axis-aligned bounding boxes without annotated labels.
- Experimental evaluations on benchmarks like ScanNet200 and ARKitScenes demonstrate state-of-the-art zero-shot 3D scene understanding across diverse modalities.
Zoo3D is a training-free, open-vocabulary 3D object detection framework that identifies and semantically classifies previously unseen object instances in 3D scenes. Unlike prior methods, Zoo3D does not require annotated 3D boxes, semantic labels, or any supervised adaptation at the object or scene level. It leverages off-the-shelf 2D foundation models for mask proposals and open-vocabulary labeling, and performs instance grouping and object discovery with geometric graph clustering. Zoo3D operates on point clouds, posed images, or even unposed image streams, achieving state-of-the-art results in zero-shot 3D scene understanding across established benchmarks (Lemeshko et al., 25 Nov 2025).
1. Pipeline Overview and Workflow
Zoo3D processes multi-view image data or a point cloud to produce 3D bounding boxes with semantic labels entirely at inference time. The key workflow comprises:
- Input Modalities: a point cloud $P \in \mathbb{R}^{N \times 3}$, RGB images $\{I_k\}$, and optionally depth maps $\{D_k\}$, camera extrinsics $\{T_k\}$, and intrinsics $\{K_k\}$. For posed or unposed image streams, 3D geometry is reconstructed using DUSt3R (zero-shot TSDF and pose estimation).
- 2D Mask Proposal: A class-agnostic 2D mask generator (e.g., SAM 2.1 Hiera-L or MaskClustering's 2D stage) predicts per-frame object masks $\{m_{k,i}\}$.
- 3D Back-Projection: Each mask $m_{k,i}$ is lifted to 3D, gathering the points $P_{k,i} = \{\, p \in P : \pi_k(p) \in m_{k,i} \,\}$, where $\pi_k$ denotes projection into view $k$ (a code sketch follows this list).
- Mask-Graph Construction: Masks are nodes in a graph. Edges are drawn between masks whose 3D projections "agree" according to the view-consensus rate $\rho$ exceeding a threshold $\tau$.
- Graph Clustering: Connected components yield instance hypotheses; each cluster aggregates agreeing 2D masks.
- 3D Bounding Box Extraction: For each mask cluster $C$, points are aggregated as $P_C = \bigcup_{(k,i) \in C} P_{k,i}$. Axis-aligned box parameters (center and size) are determined from the coordinate-wise minima and maxima of $P_C$.
- Open-Vocabulary Label Assignment: For each box, best-view selection, view-consensus mask refinement, and CLIP-based embedding assign a class label via cosine similarity to candidate class names.
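The back-projection step can be made concrete with a short sketch. The helper below is a minimal sketch assuming a world-frame point cloud, pinhole intrinsics, a world-to-camera extrinsic matrix, and a per-frame binary mask; all function and argument names are illustrative rather than taken from the paper. It returns the indices of scene points whose projections land inside the mask, with an optional depth-based occlusion test of the kind used later for best-view selection.

```python
import numpy as np

def backproject_mask(points, K, T_wc, mask, depth=None, depth_tol=0.02):
    """Indices of scene points whose projection falls inside a 2D mask.

    points : (N, 3) point cloud in world coordinates.
    K      : (3, 3) camera intrinsics.
    T_wc   : (4, 4) world-to-camera extrinsics.
    mask   : (H, W) boolean instance mask for this view.
    depth  : optional (H, W) depth map (meters) for occlusion filtering.
    depth_tol : assumed depth tolerance (here 2 cm).
    """
    # Transform world points into the camera frame.
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (T_wc @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6

    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    u = uv[:, 0] / np.maximum(z, 1e-6)
    v = uv[:, 1] / np.maximum(z, 1e-6)

    H, W = mask.shape
    in_image = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ui = np.clip(u.astype(int), 0, W - 1)
    vi = np.clip(v.astype(int), 0, H - 1)
    inside = in_image & mask[vi, ui]

    # Optional occlusion test: keep points whose camera-frame depth agrees
    # with the sensor depth map at the projected pixel.
    if depth is not None:
        inside &= np.abs(z - depth[vi, ui]) < depth_tol

    return np.flatnonzero(inside)
```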
This staged, training-free pipeline establishes a new standard for open-vocabulary, zero-shot 3D scene-level detection (Lemeshko et al., 25 Nov 2025).
2. Graph Clustering for Instance Grouping in 3D
Object instance discovery across image views is achieved by constructing and clustering a mask agreement graph. Nodes represent 2D masks projected to 3D. Edges are present only if masks agree, measured by the view-consensus rate

$$\rho(i,j) = \frac{\left|\mathcal{V}^{\mathrm{agree}}_{ij}\right|}{\left|\mathcal{V}^{\mathrm{obs}}_{ij}\right|},$$

the fraction of views observing both masks' 3D points in which the two masks are seen as a single object. An affinity matrix encodes these pairwise agreements:

$$A_{ij} = \mathbb{1}\!\left[\rho(i,j) \ge \tau\right],$$

with $\tau = 0.9$.

Spectral clustering is formulated with the degree matrix $D = \operatorname{diag}\!\big(\textstyle\sum_j A_{ij}\big)$ and graph Laplacian $L = D - A$. Although eigen-decomposition is suggested, in practice thresholded connected components suffice at high $\tau$.
Clusters correspond to multi-view-consistent instance hypotheses (Lemeshko et al., 25 Nov 2025).
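A minimal sketch of this grouping step, assuming the pairwise view-consensus rates have already been collected into a symmetric matrix (the helper name and interface are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_masks(consensus, tau=0.9):
    """Group 2D masks into instance hypotheses by thresholded view consensus.

    consensus : (M, M) symmetric matrix of pairwise view-consensus rates.
    tau       : agreement threshold (0.9 in the reported configuration).
    Returns an integer cluster label per mask.
    """
    # Binary affinity: an edge exists only where consensus is high enough.
    A = (consensus >= tau).astype(np.int8)
    np.fill_diagonal(A, 0)

    # Connected components of the thresholded graph stand in for the
    # spectral formulation (L = D - A) described above.
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels
```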
3. 3D Bounding Box Construction from Clustered Masks
For every cluster $C$, representing an object hypothesis, the pipeline aggregates all supporting 3D points:

$$P_C = \bigcup_{(k,i) \in C} P_{k,i}.$$

Axis-aligned boxes are computed from the coordinate-wise extrema of $P_C$:

$$c = \tfrac{1}{2}\Big(\min_{p \in P_C} p + \max_{p \in P_C} p\Big), \qquad s = \max_{p \in P_C} p - \min_{p \in P_C} p,$$

where the minima and maxima are taken per axis and $(c, s)$ are the box center and size.
No pose regression beyond axis alignment is performed; thus, orientational subtleties are not captured. This box extraction tightly encloses the detected instance according to the observed spatial extent (Lemeshko et al., 25 Nov 2025).
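Because the box is fully determined by the coordinate-wise extrema of the cluster's points, the extraction step reduces to a few lines; the helper below is a minimal sketch with illustrative names:

```python
import numpy as np

def axis_aligned_box(cluster_points):
    """Tightest axis-aligned box around the aggregated cluster points.

    cluster_points : (M, 3) union of back-projected points for one cluster.
    Returns (center, size), each a length-3 array.
    """
    lo = cluster_points.min(axis=0)
    hi = cluster_points.max(axis=0)
    return 0.5 * (lo + hi), hi - lo
```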
4. Open-Vocabulary Semantic Labeling Mechanism
Semantic labeling is performed using vision-language foundation models. The process includes:
- Best-View Selection: For each 3D cluster, points within the box are projected into all images. Occlusion filtering retains only points visible in a view: a point with projected pixel coordinate $(u, v)$ and camera-frame depth $z$ is accepted if $|z - D_k(u, v)| < \delta$ (e.g., $\delta = 2$ cm). The views maximizing the number of visible points are kept (five in the reported configuration); this step is sketched in code after the list.
- Mask Refinement and Multi-Scale Cropping: Refined masks are produced for each selected view using SAM and cropped at three scales.
- CLIP Embedding and Classification: For each crop, an image embedding is computed with CLIP ViT-H/14, and the crop embeddings are averaged into a single descriptor $\bar{e}$. Text embeddings $t_c$ of the candidate class names are computed likewise. Similarity is $s_c = \cos(\bar{e}, t_c) = \dfrac{\bar{e} \cdot t_c}{\lVert \bar{e} \rVert \, \lVert t_c \rVert}$, and the class with maximal $s_c$ is assigned to the 3D box (Lemeshko et al., 25 Nov 2025).
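The view selection and label assignment can be sketched as follows, assuming the per-view visibility counts (from the occlusion test above) and the CLIP image and text embeddings have already been computed; helper names and signatures are illustrative, not taken from the paper:

```python
import numpy as np

def select_best_views(visible_counts, k=5):
    """Indices of the k views with the most visible (unoccluded) box points."""
    return np.argsort(visible_counts)[::-1][:k]

def classify_box(crop_embeddings, class_embeddings, class_names):
    """Assign an open-vocabulary label to one detected box.

    crop_embeddings  : (n_crops, d) CLIP image embeddings of the multi-scale
                       crops taken from the selected best views.
    class_embeddings : (n_classes, d) CLIP text embeddings of candidate names.
    class_names      : list of n_classes candidate label strings.
    """
    # Average the crop embeddings into a single box descriptor.
    e = crop_embeddings.mean(axis=0)
    e = e / np.linalg.norm(e)

    # Cosine similarity against every candidate class embedding.
    t = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = t @ e
    best = int(np.argmax(sims))
    return class_names[best], float(sims[best])
```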
5. Implementation Details and Hyperparameters
Relevant implementation aspects and parameters include:
| Component | Value / Model Used | Notes |
|---|---|---|
| 2D mask proposal | SAM 2.1 (Hiera-L) / MaskClustering 2D stage | Class-agnostic foundation models |
| 3D reconstruction (unposed input) | DUSt3R | Zero-shot 2D-to-3D front-end |
| Vision-language model | CLIP ViT-H/14 | Embeddings for open-vocabulary transfer |
| Mask-graph consensus threshold $\tau$ | 0.9 | High threshold for view consensus |
| Occlusion depth tolerance $\delta$ | e.g., 2 cm | Filters distant back-projections |
| Best views per box | 5 | Used in semantic labeling |
| Cropping scales | 3 | Multi-scale crops for CLIP embedding |
| Voxel size | 2 cm | Used in DUSt3R reconstruction / box generation |
| Non-maximum suppression | IoU = 0.5 | Applied post-detection |
The pipeline leverages off-the-shelf models without additional 3D training and inherits key parameters from its constituent modules (Lemeshko et al., 25 Nov 2025).
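For reference, these settings can be bundled into a single configuration object; the dataclass below is an illustrative collection of the values listed in the table above, not an interface defined by the paper:

```python
from dataclasses import dataclass

@dataclass
class Zoo3DConfig:
    """Assumed bundling of the hyperparameters reported above."""
    mask_model: str = "SAM 2.1 Hiera-L"   # class-agnostic 2D mask proposals
    recon_model: str = "DUSt3R"           # zero-shot 2D-to-3D front-end (unposed input)
    clip_model: str = "CLIP ViT-H/14"     # open-vocabulary embeddings
    consensus_tau: float = 0.9            # mask-graph agreement threshold
    occlusion_tol_m: float = 0.02         # 2 cm depth tolerance
    num_best_views: int = 5               # views used for semantic labeling
    num_crop_scales: int = 3              # multi-scale crops per view
    voxel_size_m: float = 0.02            # 2 cm voxel size
    nms_iou: float = 0.5                  # post-detection NMS threshold
```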
6. Experimental Evaluation
Zoo3D sets a new benchmark in zero-shot 3D detection:
- ScanNet200 (unseen 200 classes): evaluated under three input settings, point cloud with posed images, posed images only, and unposed images only, with accuracy highest when point-cloud input is available and decreasing as geometric information is reduced.
- ARKitScenes (17 classes): evaluated with point-cloud input.
Comparisons demonstrate that Zoo3D exceeds all prior self-supervised and many supervised methods on common open-vocabulary 3D detection tasks. The improvement when moving from unposed images to posed/image+point cloud settings quantifies the importance of geometric alignment in this zero-shot context (Lemeshko et al., 25 Nov 2025).
7. Strengths, Limitations, and Prospective Extensions
Zoo3D achieves true zero-shot 3D object detection—requiring no annotated boxes or scene-level training and functioning across point clouds, posed, and unposed image streams. It utilizes 2D and vision-language foundation models as black-box components, offering strong cross-modality generalization.
Key limitations include high inference cost due to graph clustering and dense 3D reconstruction, restriction to axis-aligned boxes (no orientation regression), and dependence on the quality of upstream mask and vision-language encoders. Failure modes are most evident under heavy occlusion or for objects outside the training distributions of the underlying 2D/CLIP models.
Potential research directions include development of end-to-end, lightweight clustering modules; integration of orientation regression techniques, possibly guided by scene priors or LLMs; acceleration via advanced SLAM or NeRF-based pipelines; and new architectures fusing 2D-3D cues earlier in the detection stack (Lemeshko et al., 25 Nov 2025).
In summary, Zoo3D demonstrates that robust open-vocabulary 3D object detection at the scene level is achievable by composing existing foundation models with principled mask-graph clustering and semantic fusion, all without explicit 3D training.