Zoo3D₁: Self-Supervised 3D Detection
- The paper introduces a self-supervised 3D detection framework that replaces manual annotations with pseudo-labels generated by a preceding zero-shot pipeline.
- It employs a two-stage pipeline with mask-graph clustering and a modified TR3D detector to effectively fuse point clouds and RGB–D imagery for scene-level detection.
- Quantitative results on ScanNet benchmarks demonstrate improved mAP scores, validating the scalability and robustness of iterative label-free refinement.
Zoo3D₁ is a self-supervised 3D object detection framework within the broader Zoo3D family; its chief contribution is scene-level 3D object detection without reliance on human-supplied semantic labels or 3D box annotations. Zoo3D₁ leverages high-quality pseudo ground-truth boxes generated by the preceding zero-shot pipeline Zoo3D₀ and refines detection performance through conventional detector training on these pseudo-labels alone. It establishes the feasibility of iterative, label-free refinement of open-vocabulary 3D detection on real-world, scene-scale data, supports point clouds as well as posed and unposed RGB and RGB–D imagery, and sets new benchmarks on challenging open-world detection tasks (Lemeshko et al., 25 Nov 2025).
1. Zoo3D: System Architecture and Pipeline
Zoo3D₁ is built on a two-stage pipeline (a minimal end-to-end sketch follows this list):
- Pseudo-Label Generation (Zoo3D₀):
- For each scene, given a point cloud and posed RGB–D images, a set of 2D instance masks is predicted for each image.
- A mask graph $G$ is constructed whose nodes are 2D masks and whose edges connect masks with a high view-consensus rate (above a threshold $\tau$).
- Connected components yield 3D instance masks; corresponding axis-aligned 3D bounding boxes are derived by computing the min/max coordinates of the fused 3D points.
- These class-agnostic boxes form “pseudo-ground truth” for the subsequent self-supervised training.
- Self-Supervised Detection (Zoo3D₁):
- Utilizes a modified TR3D detector with open-vocabulary and class-agnostic adaptations.
- Trained to regress box centers and sizes, and to score objectness, using only Zoo3D₀-generated pseudo-labels from unlabeled scenes.
- No direct use of human-supplied object categories or scene labels.
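The overall loop can be summarized in a short sketch. This is a minimal illustration, assuming hypothetical helpers `generate_pseudo_boxes` (the Zoo3D₀ mask-graph stage of Section 2), `detection_loss` (the objectness and box terms of Section 3), and scene fields such as `point_cloud` and `rgbd_frames`; it is not the authors' implementation.

```python
# Minimal two-stage sketch: Zoo3D_0 pseudo-labels, then Zoo3D_1 training.
# All names (generate_pseudo_boxes, detection_loss, scene fields) are hypothetical.

def train_zoo3d1(unlabeled_scenes, detector, optimizer, epochs=12):
    # Stage 1 (Zoo3D_0): class-agnostic pseudo-boxes from 2D instance masks
    # via view-consensus mask-graph clustering (Section 2).
    pseudo_boxes = {
        scene.id: generate_pseudo_boxes(scene.point_cloud, scene.rgbd_frames)
        for scene in unlabeled_scenes
    }

    # Stage 2 (Zoo3D_1): train a class-agnostic detector on pseudo-boxes only;
    # no human-supplied categories or 3D box annotations are used.
    for _ in range(epochs):
        for scene in unlabeled_scenes:
            preds = detector(scene.point_cloud)                    # objectness, centers, sizes
            loss = detection_loss(preds, pseudo_boxes[scene.id])   # focal + DIoU (Section 3)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return detector
```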
The detection head predicts, at each candidate location $x_i$,
- an objectness logit $\hat{o}_i$,
- a center offset $\Delta_i \in \mathbb{R}^3$,
- log-sizes $\hat{s}_i \in \mathbb{R}^3$, yielding box center $c_i = x_i + \Delta_i$, box size $\exp(\hat{s}_i)$, and objectness probability $\sigma(\hat{o}_i)$.
This arrangement enables scalable, annotation-free open-vocabulary 3D detection.
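A minimal PyTorch-style sketch of such a class-agnostic head is shown below. The two-layer MLP, channel width, and tensor layout are assumptions for illustration, not the actual TR3D configuration.

```python
import torch
import torch.nn as nn


class ClassAgnosticHead(nn.Module):
    """Single-scale, class-agnostic detection head sketch: per candidate
    location it predicts an objectness logit, a 3D center offset, and 3D
    log-sizes, then decodes them to boxes as described above."""

    def __init__(self, in_channels: int = 128):
        super().__init__()
        # 1 objectness logit + 3 center offsets + 3 log-sizes = 7 outputs.
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, in_channels),
            nn.ReLU(inplace=True),
            nn.Linear(in_channels, 7),
        )

    def forward(self, feats: torch.Tensor, locations: torch.Tensor):
        # feats: (N, C) features at N candidate locations; locations: (N, 3).
        out = self.mlp(feats)
        obj_logit = out[:, 0]             # objectness logit \hat{o}_i
        delta = out[:, 1:4]               # center offset \Delta_i
        log_size = out[:, 4:7]            # log box sizes \hat{s}_i

        centers = locations + delta       # c_i = x_i + \Delta_i
        sizes = log_size.exp()            # box size = exp(\hat{s}_i)
        objectness = obj_logit.sigmoid()  # \sigma(\hat{o}_i)
        return centers, sizes, objectness, obj_logit
```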
2. Pseudo-Label Construction via View-Consensus Mask Clustering
Pseudo-label generation in Zoo3D₀ proceeds by mask-graph clustering based on 2D–3D view consensus:
- Each 2D mask $m$ is associated with the set of frames in which it appears, $F(m)$.
- For each pair of masks $(m_i, m_j)$, a view-consensus rate $r(m_i, m_j)$ is computed over the frames the two masks share, $F(m_i) \cap F(m_j)$, measuring how consistently their 3D points are covered by a common mask, where $\Omega(m)$ denotes the set of all masks sharing points with $m$.
- Masks with $r(m_i, m_j) \geq \tau$ are joined; connected components in the resulting graph $G$ yield 3D point clusters (see the sketch after this list).
- Each cluster forms a 3D axis-aligned box
$$b = \Big[\min_{p \in P} p,\ \max_{p \in P} p\Big],$$
where $P$ is the set of fused 3D points in the cluster and the min/max are taken per coordinate.
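The clustering step can be sketched as follows. The consensus score here is a simple stand-in (frame-set overlap); the paper's exact view-consensus definition, the threshold value, and the data layout are assumptions, but the graph construction, connected-components step, and min/max box extraction mirror the procedure above.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations


def consensus_rate(frames_i: set, frames_j: set) -> float:
    """Illustrative view-consensus score between two masks, taken here as the
    overlap of the frame sets observing each mask (a stand-in definition)."""
    inter = len(frames_i & frames_j)
    union = len(frames_i | frames_j)
    return inter / union if union else 0.0


def cluster_masks(mask_points, mask_frames, tau=0.5):
    """mask_points: list of (M_k, 3) arrays of fused 3D points per 2D mask;
    mask_frames: list of frame-id sets F(m) per mask; tau: consensus threshold.
    Returns one axis-aligned box (min_xyz, max_xyz) per cluster."""
    n = len(mask_points)
    parent = list(range(n))                      # union-find over mask nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Join masks whose view-consensus rate reaches the threshold.
    for i, j in combinations(range(n), 2):
        if consensus_rate(mask_frames[i], mask_frames[j]) >= tau:
            union(i, j)

    # Connected components -> 3D instance clusters -> axis-aligned boxes.
    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(mask_points[i])
    boxes = []
    for pts in clusters.values():
        p = np.concatenate(pts, axis=0)
        boxes.append((p.min(axis=0), p.max(axis=0)))   # per-axis min/max corners
    return boxes
```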
The best-view selection and mask refinement use CLIP and SAM to assign open-vocabulary semantics to each box without closed-set limitations.
3. Self-Supervised Training Objectives
The Zoo3D₁ model is optimized using only the pseudo-labels produced by Zoo3D₀, employing the following loss formulations:
- Binary Focal Loss for Objectness:
$$\mathcal{L}_{\text{obj}} = -\alpha\,(1 - p_t)^{\gamma} \log p_t, \qquad p_t = \begin{cases} \sigma(\hat{o}) & \text{if } o^{*} = 1,\\ 1 - \sigma(\hat{o}) & \text{otherwise,} \end{cases}$$
where $o^{*}$ denotes the pseudo ground-truth objectness and $\alpha$, $\gamma$ are standard focal-loss hyperparameters.
- Distance-IoU (DIoU) Loss for Box Regression:
$$\mathcal{L}_{\text{DIoU}} = 1 - \mathrm{IoU}(b, b^{*}) + \frac{\rho^{2}(b, b^{*})}{d^{2}},$$
where $\rho(b, b^{*})$ is the distance between the centers of the predicted box $b$ and the pseudo ground-truth box $b^{*}$, and $d$ is the diagonal of their minimal enclosing box.
- Total Loss: the objectness and box-regression terms are summed, $\mathcal{L} = \mathcal{L}_{\text{obj}} + \mathcal{L}_{\text{DIoU}}$ (both terms are sketched in code below).
No category branching is used in the head; all boxes are regressed at a single scale (16 cm).
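Below is a minimal sketch of the two loss terms for axis-aligned 3D boxes in a [min_xyz, max_xyz] parameterization; the α/γ defaults, the box encoding, and the mean reduction are assumptions, not the paper's reported settings.

```python
import torch


def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on objectness logits; targets are pseudo ground-truth
    objectness in {0, 1}. alpha/gamma values are common defaults."""
    p = logits.sigmoid()
    p_t = torch.where(targets > 0.5, p, 1.0 - p)
    alpha_t = torch.where(targets > 0.5,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()


def diou_loss_aabb(pred, target):
    """Distance-IoU loss for axis-aligned 3D boxes given as (N, 6) tensors of
    [min_xyz, max_xyz]: 1 - IoU plus the squared center distance normalized by
    the squared diagonal of the minimal enclosing box."""
    lt = torch.max(pred[:, :3], target[:, :3])            # intersection min corner
    rb = torch.min(pred[:, 3:], target[:, 3:])            # intersection max corner
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    vol_p = (pred[:, 3:] - pred[:, :3]).clamp(min=0).prod(dim=1)
    vol_t = (target[:, 3:] - target[:, :3]).clamp(min=0).prod(dim=1)
    iou = inter / (vol_p + vol_t - inter + 1e-8)

    centers_p = (pred[:, :3] + pred[:, 3:]) / 2
    centers_t = (target[:, :3] + target[:, 3:]) / 2
    rho2 = ((centers_p - centers_t) ** 2).sum(dim=1)      # squared center distance

    enc_min = torch.min(pred[:, :3], target[:, :3])       # minimal enclosing box
    enc_max = torch.max(pred[:, 3:], target[:, 3:])
    d2 = ((enc_max - enc_min) ** 2).sum(dim=1) + 1e-8     # squared diagonal

    return (1.0 - iou + rho2 / d2).mean()
```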
4. Implementation and Training Protocol
Key implementation details of Zoo3D include:
- Backbone: Sparse 3D ResNet (FCAF3D style), features at multiple voxel scales (8, 16, 32, 64 cm).
- Detection Head: Single-scale (16 cm), two-layer linear head for objectness and regression.
- Training Data: 51 ScanNet scenes with pseudo-boxes, no human annotations.
- Input Formats: 2 cm resolution point clouds, posed/unposed RGB–D, and direct image inputs via DUSt3R.
- Optimization: AdamW, one-cycle learning rate schedule (peak 0.01), NMS IoU = 0.5 at inference, batch size as in TR3D.
- Open-Vocabulary Assignment: CLIP ViT-H/14 for label assignment, SAM 2.1 (Hiera-L) for mask refinement.
- Inference: Candidates are filtered by NMS; the semantic label is assigned via the highest mean CLIP similarity (see the sketch after this list).
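The inference-time post-processing can be sketched as follows: greedy NMS over axis-aligned boxes, then open-vocabulary label assignment from precomputed, L2-normalized CLIP embeddings. Feeding per-view crops of each box and averaging their similarities is an assumption about what the "highest-mean" similarity is taken over; the data layout and threshold are likewise illustrative.

```python
import torch


def nms_aabb(boxes, scores, iou_thr=0.5):
    """Greedy NMS over axis-aligned 3D boxes (N, 6) = [min_xyz, max_xyz]."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(int(i))
        if order.numel() == 1:
            break
        rest = order[1:]
        lt = torch.max(boxes[i, :3], boxes[rest, :3])
        rb = torch.min(boxes[i, 3:], boxes[rest, 3:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        vol_i = (boxes[i, 3:] - boxes[i, :3]).prod()
        vol_r = (boxes[rest, 3:] - boxes[rest, :3]).prod(dim=1)
        iou = inter / (vol_i + vol_r - inter + 1e-8)
        order = rest[iou <= iou_thr]                 # drop overlapping candidates
    return keep


def assign_open_vocab_labels(box_crop_embs, text_embs, class_names):
    """box_crop_embs: list over boxes of (V_i, D) CLIP image embeddings, one per
    view of the box crop; text_embs: (K, D) CLIP text embeddings for the
    vocabulary. All embeddings are assumed L2-normalized. Each box receives the
    class whose mean similarity across views is highest."""
    labels = []
    for embs in box_crop_embs:
        sims = embs @ text_embs.t()          # (V_i, K) cosine similarities
        mean_sims = sims.mean(dim=0)         # average over views
        labels.append(class_names[int(mean_sims.argmax())])
    return labels
```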
5. Quantitative Results and Ablations
Zoo3D₁ demonstrates improved open-vocabulary 3D detection over both its zero-shot predecessor Zoo3D₀ and previous open- and closed-set methods.
| Method | SN20 mAP@25 / mAP@50 | SN60 mAP@25 / mAP@50 | SN200 mAP@25 / mAP@50 |
|---|---|---|---|
| Zoo3D₀ (zero-shot) | 34.7 / 23.9 | 27.1 / 18.7 | 21.1 / 14.1 |
| Zoo3D₁ (self-supervised) | 37.2 / 26.3 | 32.0 / 20.8 | 23.5 / 15.2 |
Performance is consistent whether operating on point clouds or posed/unposed RGB–D images. Iterative retraining (i.e., producing new pseudo-labels from the refined detector and retraining) yields further, albeit diminishing, improvements (e.g., [email protected] on unposed images: DUSt3R→Zoo3D₀ 22.4, DUSt3R→Zoo3D₁ 36.1, DUSt3R→Zoo3D₂ 37.6). Prediction quality saturates as the number of input views approaches 45.
Ablation studies confirm that SAM-based mask refinement and multi-scale processing notably benefit recall and higher-IoU predictions. DUSt3R-based reconstruction is markedly superior to alternatives such as DROID-SLAM when processing unposed images.
6. Comparison to Related Self-Supervised Approaches
Zoo3D₁ contrasts with alternative self-supervised 3D detection pipelines in several respects:
- Curiosity-driven 3D Detection (Griffiths et al., 2020): This system addresses 6-DOF object detection without any explicit labels, using analysis-by-synthesis supervision with adversarial "curiosity" critics and differentiable ray tracing, and is fundamentally limited to settings where object geometry is known a priori and scene composition is relatively simple.
- Animal 3D Reconstruction (Kuang et al., 2023): Adopts a two-stage paradigm combining synthetic data supervision and multi-view consistency self-supervision for 3D digitization, focusing on reconstructing mesh, shape, and texture from images, but not performing class-agnostic, open-vocabulary detection at the scene level.
Zoo3D₁ provides a fully scalable, scene-level instance detection system that handles challenging benchmarks without requiring any 3D class, category, or box-level ground-truth data, and it operates with minimal prior geometric constraints.
7. Limitations and Future Directions
Zoo3D₁ depends on the quality of the pseudo-labels generated by Zoo3D₀; systematic errors or ambiguities in graph clustering or mask prediction may propagate during retraining. Mask refinement, multi-scale decision fusion, and accurate pose estimation in unposed settings have substantial influence on final detection quality. A plausible implication is that performance may further benefit from advances in 2D segmentation (SAM, CLIP, foundation models), occlusion reasoning, and self-supervised SLAM for image-only modalities. Iterative retraining shows diminishing returns, suggesting inherent limitations of the pseudo-label regime unless new supervision sources (e.g., richer multi-modal signals or synthetic priors) are integrated (Lemeshko et al., 25 Nov 2025).