
Zoo3D$_1$: Self-Supervised 3D Detection

Updated 2 December 2025
  • The paper introduces a self-supervised 3D detection framework that replaces manual annotations with pseudo-labels generated by a preceding zero-shot pipeline.
  • It employs a two-stage pipeline with mask-graph clustering and a modified TR3D detector to effectively fuse point clouds and RGB–D imagery for scene-level detection.
  • Quantitative results on ScanNet benchmarks demonstrate improved mAP scores, validating the scalability and robustness of iterative label-free refinement.

Zoo3D$_1$ is a self-supervised 3D object detection framework within the broader Zoo3D family. Its chief innovation is performing scene-level 3D object detection without reliance on human-supplied semantic labels or 3D box annotations. Zoo3D$_1$ leverages high-quality pseudo ground-truth labels generated by a preceding zero-shot pipeline (Zoo3D$_0$) and refines detection performance by training a conventional detection model on these pseudo-labels alone. It establishes the feasibility of iterative, label-free refinement of open-vocabulary 3D detection on real-world, scene-scale data, supports point clouds as well as posed and unposed RGB or RGB–D imagery, and sets new benchmarks on challenging open-world detection tasks (Lemeshko et al., 25 Nov 2025).

1. Zoo3D$_1$: System Architecture and Pipeline

Zoo3D$_1$ is built upon a two-stage pipeline:

  1. Pseudo-Label Generation (Zoo3D$_0$):
    • For each scene, given a point cloud $\mathcal{P}$ and $T$ posed RGB–D images, a set of 2D instance masks $\{m_{t,i}\}$ is predicted for each image $I_t$.
    • A mask graph $G$ is constructed whose nodes are 2D masks and whose edges connect masks with a high view-consensus rate $\mathrm{cr}(\cdot,\cdot) \geq \tau_{\mathrm{rate}} = 0.9$.
    • Connected components yield 3D instance masks; the corresponding axis-aligned 3D bounding boxes $b_g = (c_g, s_g)$ are derived from the min/max coordinates of the fused 3D points.
    • These class-agnostic boxes form the "pseudo ground truth" for the subsequent self-supervised training.
  2. Self-Supervised Detection (Zoo3D$_1$):
    • Utilizes a modified TR3D detector with open-vocabulary and class-agnostic adaptations.
    • Trained to regress box centers and sizes and to score objectness, using only Zoo3D$_0$-generated pseudo-labels from unlabeled scenes.
    • No direct use of human-supplied object categories or scene labels.

The detection head predicts, at each candidate location $\hat v_j \in \mathbb{R}^3$,

  • an objectness logit $\tilde z_j$,
  • a center offset $\Delta c_j$,
  • log-sizes $\tilde s_j$,

yielding box center $c_j = \hat v_j + \Delta c_j$, box size $s_j = \exp(\tilde s_j)$, and objectness probability $p_j = \sigma(\tilde z_j)$.

This arrangement enables scalable, annotation-free open-vocabulary 3D detection.
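
To make the parameterization concrete, the following minimal PyTorch sketch decodes raw head outputs into boxes; the function and argument names are illustrative rather than taken from the released code.

```python
import torch

def decode_head_outputs(voxel_centers, logits, center_offsets, log_sizes):
    """Decode raw detection-head outputs into axis-aligned boxes.

    voxel_centers:  (N, 3) candidate locations v̂_j
    logits:         (N,)   objectness logits z̃_j
    center_offsets: (N, 3) predicted offsets Δc_j
    log_sizes:      (N, 3) predicted log-sizes s̃_j
    """
    centers = voxel_centers + center_offsets  # c_j = v̂_j + Δc_j
    sizes = torch.exp(log_sizes)              # s_j = exp(s̃_j), guaranteed positive
    objectness = torch.sigmoid(logits)        # p_j = σ(z̃_j)
    return centers, sizes, objectness
```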

2. Pseudo-Label Construction via View-Consensus Mask Clustering

Pseudo-label generation in Zoo3D$_0$ proceeds by mask-graph clustering based on 2D–3D view consensus:

  • Each 2D mask $m_{t,i}$ is associated with the set of frames in which it appears, $F(m_{t,i})$.
  • For each pair $(m_{t',i}, m_{t'',j})$, the view-consensus rate is

$$\mathrm{cr}(m_{t',i}, m_{t'',j}) = \frac{|M(m_{t',i}) \cap M(m_{t'',j})|}{|F(m_{t',i}) \cap F(m_{t'',j})|}$$

where $M(m_{t,i})$ is the set of all masks sharing 3D points with $m_{t,i}$.

  • Masks with $\mathrm{cr} \ge 0.9$ are joined; connected components in $G$ yield 3D point clusters.
  • Each cluster forms a 3D axis-aligned box:

$$c_g = \frac{1}{2}\left(\min_{p \in P_g} p + \max_{p \in P_g} p\right), \qquad s_g = \max_{p \in P_g} p - \min_{p \in P_g} p$$

where $P_g$ is the set of points in the cluster.
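
A minimal NumPy sketch of this clustering and box-extraction step is given below. It assumes the per-mask sets $F(\cdot)$ (visible frames) and $M(\cdot)$ (masks sharing 3D points) are precomputed, and uses a simple union-find for connected components; all names are illustrative.

```python
import numpy as np

def view_consensus_rate(a, b, frames, overlapping):
    """cr(a, b) = |M(a) ∩ M(b)| / |F(a) ∩ F(b)| as defined above.

    frames[m]      -- set of frame ids in which mask m is visible, F(m)
    overlapping[m] -- set of masks sharing 3D points with mask m, M(m)
    """
    shared_frames = frames[a] & frames[b]
    if not shared_frames:
        return 0.0
    shared_masks = overlapping[a] & overlapping[b]
    return len(shared_masks) / len(shared_frames)

def cluster_masks(masks, frames, overlapping, tau_rate=0.9):
    """Join masks with cr >= tau_rate; connected components give 3D instances."""
    parent = {m: m for m in masks}

    def find(m):  # union-find with path halving
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for i, a in enumerate(masks):
        for b in masks[i + 1:]:
            if view_consensus_rate(a, b, frames, overlapping) >= tau_rate:
                parent[find(a)] = find(b)

    clusters = {}
    for m in masks:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

def cluster_to_box(points):
    """Axis-aligned box (c_g, s_g) from the fused 3D points of one cluster."""
    points = np.asarray(points)                # (K, 3)
    lo, hi = points.min(axis=0), points.max(axis=0)
    return 0.5 * (lo + hi), hi - lo            # center c_g, size s_g
```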

The best-view selection and mask refinement use CLIP and SAM to assign open-vocabulary semantics to each box without closed-set limitations.

3. Self-Supervised Training Objectives

The Zoo3D$_1$ model is optimized using only the pseudo-labels produced by Zoo3D$_0$, employing the following loss formulations:

  • Focal Loss for Objectness:

$$\mathcal{L}_{\mathrm{focal}} = -\sum_j \left[ \alpha_t\, (1 - p_j)^\gamma\, y_j \log p_j + (1-\alpha_t)\, p_j^\gamma\, (1 - y_j)\log(1 - p_j) \right]$$

where $y_j \in \{0,1\}$ denotes pseudo ground-truth objectness, and $\alpha_t, \gamma$ are standard focal-loss hyperparameters.

  • Distance-IoU (DIoU) Loss for Box Regression:

$$\mathcal{L}_{\mathrm{DIoU}} = 1 - \mathrm{IoU}(b_j, b_{gt}) + \frac{\rho^2(c_j, c_{gt})}{d^2}$$

where $\rho$ is the distance between predicted and pseudo ground-truth box centers and $d$ is the diagonal of the minimal enclosing box.

  • Total Loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{focal}} + \mathcal{L}_{\mathrm{DIoU}}$$

No category branching is used in the head; all boxes are regressed at a single scale (16 cm).
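
Both losses are standard; the following PyTorch sketch mirrors the formulations above for axis-aligned boxes encoded as (center, size), with names, defaults, and tensor layouts chosen for illustration rather than taken from the released code.

```python
import torch

def focal_loss(probs, targets, alpha_t=0.25, gamma=2.0):
    """Binary focal loss over per-candidate objectness probabilities p_j."""
    p, y = probs, targets.float()
    pos = alpha_t * (1 - p) ** gamma * y * torch.log(p.clamp(min=1e-6))
    neg = (1 - alpha_t) * p ** gamma * (1 - y) * torch.log((1 - p).clamp(min=1e-6))
    return -(pos + neg).sum()

def diou_loss(pred, gt):
    """Axis-aligned DIoU loss: 1 - IoU + ρ²(c_pred, c_gt) / d².

    pred, gt: (..., 6) tensors holding (cx, cy, cz, sx, sy, sz).
    """
    pc, ps = pred[..., :3], pred[..., 3:]
    gc, gs = gt[..., :3], gt[..., 3:]
    pmin, pmax = pc - ps / 2, pc + ps / 2
    gmin, gmax = gc - gs / 2, gc + gs / 2
    inter = (torch.min(pmax, gmax) - torch.max(pmin, gmin)).clamp(min=0).prod(-1)
    union = ps.prod(-1) + gs.prod(-1) - inter
    iou = inter / union.clamp(min=1e-6)
    rho2 = ((pc - gc) ** 2).sum(-1)                    # squared center distance
    enc = torch.max(pmax, gmax) - torch.min(pmin, gmin)
    d2 = (enc ** 2).sum(-1).clamp(min=1e-6)            # enclosing-box diagonal²
    return (1 - iou + rho2 / d2).mean()
```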

4. Implementation and Training Protocol

Key implementation details of Zoo3D$_1$ include:

  • Backbone: Sparse 3D ResNet (FCAF3D style), features at multiple voxel scales (8, 16, 32, 64 cm).
  • Detection Head: Single-scale (16 cm), two-layer linear head for objectness and regression.
  • Training Data: 51 ScanNet scenes with pseudo-boxes, no human annotations.
  • Input Formats: 2 cm resolution point clouds, posed/unposed RGB–D, and direct image inputs via DUSt3R.
  • Optimization: AdamW, one-cycle learning rate schedule (peak ≈ 0.01), NMS IoU = 0.5 at inference, batch size as in TR3D.
  • Open-Vocabulary Assignment: CLIP ViT-H/14 for label assignment, SAM 2.1 (Hiera-L) for mask refinement.
  • Inference: Candidates filtered by NMS; semantic label via highest-mean CLIP similarity.
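
At inference, the open-vocabulary label assignment reduces to a nearest-text lookup in CLIP embedding space. The sketch below abstracts the CLIP ViT-H/14 feature extraction as precomputed, L2-normalized embeddings; all names are illustrative.

```python
import numpy as np

def assign_open_vocab_label(crop_embeddings, text_embeddings, vocabulary):
    """Assign a semantic label to one detected box.

    crop_embeddings: (V, D) image embeddings of the box's best-view crops
    text_embeddings: (C, D) text embeddings of the query vocabulary
    Both are assumed L2-normalized, so dot products are cosine similarities.
    The label with the highest mean similarity across views is returned.
    """
    sims = crop_embeddings @ text_embeddings.T   # (V, C)
    mean_sims = sims.mean(axis=0)                # average over views
    return vocabulary[int(np.argmax(mean_sims))]
```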

5. Quantitative Results and Ablations

Zoo3D$_1$ demonstrates improved open-vocabulary 3D detection over both its zero-shot predecessor and previous open- and closed-set methods.

| Method | SN20 mAP25 / mAP50 | SN60 mAP25 / mAP50 | SN200 mAP25 / mAP50 |
|--------|--------------------|--------------------|---------------------|
| Zoo3D$_0$ | 34.7 / 23.9 | 27.1 / 18.7 | 21.1 / 14.1 |
| Zoo3D$_1$ | 37.2 / 26.3 | 32.0 / 20.8 | 23.5 / 15.2 |

Performance is consistent whether operating on point clouds or posed/unposed RGB–D images. Iterative retraining (i.e., producing new pseudo-labels from a refined detector and retraining) yields further, albeit diminishing, improvements (e.g., DUSt3R→Zoo3D$_0$ mAP25: 22.4, DUSt3R→Zoo3D$_1$: 36.1, DUSt3R→Zoo3D$_2$: 37.6). Prediction quality saturates as the number of input views approaches 45.
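
The iterative scheme is essentially a self-training loop: generate pseudo-boxes, train a detector on them, then use the refined detector to regenerate pseudo-boxes. A schematic sketch (the callables are hypothetical placeholders, not the released code) is:

```python
def iterative_self_training(scenes, generate_pseudo_labels, train_detector, rounds=2):
    """Label-free iterative refinement: Zoo3D_0 -> Zoo3D_1 -> Zoo3D_2 -> ...

    generate_pseudo_labels(scenes, detector): zero-shot pipeline when detector
        is None, otherwise pseudo-boxes produced by the current trained detector.
    train_detector(scenes, pseudo_boxes): trains a TR3D-style class-agnostic
        detector using only the pseudo-boxes.
    """
    detector = None                                    # round 0: zero-shot Zoo3D_0
    for _ in range(rounds):
        pseudo_boxes = generate_pseudo_labels(scenes, detector)
        detector = train_detector(scenes, pseudo_boxes)
    return detector
```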

Ablation studies confirm that SAM-based mask refinement and multi-scale processing notably benefit recall and higher-IoU predictions. DUSt3R-based reconstruction is markedly superior to alternatives such as DROID-SLAM when processing unposed images.

6. Comparison with Related Self-Supervised 3D Detection Approaches

Zoo3D$_1$ contrasts with alternative self-supervised 3D detection pipelines in several respects:

  • Curiosity-driven 3D Detection (Griffiths et al., 2020): Addresses 6-DOF object detection without any explicit labels, using analysis-by-synthesis supervision with adversarial "curiosity" critics and differentiable ray tracing. It is, however, limited to settings where object geometry is known a priori and scene composition is relatively simple.
  • Animal 3D Reconstruction (Kuang et al., 2023): Adopts a two-stage paradigm combining synthetic-data supervision and multi-view-consistency self-supervision for 3D digitization. It focuses on reconstructing mesh, shape, and texture from images rather than performing class-agnostic, open-vocabulary detection at the scene level.

Zoo3D$_1$ provides a fully scalable, scene-level instance detection system capable of handling challenging benchmarks without requiring any 3D class, category, or box-level ground-truth data, and it operates with minimal prior geometric constraints.

7. Limitations and Future Directions

Zoo3D$_1$ depends on the quality of the pseudo-labels generated by Zoo3D$_0$; systematic errors or ambiguities in graph clustering or mask prediction may propagate during retraining. Mask refinement, multi-scale decision fusion, and accurate pose estimation in unposed settings have substantial influence on final detection quality. A plausible implication is that performance may further benefit from advances in 2D segmentation (SAM, CLIP, foundation models), occlusion reasoning, and self-supervised SLAM for image-only modalities. Iterative retraining shows diminishing returns, suggesting inherent limitations of the pseudo-label regime unless new signal sources (e.g., richer multi-modal supervision or synthetic priors) are integrated (Lemeshko et al., 25 Nov 2025).
