
Zoo3D₁: Self-Supervised 3D Detection

Updated 2 December 2025
  • The paper introduces a self-supervised 3D detection framework that replaces manual annotations with pseudo-labels generated by a preceding zero-shot pipeline.
  • It employs a two-stage pipeline with mask-graph clustering and a modified TR3D detector to effectively fuse point clouds and RGB–D imagery for scene-level detection.
  • Quantitative results on ScanNet benchmarks demonstrate improved mAP scores, validating the scalability and robustness of iterative label-free refinement.

Self-supervised Zoo3D₁ refers to a self-supervised 3D object detection framework within the broader Zoo3D family; its chief innovation is performing scene-level 3D object detection without reliance on human-supplied semantic labels or 3D box annotations. Zoo3D₁ leverages high-quality pseudo ground-truth labels generated by a preceding zero-shot pipeline (Zoo3D₀) and refines detection performance via conventional detector training on these pseudo-labels alone. It establishes the feasibility of iterative, label-free refinement of open-vocabulary 3D detection on real-world, scene-scale data, supports point clouds as well as posed and unposed RGB and RGB–D imagery, and sets new benchmarks on challenging open-world detection tasks (Lemeshko et al., 25 Nov 2025).

1. Zoo3D₁: System Architecture and Pipeline

Zoo3D₁ is built upon a two-stage pipeline:

  1. Pseudo-Label Generation (Zoo3D₀):
    • For each scene, given a point cloud $\mathcal{P}$ and $T$ posed RGB–D images, a set of 2D instance masks $\{m_{t,i}\}$ is predicted for each image $I_t$.
    • A mask graph is constructed whose nodes are 2D masks and whose edges connect masks with a high view-consensus rate.
    • Connected components yield 3D instance masks; the corresponding axis-aligned 3D bounding boxes are derived by computing the per-coordinate min/max of the fused 3D points.
    • These class-agnostic boxes form the "pseudo ground truth" for the subsequent self-supervised training.
  2. Self-Supervised Detection (Zoo3D₁):
    • Uses a modified TR3D detector with open-vocabulary and class-agnostic adaptations.
    • Trained to regress box centers and sizes, and to score objectness, using only Zoo3D₀-generated pseudo-labels from unlabeled scenes.
    • No direct use of human-supplied object categories or scene labels.

The detection head predicts, at each candidate location $p$,

  • an objectness logit $\hat{s}$,
  • a center offset $\Delta\hat{c}$,
  • log-sizes $\log\hat{d}$,

yielding box center $c = p + \Delta\hat{c}$, box size $d = \exp(\log\hat{d})$, and objectness probability $\sigma(\hat{s})$.

This arrangement enables scalable, annotation-free open-vocabulary 3D detection.
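The head-output parametrization above can be sketched as follows; the function and argument names are illustrative, not the actual TR3D/Zoo3D API:

```python
import numpy as np

def decode_boxes(locations, center_offsets, log_sizes, objectness_logits):
    """Decode per-location head outputs into axis-aligned 3D boxes.

    locations:         (N, 3) candidate positions p
    center_offsets:    (N, 3) predicted offsets Δĉ
    log_sizes:         (N, 3) predicted log-sizes
    objectness_logits: (N,)   raw objectness scores ŝ
    """
    centers = locations + center_offsets               # c = p + Δĉ
    sizes = np.exp(log_sizes)                          # d = exp(log d̂) > 0 by construction
    probs = 1.0 / (1.0 + np.exp(-objectness_logits))   # σ(ŝ)
    return centers, sizes, probs
```

Predicting log-sizes rather than raw sizes guarantees positive box extents without any clamping.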

2. Pseudo-Label Construction via View-Consensus Mask Clustering

Pseudo-label generation in Zoo3D₀ proceeds by mask–graph clustering based on 2D–3D view-consensus:

  • Each 2D mask $m$ is associated with $F(m)$, the set of frames in which it appears.
  • For each pair of masks $(m_i, m_j)$, the view-consensus rate is

$$ r(m_i, m_j) = \frac{|\mathcal{N}(m_i) \cap \mathcal{N}(m_j)|}{|\mathcal{N}(m_i) \cup \mathcal{N}(m_j)|} $$

where $\mathcal{N}(m)$ is the set of all masks sharing points with $m$.

  • Masks with $r(m_i, m_j)$ above a threshold are joined; connected components in the resulting graph yield 3D point clusters.
  • Each cluster forms a 3D axis-aligned box:

$$ b = \Big[\min_{x \in \mathcal{C}} x,\ \max_{x \in \mathcal{C}} x\Big] $$

where $\mathcal{C}$ is the set of points in the cluster and the min/max are taken per coordinate.

The best-view selection and mask refinement use CLIP and SAM to assign open-vocabulary semantics to each box without closed-set limitations.
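The clustering and box-extraction steps above can be sketched as follows, assuming each mask is reduced to the set of 3D point indices it covers and using a Jaccard-style overlap as a simplified stand-in for the view-consensus rate; the names and the threshold `tau` are illustrative:

```python
from itertools import combinations
import numpy as np

def cluster_masks(mask_points, tau=0.5):
    """Group 2D masks into 3D instances via consensus-graph connected components.

    mask_points: list of sets, each the 3D point indices covered by one mask.
    Returns a list of merged point-index sets (one per 3D instance).
    """
    n = len(mask_points)
    parent = list(range(n))                       # union-find over mask nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        inter = len(mask_points[i] & mask_points[j])
        union = len(mask_points[i] | mask_points[j])
        if union and inter / union > tau:         # consensus edge -> join components
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).update(mask_points[i])
    return list(clusters.values())

def cluster_to_box(points_xyz):
    """Axis-aligned box from a cluster: per-coordinate min/max of its 3D points."""
    pts = np.asarray(points_xyz, dtype=float)
    return pts.min(axis=0), pts.max(axis=0)
```

Connected components make the grouping transitive: two masks never seen together are still merged if a chain of high-consensus pairs links them.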

3. Self-Supervised Training Objectives

The Zoo3D₁ model is optimized using only the pseudo-labels produced by Zoo3D₀, employing the following loss formulations:

  • Focal Loss for Objectness:

$$ \mathcal{L}_{\mathrm{obj}} = -\alpha\,(1 - p_t)^{\gamma} \log p_t, \qquad p_t = \begin{cases} \sigma(\hat{s}) & \text{if } s^{*} = 1 \\ 1 - \sigma(\hat{s}) & \text{otherwise} \end{cases} $$

where $s^{*}$ denotes pseudo ground-truth objectness, and $\alpha, \gamma$ are standard focal loss hyperparameters.

  • Distance-IoU (DIoU) Loss for Box Regression:

$$ \mathcal{L}_{\mathrm{DIoU}} = 1 - \mathrm{IoU}(b, b^{*}) + \frac{\rho^{2}(c, c^{*})}{d^{2}} $$

where $\rho(c, c^{*})$ is the center distance and $d$ is the diagonal of the minimal enclosing box.

  • Total Loss:

$$ \mathcal{L} = \mathcal{L}_{\mathrm{obj}} + \mathcal{L}_{\mathrm{DIoU}} $$

No category branching is used in the head; all boxes are regressed at a single scale (16 cm).
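The two losses above can be sketched in NumPy as follows, assuming axis-aligned boxes in (min-corner, max-corner) format and standard focal-loss hyperparameters; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Binary focal loss on objectness probability vs. pseudo-label target (0/1)."""
    p_t = np.where(target == 1, prob, 1.0 - prob)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return -(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))

def diou_loss(box_a, box_b):
    """DIoU loss for axis-aligned 3D boxes given as (min_xyz, max_xyz) arrays."""
    a_min, a_max = box_a
    b_min, b_max = box_b
    # Intersection volume (zero if the boxes do not overlap)
    inter = np.prod(np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None))
    vol_a = np.prod(a_max - a_min)
    vol_b = np.prod(b_max - b_min)
    iou = inter / (vol_a + vol_b - inter)
    # Squared center distance ρ² and squared diagonal d² of the enclosing box
    rho2 = np.sum(((a_min + a_max) / 2 - (b_min + b_max) / 2) ** 2)
    enc_diag2 = np.sum((np.maximum(a_max, b_max) - np.minimum(a_min, b_min)) ** 2)
    return 1.0 - iou + rho2 / enc_diag2   # L = 1 - IoU + ρ²/d²
```

The center-distance penalty keeps the gradient informative even when predicted and target boxes do not overlap, where plain IoU loss is flat.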

4. Implementation and Training Protocol

Key implementation details of Zoo3D₁ include:

  • Backbone: Sparse 3D ResNet (FCAF3D style), features at multiple voxel scales (8, 16, 32, 64 cm).
  • Detection Head: Single-scale (16 cm), two-layer linear head for objectness and regression.
  • Training Data: 51 ScanNet scenes with pseudo-boxes, no human annotations.
  • Input Formats: 2 cm resolution point clouds, posed/unposed RGB–D, and direct image inputs via DUSt3R.
  • Optimization: AdamW, one-cycle learning rate schedule (peak learning rate 0.01), NMS IoU = 0.5 at inference, batch size as in TR3D.
  • Open-Vocabulary Assignment: CLIP ViT-H/14 for label assignment, SAM 2.1 (Hiera-L) for mask refinement.
  • Inference: Candidates filtered by NMS; semantic label via highest-mean CLIP similarity.
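The inference-time filtering can be sketched as greedy non-maximum suppression over axis-aligned 3D boxes at the stated IoU threshold of 0.5; the (min-corner, max-corner) box format and function names are assumptions for illustration:

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes, each shaped (2, 3): [min_xyz, max_xyz]."""
    inter = np.prod(np.clip(np.minimum(a[1], b[1]) - np.maximum(a[0], b[0]), 0, None))
    vol = np.prod(a[1] - a[0]) + np.prod(b[1] - b[0]) - inter
    return inter / vol

def nms_3d(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep boxes in descending score order, drop overlaps above iou_thr.

    boxes:  (N, 2, 3) array of [min_xyz, max_xyz] corners
    scores: (N,) objectness probabilities
    Returns the indices of the kept boxes.
    """
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

Each surviving box would then receive its semantic label via highest-mean CLIP similarity, as described above.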

5. Quantitative Results and Ablations

Zoo3D₁ demonstrates improved open-vocabulary 3D detection over both its zero-shot predecessor and previous open- or closed-set methods.

Method    SN20 mAP@25/50   SN60 mAP@25/50   SN200 mAP@25/50
Zoo3D₀    34.7 / 23.9      27.1 / 18.7      21.1 / 14.1
Zoo3D₁    37.2 / 26.3      32.0 / 20.8      23.5 / 15.2

Performance is consistent whether operating on point clouds or posed/unposed RGB–D images. Iterative retraining (i.e., producing new pseudo-labels from refined detectors and retraining) yields further, albeit diminishing, improvements (e.g., DUSt3R→Zoo3D₀ mAP@0.25: 22.4, DUSt3R→Zoo3D₁: 36.1, DUSt3R→Zoo3D₂: 37.6). Prediction quality saturates as the number of input views approaches 45.

Ablation studies confirm that SAM-based mask refinement and multi-scale processing notably improve recall and higher-IoU predictions. For unposed images, DUSt3R-based reconstruction clearly outperforms alternatives such as DROID-SLAM.

6. Comparison with Related Self-Supervised Approaches

Self-supervised Zoo3D₁ contrasts with alternative self-supervised 3D detection pipelines in several respects:

  • Curiosity-driven 3D Detection (Griffiths et al., 2020): This system addresses 6-DOF object detection without any explicit labels, using analysis-by-synthesis supervision with adversarial "curiosity" critics and differentiable ray tracing, and is fundamentally limited to settings where object geometry is known a priori and scene composition is relatively simple.
  • Animal 3D Reconstruction (Kuang et al., 2023): Adopts a two-stage paradigm combining synthetic data supervision and multi-view consistency self-supervision for 3D digitization, focusing on reconstructing mesh, shape, and texture from images, but not performing class-agnostic, open-vocabulary detection at the scene level.

Zoo3D₁, by contrast, provides a fully scalable, scene-level instance detection system capable of handling challenging benchmarks without requiring any class, category, or box-level ground-truth data, and operates with minimal prior geometric constraints.

7. Limitations and Future Directions

Zoo3D₁ depends on the quality of the pseudo-labels generated by Zoo3D₀; systematic errors or ambiguities in graph clustering or mask prediction may propagate during retraining. Mask refinement, multi-scale decision fusion, and accurate pose estimation in unposed settings substantially influence ultimate detection quality. A plausible implication is that performance may further benefit from advances in 2D segmentation (SAM, CLIP, foundation models), occlusion reasoning, and self-supervised SLAM for image-only modalities. Iterative retraining demonstrates diminishing returns, suggesting inherent limitations of the pseudo-label regime unless new signal sources (e.g., richer multi-modal supervision or synthetic priors) are integrated (Lemeshko et al., 25 Nov 2025).
