Zoo3D₁: Self-Supervised 3D Detection
- The paper introduces a self-supervised 3D detection framework that replaces manual annotations with pseudo-labels generated by a preceding zero-shot pipeline.
- It employs a two-stage pipeline with mask-graph clustering and a modified TR3D detector to effectively fuse point clouds and RGB–D imagery for scene-level detection.
- Quantitative results on ScanNet benchmarks demonstrate improved mAP scores, validating the scalability and robustness of iterative label-free refinement.
Zoo3D₁ is a self-supervised 3D object detection framework within the broader Zoo3D family; its chief contribution is scene-level 3D object detection without reliance on human-supplied semantic labels or 3D box annotations. Zoo3D₁ leverages high-quality pseudo ground-truth boxes generated by the preceding zero-shot pipeline Zoo3D₀ and refines detection performance through conventional detector training on these pseudo-labels alone. It establishes the feasibility of iterative, label-free refinement of open-vocabulary 3D detection on real-world, scene-scale data, supports point clouds as well as posed and unposed RGB and RGB–D imagery, and sets new benchmarks on challenging open-world detection tasks (Lemeshko et al., 25 Nov 2025).
1. Zoo3D: System Architecture and Pipeline
Zoo3D₁ is built on a two-stage pipeline (a minimal end-to-end sketch follows this list):
- Pseudo-Label Generation (Zoo3D₀):
- For each scene, given a point cloud and posed RGB–D images, a set of 2D instance masks is predicted for each image.
- A mask graph $G$ is constructed whose nodes are 2D masks and whose edges connect masks with a high view-consensus rate (above a threshold $\tau$).
- Connected components yield 3D instance masks; corresponding axis-aligned 3D bounding boxes are derived by computing the min/max coordinates of the fused 3D points.
- These class-agnostic boxes form “pseudo-ground truth” for the subsequent self-supervised training.
- Self-Supervised Detection (Zoo3D₁):
- Utilizes a modified TR3D detector with open-vocabulary and class-agnostic adaptations.
- Trained to regress box centers and sizes, and to score objectness, using only Zoo3D₀-generated pseudo-labels from unlabeled scenes.
- No direct use of human-supplied object categories or scene labels.
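The overall loop can be summarized in a short sketch. This is a minimal illustration, assuming hypothetical helpers `generate_pseudo_boxes` (the Zoo3D₀ mask-graph stage of Section 2), `detection_loss` (the objectness and box terms of Section 3), and scene fields such as `point_cloud` and `rgbd_frames`; it is not the authors' implementation.

```python
# Minimal two-stage sketch: Zoo3D_0 pseudo-labels, then Zoo3D_1 training.
# All names (generate_pseudo_boxes, detection_loss, scene fields) are hypothetical.

def train_zoo3d1(unlabeled_scenes, detector, optimizer, epochs=12):
    # Stage 1 (Zoo3D_0): class-agnostic pseudo-boxes from 2D instance masks
    # via view-consensus mask-graph clustering (Section 2).
    pseudo_boxes = {
        scene.id: generate_pseudo_boxes(scene.point_cloud, scene.rgbd_frames)
        for scene in unlabeled_scenes
    }

    # Stage 2 (Zoo3D_1): train a class-agnostic detector on pseudo-boxes only;
    # no human-supplied categories or 3D box annotations are used.
    for _ in range(epochs):
        for scene in unlabeled_scenes:
            preds = detector(scene.point_cloud)                    # objectness, centers, sizes
            loss = detection_loss(preds, pseudo_boxes[scene.id])   # focal + DIoU (Section 3)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return detector
```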
The detection head predicts, at each candidate location $x_i$,
- an objectness logit $\hat{o}_i$,
- a center offset $\Delta_i \in \mathbb{R}^3$,
- log-sizes $\hat{s}_i \in \mathbb{R}^3$, yielding box center $c_i = x_i + \Delta_i$, box size $\exp(\hat{s}_i)$, and objectness probability $\sigma(\hat{o}_i)$.
This arrangement enables scalable, annotation-free open-vocabulary 3D detection.
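A minimal PyTorch-style sketch of such a class-agnostic head is shown below. The two-layer MLP, channel width, and tensor layout are assumptions for illustration, not the actual TR3D configuration.

```python
import torch
import torch.nn as nn


class ClassAgnosticHead(nn.Module):
    """Single-scale, class-agnostic detection head sketch: per candidate
    location it predicts an objectness logit, a 3D center offset, and 3D
    log-sizes, then decodes them to boxes as described above."""

    def __init__(self, in_channels: int = 128):
        super().__init__()
        # 1 objectness logit + 3 center offsets + 3 log-sizes = 7 outputs.
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, in_channels),
            nn.ReLU(inplace=True),
            nn.Linear(in_channels, 7),
        )

    def forward(self, feats: torch.Tensor, locations: torch.Tensor):
        # feats: (N, C) features at N candidate locations; locations: (N, 3).
        out = self.mlp(feats)
        obj_logit = out[:, 0]             # objectness logit \hat{o}_i
        delta = out[:, 1:4]               # center offset \Delta_i
        log_size = out[:, 4:7]            # log box sizes \hat{s}_i

        centers = locations + delta       # c_i = x_i + \Delta_i
        sizes = log_size.exp()            # box size = exp(\hat{s}_i)
        objectness = obj_logit.sigmoid()  # \sigma(\hat{o}_i)
        return centers, sizes, objectness, obj_logit
```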
2. Pseudo-Label Construction via View-Consensus Mask Clustering
Pseudo-label generation in Zoo3D₀ proceeds by mask-graph clustering based on 2D–3D view consensus:
- Each 2D mask $m$ is associated with the set of frames in which it appears, $F(m)$.
- For each pair of masks $(m_i, m_j)$, a view-consensus rate $r(m_i, m_j)$ is computed over the frames the two masks share, $F(m_i) \cap F(m_j)$, measuring how consistently their 3D points are covered by a common mask, where $\Omega(m)$ denotes the set of all masks sharing points with $m$.
- Masks with $r(m_i, m_j) \geq \tau$ are joined; connected components in the resulting graph $G$ yield 3D point clusters (see the sketch after this list).
- Each cluster forms a 3D axis-aligned box
$$b = \Big[\min_{p \in P} p,\ \max_{p \in P} p\Big],$$
where $P$ is the set of fused 3D points in the cluster and the min/max are taken per coordinate.
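The clustering step can be sketched as follows. The consensus score here is a simple stand-in (frame-set overlap); the paper's exact view-consensus definition, the threshold value, and the data layout are assumptions, but the graph construction, connected-components step, and min/max box extraction mirror the procedure above.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations


def consensus_rate(frames_i: set, frames_j: set) -> float:
    """Illustrative view-consensus score between two masks, taken here as the
    overlap of the frame sets observing each mask (a stand-in definition)."""
    inter = len(frames_i & frames_j)
    union = len(frames_i | frames_j)
    return inter / union if union else 0.0


def cluster_masks(mask_points, mask_frames, tau=0.5):
    """mask_points: list of (M_k, 3) arrays of fused 3D points per 2D mask;
    mask_frames: list of frame-id sets F(m) per mask; tau: consensus threshold.
    Returns one axis-aligned box (min_xyz, max_xyz) per cluster."""
    n = len(mask_points)
    parent = list(range(n))                      # union-find over mask nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Join masks whose view-consensus rate reaches the threshold.
    for i, j in combinations(range(n), 2):
        if consensus_rate(mask_frames[i], mask_frames[j]) >= tau:
            union(i, j)

    # Connected components -> 3D instance clusters -> axis-aligned boxes.
    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(mask_points[i])
    boxes = []
    for pts in clusters.values():
        p = np.concatenate(pts, axis=0)
        boxes.append((p.min(axis=0), p.max(axis=0)))   # per-axis min/max corners
    return boxes
```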
The best-view selection and mask refinement use CLIP and SAM to assign open-vocabulary semantics to each box without closed-set limitations.
3. Self-Supervised Training Objectives
The Zoo3D₁ model is optimized using only the pseudo-labels produced by Zoo3D₀, employing the following loss formulations:
- Binary Focal Loss for Objectness:
$$\mathcal{L}_{\text{obj}} = -\alpha\,(1 - p_t)^{\gamma} \log p_t, \qquad p_t = \begin{cases} \sigma(\hat{o}) & \text{if } o^{*} = 1,\\ 1 - \sigma(\hat{o}) & \text{otherwise,} \end{cases}$$
where $o^{*}$ denotes the pseudo ground-truth objectness and $\alpha$, $\gamma$ are standard focal-loss hyperparameters.
- Distance-IoU (DIoU) Loss for Box Regression:
$$\mathcal{L}_{\text{DIoU}} = 1 - \mathrm{IoU}(b, b^{*}) + \frac{\rho^{2}(b, b^{*})}{d^{2}},$$
where $\rho(b, b^{*})$ is the distance between the centers of the predicted box $b$ and the pseudo ground-truth box $b^{*}$, and $d$ is the diagonal of their minimal enclosing box.
- Total Loss: the objectness and box-regression terms are summed, $\mathcal{L} = \mathcal{L}_{\text{obj}} + \mathcal{L}_{\text{DIoU}}$ (both terms are sketched in code below).
No category branching is used in the head; all boxes are regressed at a single scale (16 cm).
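Below is a minimal sketch of the two loss terms for axis-aligned 3D boxes in a [min_xyz, max_xyz] parameterization; the α/γ defaults, the box encoding, and the mean reduction are assumptions, not the paper's reported settings.

```python
import torch


def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on objectness logits; targets are pseudo ground-truth
    objectness in {0, 1}. alpha/gamma values are common defaults."""
    p = logits.sigmoid()
    p_t = torch.where(targets > 0.5, p, 1.0 - p)
    alpha_t = torch.where(targets > 0.5,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()


def diou_loss_aabb(pred, target):
    """Distance-IoU loss for axis-aligned 3D boxes given as (N, 6) tensors of
    [min_xyz, max_xyz]: 1 - IoU plus the squared center distance normalized by
    the squared diagonal of the minimal enclosing box."""
    lt = torch.max(pred[:, :3], target[:, :3])            # intersection min corner
    rb = torch.min(pred[:, 3:], target[:, 3:])            # intersection max corner
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    vol_p = (pred[:, 3:] - pred[:, :3]).clamp(min=0).prod(dim=1)
    vol_t = (target[:, 3:] - target[:, :3]).clamp(min=0).prod(dim=1)
    iou = inter / (vol_p + vol_t - inter + 1e-8)

    centers_p = (pred[:, :3] + pred[:, 3:]) / 2
    centers_t = (target[:, :3] + target[:, 3:]) / 2
    rho2 = ((centers_p - centers_t) ** 2).sum(dim=1)      # squared center distance

    enc_min = torch.min(pred[:, :3], target[:, :3])       # minimal enclosing box
    enc_max = torch.max(pred[:, 3:], target[:, 3:])
    d2 = ((enc_max - enc_min) ** 2).sum(dim=1) + 1e-8     # squared diagonal

    return (1.0 - iou + rho2 / d2).mean()
```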
4. Implementation and Training Protocol
Key implementation details of Zoo3D include:
- Backbone: Sparse 3D ResNet (FCAF3D style), features at multiple voxel scales (8, 16, 32, 64 cm).
- Detection Head: Single-scale (16 cm), two-layer linear head for objectness and regression.
- Training Data: 51 ScanNet scenes with pseudo-boxes, no human annotations.
- Input Formats: 2 cm resolution point clouds, posed/unposed RGB–D, and direct image inputs via DUSt3R.
- Optimization: AdamW, one-cycle learning rate schedule (peak 0.01), NMS IoU = 0.5 at inference, batch size as in TR3D.
- Open-Vocabulary Assignment: CLIP ViT-H/14 for label assignment, SAM 2.1 (Hiera-L) for mask refinement.
- Inference: Candidates are filtered by NMS; the semantic label is assigned via the highest mean CLIP similarity (see the sketch after this list).
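The inference-time post-processing can be sketched as follows: greedy NMS over axis-aligned boxes, then open-vocabulary label assignment from precomputed, L2-normalized CLIP embeddings. Feeding per-view crops of each box and averaging their similarities is an assumption about what the "highest-mean" similarity is taken over; the data layout and threshold are likewise illustrative.

```python
import torch


def nms_aabb(boxes, scores, iou_thr=0.5):
    """Greedy NMS over axis-aligned 3D boxes (N, 6) = [min_xyz, max_xyz]."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(int(i))
        if order.numel() == 1:
            break
        rest = order[1:]
        lt = torch.max(boxes[i, :3], boxes[rest, :3])
        rb = torch.min(boxes[i, 3:], boxes[rest, 3:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        vol_i = (boxes[i, 3:] - boxes[i, :3]).prod()
        vol_r = (boxes[rest, 3:] - boxes[rest, :3]).prod(dim=1)
        iou = inter / (vol_i + vol_r - inter + 1e-8)
        order = rest[iou <= iou_thr]                 # drop overlapping candidates
    return keep


def assign_open_vocab_labels(box_crop_embs, text_embs, class_names):
    """box_crop_embs: list over boxes of (V_i, D) CLIP image embeddings, one per
    view of the box crop; text_embs: (K, D) CLIP text embeddings for the
    vocabulary. All embeddings are assumed L2-normalized. Each box receives the
    class whose mean similarity across views is highest."""
    labels = []
    for embs in box_crop_embs:
        sims = embs @ text_embs.t()          # (V_i, K) cosine similarities
        mean_sims = sims.mean(dim=0)         # average over views
        labels.append(class_names[int(mean_sims.argmax())])
    return labels
```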
5. Quantitative Results and Ablations
Zoo3D₁ demonstrates improved open-vocabulary 3D detection over both its zero-shot predecessor Zoo3D₀ and previous open- and closed-set methods.
| Method | SN20 mAP@25 / mAP@50 | SN60 mAP@25 / mAP@50 | SN200 mAP@25 / mAP@50 |
|---|---|---|---|
| Zoo3D₀ (zero-shot) | 34.7 / 23.9 | 27.1 / 18.7 | 21.1 / 14.1 |
| Zoo3D₁ (self-supervised) | 37.2 / 26.3 | 32.0 / 20.8 | 23.5 / 15.2 |
Performance is consistent whether operating on point clouds or posed/unposed RGB–D images. Iterative retraining (i.e., producing new pseudo-labels from the refined detector and retraining) yields further, albeit diminishing, improvements (e.g., [email protected] on unposed images: DUSt3R→Zoo3D₀ 22.4, DUSt3R→Zoo3D₁ 36.1, DUSt3R→Zoo3D₂ 37.6). Prediction quality saturates as the number of input views approaches 45.
Ablation studies confirm that SAM-based mask refinement and multi-scale processing notably benefit recall and higher-IoU predictions. DUSt3R-based reconstruction is markedly superior to alternatives such as DROID-SLAM when processing unposed images.
6. Comparison to Related Self-Supervised Approaches
Zoo3D₁ contrasts with alternative self-supervised 3D detection pipelines in several respects:
- Curiosity-driven 3D Detection (Griffiths et al., 2020): This system addresses 6-DOF object detection without any explicit labels, using analysis-by-synthesis supervision with adversarial "curiosity" critics and differentiable ray tracing, and is fundamentally limited to settings where object geometry is known a priori and scene composition is relatively simple.
- Animal 3D Reconstruction (Kuang et al., 2023): Adopts a two-stage paradigm combining synthetic data supervision and multi-view consistency self-supervision for 3D digitization, focusing on reconstructing mesh, shape, and texture from images, but not performing class-agnostic, open-vocabulary detection at the scene level.
Zoo3D₁ provides a fully scalable, scene-level instance detection system that handles challenging benchmarks without requiring any 3D class, category, or box-level ground-truth data, and it operates with minimal prior geometric constraints.
7. Limitations and Future Directions
Zoo3D₁ depends on the quality of the pseudo-labels generated by Zoo3D₀; systematic errors or ambiguities in graph clustering or mask prediction may propagate during retraining. Mask refinement, multi-scale decision fusion, and accurate pose estimation in unposed settings have substantial influence on final detection quality. A plausible implication is that performance may further benefit from advances in 2D segmentation (SAM, CLIP, foundation models), occlusion reasoning, and self-supervised SLAM for image-only modalities. Iterative retraining shows diminishing returns, suggesting inherent limitations of the pseudo-label regime unless new supervision sources (e.g., richer multi-modal signals or synthetic priors) are integrated (Lemeshko et al., 25 Nov 2025).