BEEP3D: End-to-End Pseudo-mask Generation
- The paper introduces a novel student–teacher framework that replaces multi-stage pseudo-labeling with an efficient end-to-end process for generating pseudo-masks.
- It leverages a center-based query refinement approach to improve spatial accuracy in overlapping 3D bounding regions.
- The method incorporates dual consistency losses to align teacher and student features, achieving near fully-supervised performance on benchmark tests.
BEEP3D (Box-supervised End-to-End Pseudo-mask generation) is an advanced framework for 3D instance segmentation using only 3D bounding box annotations, explicitly addressing the ambiguity of overlapping box regions and the inefficiency of multi-stage pseudo-labeling pipelines. The framework adopts a unified student–teacher architecture in which pseudo-mask generation and model learning are optimized in a single end-to-end process. Its technical core is an instance center–based query refinement for the teacher model together with two novel consistency losses, which jointly ensure accurate point-to-instance assignments while maintaining computational efficiency (Yoo et al., 14 Oct 2025).
1. Unified Student–Teacher Framework
BEEP3D replaces traditional two-stage pseudo-labeler paradigms with a student–teacher framework in which both student and teacher share the same underlying transformer-based 3D segmentation model structure (e.g., MAFT backbone). The teacher generates dynamic pseudo-masks for the training batch; the student learns from these pseudo-labels as well as the original box annotations. At each optimization step, the teacher model parameters $\theta_T$ are updated via an Exponential Moving Average (EMA) of the student parameters $\theta_S$:

$\theta_T \leftarrow \alpha\, \theta_T + (1 - \alpha)\, \theta_S$

where $\alpha$ is the EMA decay rate.
The teacher’s outputs directly supervise the student, facilitating online refinement of pseudo-masks and enabling an end-to-end training loop.
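The EMA step can be sketched minimally over a flat parameter dictionary; the dictionary layout and the `ema_update` name are illustrative, not from the paper:

```python
def ema_update(teacher, student, alpha=0.999):
    """In-place EMA: theta_T <- alpha * theta_T + (1 - alpha) * theta_S."""
    for name, s_val in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * s_val
    return teacher

# One update step with alpha = 0.9 moves the teacher 10% toward the student.
student = {"w": 1.0}
teacher = {"w": 0.0}
ema_update(teacher, student, alpha=0.9)  # teacher["w"] is now ~0.1
```

In practice the same update would be applied tensor-by-tensor over the student's state dict after every optimizer step, with no gradients flowing into the teacher.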
2. Instance Center–Based Query Refinement
BEEP3D introduces an instance center–conditioned query generator that enhances the spatial accuracy of mask proposals, especially in overlapping bounding-box regions where point-to-instance assignments are ambiguous.
- Let $P = \{p_i\}_{i=1}^{M}$ denote the sampled query points (via Farthest Point Sampling).
- Let $C = \{c_k\}_{k=1}^{K}$ denote the location vectors of the $K$ instance centers (from the 3D boxes).
The initial teacher queries are constructed by an attention operation over the instance centers:

$Q^{(0)} = \mathrm{softmax}\!\left( \frac{P C^\top}{\sqrt{d}} \right) C$
At each decoder layer $t$, queries are recursively re-centered by again attending to the instance centers:

$Q^{(t)} = \mathrm{softmax}\!\left( \frac{Q^{(t-1)} C^\top}{\sqrt{d}} \right) C$
This recursive centering focuses the pseudo-mask generation on box-internal regions, thereby improving precision for scenes with multiple, tightly packed objects.
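The recursive centering can be sketched as a single NumPy cross-attention step; the dimensions, the scaling factor, and the absence of learned projections are simplifying assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recenter_queries(queries, centers):
    """One refinement step: each query attends to the K instance centers."""
    d = queries.shape[-1]
    attn = softmax(queries @ centers.T / np.sqrt(d), axis=-1)  # (M, K) weights
    return attn @ centers  # (M, d): queries pulled toward attended centers

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))   # M = 8 query embeddings
C = rng.normal(size=(3, 16))   # K = 3 instance-center embeddings
Q_next = recenter_queries(Q, C)
```

Applying this step once per decoder layer biases each query toward box-internal regions, which is the behavior the recursive centering is meant to achieve.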
3. Consistency Losses for Representation Alignment
To ensure that both semantic and geometric information encoded by the teacher is efficiently transferred to the student, BEEP3D employs two novel consistency losses:
a) Query Consistency Loss
The L1 distance between the content queries (after the final decoder layer $L$) of the student and teacher:

$\mathcal{L}_q = \sum_{k=1}^{K} \left\| q_k^{S,(L)} - q_k^{T,(L)} \right\|_1$
This loss explicitly pushes the student’s instance embeddings to align with the instance center–refined teacher queries.
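A toy NumPy version of the query consistency term, assuming student and teacher queries are already matched index-to-index:

```python
import numpy as np

def query_consistency_loss(q_student, q_teacher):
    """Sum of L1 distances between matched student/teacher content queries."""
    return np.abs(q_student - q_teacher).sum()

qs = np.array([[1.0, 2.0], [0.0, 1.0]])  # student queries, K = 2
qt = np.array([[1.5, 2.0], [0.0, 0.0]])  # teacher queries
print(query_consistency_loss(qs, qt))  # 1.5
```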
b) Masked-Feature Consistency Loss
Instance-aware features are aggregated for each instance using both the teacher’s assignation (via the union of ground-truth and pseudo-masks) and the student’s predicted mask:
$F'_k^T = \frac{1}{n_k^T} \sum_{j=1}^N \left( m_{kj}^l \cup \hat{m}_{kj}^u \right) f_j^T$

$F'_k^S = \frac{1}{n_k^S} \sum_{j=1}^N \hat{m}_{kj} f_j^S$
The loss is then given by an L2 distance:
$\mathcal{L}_f = \sum_{k=1}^K \| F'_k^T - F'_k^S \|_2^2$
Combined, these mechanisms enforce that both query and downstream features in the student network match those of the teacher, refining both mask predictions and instance semantics.
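The masked-feature term can be sketched with mean-pooled per-instance features; the binary mask matrices and the helper name are illustrative assumptions:

```python
import numpy as np

def masked_feature_loss(masks_t, feats_t, masks_s, feats_s, eps=1e-8):
    """Squared L2 distance between per-instance mean-pooled features.

    masks_*: (K, N) point-to-instance assignments; feats_*: (N, d) point features.
    """
    Ft = (masks_t @ feats_t) / (masks_t.sum(axis=1, keepdims=True) + eps)
    Fs = (masks_s @ feats_s) / (masks_s.sum(axis=1, keepdims=True) + eps)
    return ((Ft - Fs) ** 2).sum()

rng = np.random.default_rng(1)
F = rng.normal(size=(10, 4))                   # features for N = 10 points
M = (rng.random((3, 10)) > 0.5).astype(float)  # K = 3 binary instance masks
loss_same = masked_feature_loss(M, F, M, F)    # identical inputs -> 0
```

The loss vanishes exactly when teacher and student pool the same features under the same masks, and grows as either the features or the assignments diverge.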
4. Pseudo-mask Generation and Assignment
Pseudo-masks are generated in the teacher network from the final-layer queries $q_k^{(L)}$ and the learned point-wise features $f_j$:

$\hat{m}_{kj} = \sigma\!\left( q_k^{(L)} \cdot f_j \right)$

where $\sigma$ denotes the sigmoid function.
Queries are matched to ground-truth instances via the Hungarian algorithm, and each point in an overlapping region is then assigned to the matched query with the highest similarity. This systematic procedure enables robust mask generation even in challenging scenarios with heavily overlapping boxes.
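Per-point assignment in overlapping regions can be sketched as an argmax over sigmoid mask probabilities; the dot-product similarity used here is an assumption in line with standard mask-transformer heads:

```python
import numpy as np

def assign_points(queries, point_feats):
    """Hard pseudo-masks: each point goes to the query with highest similarity.

    queries: (K, d) final-layer instance queries; point_feats: (N, d).
    """
    logits = queries @ point_feats.T          # (K, N) similarity scores
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid mask probabilities
    owner = probs.argmax(axis=0)              # winning query per point
    masks = np.zeros_like(probs)
    masks[owner, np.arange(point_feats.shape[0])] = 1.0
    return masks

rng = np.random.default_rng(2)
Q = rng.normal(size=(4, 8))    # K = 4 queries
P = rng.normal(size=(20, 8))   # N = 20 points
masks = assign_points(Q, P)    # every point belongs to exactly one query
```

The argmax guarantees a unique owner per point, which is exactly what resolves ambiguity where several boxes overlap.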
5. Training Objective
The overall training objective for BEEP3D combines the segmentation losses with the two consistency losses:

$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \lambda_q \mathcal{L}_q + \lambda_f \mathcal{L}_f$

$\mathcal{L}_{\mathrm{seg}}$ is composed of standard binary cross-entropy, Dice, and classification loss components, while $\lambda_q$ and $\lambda_f$ weight the contributions of the two consistency losses above.
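Combining the terms is a plain weighted sum; the default weight values below are placeholders, not the paper's settings:

```python
def total_loss(l_seg, l_q, l_f, lambda_q=0.5, lambda_f=0.5):
    """L = L_seg + lambda_q * L_q + lambda_f * L_f."""
    return l_seg + lambda_q * l_q + lambda_f * l_f

loss = total_loss(1.0, 0.2, 0.4)  # ~1.3 with the placeholder weights
```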
6. Empirical Performance and Computational Efficiency
On standard 3D instance segmentation benchmarks such as ScanNetV2 and S3DIS, BEEP3D shows superior accuracy compared to other state-of-the-art weakly supervised methods, achieving up to 98% of the fully supervised baseline’s performance for stringent metrics such as AP@50. The method achieves these results with reduced computational overhead relative to two-stage pipelines, because the teacher serves as an online pseudo-labeler (via EMA) and does not require separate pre-training or inference passes. This feature makes BEEP3D empirically favorable over alternatives such as BSNet, Box2Mask, and GaPro (Yoo et al., 14 Oct 2025).
7. Significance and Implications
BEEP3D’s design principle—integrating instance center–driven queries with dual consistency losses in an EMA-updated teacher–student paradigm—addresses both spatial ambiguity in weak box-level annotations and inefficiency in pseudo-mask generation. This results in efficient, end-to-end, and scalable 3D instance segmentation under weak supervision, with empirical evidence suggesting high potential for real-world deployment in domains where dense labels are impractical. The approach also offers a blueprint for future research in joint pseudo-label optimization, feature alignment, and online consistency-driven learning strategies for other weakly supervised vision tasks.