
RGB-D Pseudo-Label Distillation

Updated 14 December 2025
  • Ensemble consensus combined with neural rendering refines pseudo-labels into robust RGB-D annotations.
  • Depth decoupling, multi-task distillation, and confidence-aware strategies enhance semantic segmentation, saliency detection, and depth estimation.
  • Quantitative evaluations report gains such as 50.7% 2D mIoU and 41.3% 3D mIoU on ScanNet v2, highlighting scalability and cross-modal robustness.

Pseudo-label distillation via RGB-D refers to techniques that automatically generate dense semantic or structural annotations for RGB-D scans by leveraging multi-modal cues, ensemble consensus, and iterative refinement. Pseudo-labels are surrogate annotations derived from models or heuristic algorithms and then distilled (systematically refined for improved accuracy and reliability) using information from both the visual and depth channels. These pipelines are foundational for scaling scene understanding, saliency detection, and depth estimation to domains where manual annotation is prohibitive due to cost, required expertise, or dataset size. The term subsumes variants such as ensemble voting fusion, neural rendering-based label projection, confidence-weighted distillation, and task-specific depth–RGB saliency consensus.

1. Ensemble Consensus and Voting Mechanisms

Pseudo-label distillation in RGB-D typically commences with ensemble construction, whereby multiple expert segmentation models, each pre-trained on distinct datasets or modalities, infer semantic labels for an RGB-D image or trajectory. For example, the LABELMAKER pipeline (Weder et al., 2023) fuses outputs from InternImage (ADE20k classes), OVSeg (CLIP-based open-vocabulary), CMX (RGB + depth, NYU40), and Mask3D (3D instance segmentation reprojected to 2D). All model outputs are mapped to a common label space (e.g., 186 WordNet synsets) via a precomputed translation table.

At each pixel $x$, pseudo-labels are formed through weighted consensus voting:

$$L_{\rm 2D}(x) = \begin{cases} \arg\max_c \sum_i v_i\,\mathbf{1}[c_i(x)=c] & \text{if } \max_c \sum_i v_i\,\mathbf{1}[c_i(x)=c] \ge \tau \\ \text{unknown} & \text{otherwise} \end{cases}$$

where $v_i$ is the vote strength per model (plus test-time augmentations) and $\tau$ is a coverage threshold. This explicit voting mechanism mitigates single-model bias, provides robustness to noisy predictions, and supports multi-domain knowledge integration.
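
The following is a minimal sketch of this weighted voting over an ensemble of per-pixel label maps; the function name, the `unknown_id` sentinel, and the default threshold are illustrative choices, not LABELMAKER's actual interface.

```python
import numpy as np

def consensus_vote(label_maps, vote_weights, num_classes, tau=1.0, unknown_id=-1):
    """Weighted per-pixel consensus over an ensemble of 2D label maps.

    label_maps:   list of (H, W) integer arrays in the common label space
    vote_weights: vote strength v_i per entry (test-time augmentations can be
                  appended as extra entries with their own weights)
    tau:          coverage threshold below which a pixel is marked unknown
    """
    h, w = label_maps[0].shape
    votes = np.zeros((num_classes, h, w), dtype=np.float32)
    rows, cols = np.indices((h, w))
    for labels, v in zip(label_maps, vote_weights):
        # accumulate v_i * 1[c_i(x) = c] for every pixel x
        np.add.at(votes, (labels, rows, cols), v)
    consensus = votes.argmax(axis=0)          # argmax_c of the weighted votes
    support = votes.max(axis=0)               # max_c of the weighted votes
    consensus[support < tau] = unknown_id     # insufficient coverage -> unknown
    return consensus
```

Test-time augmented predictions can simply be appended to `label_maps` with their own weights, so the same routine covers per-model and per-augmentation votes.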

2. 3D Lifting and Neural Rendering-Based Refinement

After initial 2D consensus, pseudo-labels are lifted to 3D using neural rendering approaches. LABELMAKER implements this using Neus-Acc (an SDF-NeRF variant), fitting a signed distance field (SDF) surface and optimizing a compact semantic head for per-class logit maps. Semantic reconstruction is enforced via a consensus loss:

$$\mathcal{L}_{\rm sem} = -\sum_{x}\sum_{c} \mathbf{1}[L_{\rm 2D}(x)=c]\,\log \hat p(c\mid x)$$

where $\hat p(c\mid x)$ is computed by ray-integral rendering over the semantic head. The joint optimization reconstructs RGB, depth, normals, and semantics, regularizing the surface via the eikonal loss. This 3D distillation suppresses per-frame noise and delivers multi-view-consistent annotations applicable for training point cloud or mesh-based models.
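
As a hedged sketch of this consensus loss, the snippet below renders per-ray class probabilities from per-sample semantic logits and volume-rendering weights, then scores them against the 2D consensus labels; the tensor layout and names are assumptions, not the Neus-Acc implementation.

```python
import torch
import torch.nn.functional as F

def semantic_consensus_loss(sample_logits, render_weights, pixel_labels, unknown_id=-1):
    """Consensus loss over a batch of rays.

    sample_logits:  (R, S, C) semantic-head logits at S samples along R rays
    render_weights: (R, S)    volume-rendering weights from the SDF model
    pixel_labels:   (R,)      long tensor of 2D consensus labels L_2D(x),
                              `unknown_id` where no consensus was reached
    """
    # ray-integral rendering of the semantic head: p_hat(c | x)
    probs = torch.softmax(sample_logits, dim=-1)                    # (R, S, C)
    rendered = (render_weights.unsqueeze(-1) * probs).sum(dim=1)    # (R, C)

    # cross-entropy against the 2D consensus, skipping 'unknown' pixels
    valid = pixel_labels != unknown_id
    if not valid.any():
        return rendered.new_zeros(())
    log_p = torch.log(rendered[valid].clamp_min(1e-8))
    return F.nll_loss(log_p, pixel_labels[valid])
```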

3. Depth Decoupling and Multi-Task Distillation in Saliency Detection

In RGB-D saliency detection frameworks (e.g., DS-Net (Wang et al., 2022)), pseudo-label distillation is achieved by depth decoupling and teacher–student consistency learning. The Depth-Decoupling CNN (DDCNN) learns parallel branches for depth estimation and saliency prediction. Pseudo-depth maps are generated for unlabeled RGB images via the frozen depth branch:

$$L_{\rm depth}(x,d) = \| P_d(x) - d \|_2^2, \qquad \hat d = P_d(y)$$

where the first term trains the depth branch $P_d$ on labeled RGB–depth pairs $(x, d)$ and the second produces the pseudo-depth map $\hat d$ for an unlabeled RGB image $y$.

Unlabeled RGB and pseudo-depth pairs are fed to student and teacher networks, with saliency and intermediate attention consistency losses imposed across outputs:

$$L_{c}(y) = L_{MSE}(S_s, S_t) + \gamma \sum_{\ell=1}^{4} L_{MSE}(A^{s,\ell}, A^{t,\ell})$$

where $S_s$ and $S_t$ are the student and teacher saliency predictions and $A^{s,\ell}$, $A^{t,\ell}$ are the corresponding attention maps at stage $\ell$.

Task-specific modules such as Depth-Induced Fusion (DIM) and attention-consistent reconstruction further promote feature disentanglement, making the pseudo-labels informative for both the depth and saliency pathways.
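
As a minimal sketch of this teacher–student consistency term, the snippet below assumes each network exposes a final saliency map and four intermediate attention maps; the dictionary interface and the weight $\gamma$ are illustrative, not DS-Net's actual code.

```python
import torch
import torch.nn.functional as F

def consistency_loss(student_out, teacher_out, gamma=0.5):
    """Saliency + intermediate-attention consistency L_c(y).

    student_out / teacher_out: dicts with a final saliency map 'sal' (B, 1, H, W)
    and a list of four attention maps 'attn' from intermediate stages.
    """
    # saliency consistency between student and (gradient-stopped) teacher
    loss = F.mse_loss(student_out["sal"], teacher_out["sal"].detach())
    # attention consistency across the four intermediate stages
    for a_s, a_t in zip(student_out["attn"], teacher_out["attn"]):
        loss = loss + gamma * F.mse_loss(a_s, a_t.detach())
    return loss
```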

4. Unsupervised Refinement via Depth-Disentangled Saliency Update

Entirely unsupervised saliency pipelines (e.g., (Ji et al., 2022)) employ iterative pseudo-label refinement via depth-guided branch separation and attentive weighting. Initial pseudo-labels, such as CDCP-based masks, are first generated for each RGB-D image. A depth encoder extracts features, which are then split into saliency-guided and non-saliency branches using spatial guidance masks derived from smoothed predictions:

$$A_{\rm Sal} = \Psi_{\max}\big(F_G(Sal_{\rm pred}, k),\, Sal_{\rm pred}\big)$$

$$A_{\rm Non} = \Psi_{\max}\big(F_G(1 - Sal_{\rm pred}, k),\, 1 - Sal_{\rm pred}\big)$$

Updated pseudo-labels are constructed by element-wise addition and subtraction:

$$S_{\rm temp}^{i,j} = Sal_{\rm pred}^{i,j} + D_{\rm Sal}^{i,j} - D_{\rm NonSal}^{i,j}$$

Normalization and conditional random field post-processing yield the next pseudo-label generation. Attentive training alternates between uniform and loss-adaptive sample weights, increasing robustness to noisy supervision.
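
The update step can be sketched as below, reading $F_G$ as Gaussian smoothing with kernel size $k$ and $\Psi_{\max}$ as an element-wise maximum, consistent with the smoothed-prediction description above; the single-channel depth response and the normalization details are assumptions, and CRF post-processing is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def update_pseudo_label(sal_pred, depth_response, sigma=5.0):
    """One depth-disentangled pseudo-label update step.

    sal_pred:       (H, W) current saliency prediction in [0, 1]
    depth_response: (H, W) decoded depth-branch response (assumed single channel)
    """
    # spatial guidance masks from smoothed predictions (F_G ~ Gaussian, Psi_max ~ max)
    a_sal = np.maximum(gaussian_filter(sal_pred, sigma), sal_pred)
    a_non = np.maximum(gaussian_filter(1.0 - sal_pred, sigma), 1.0 - sal_pred)

    # saliency-guided and non-saliency depth branches
    d_sal = a_sal * depth_response
    d_nonsal = a_non * depth_response

    # element-wise addition and subtraction, then normalize to [0, 1]
    s_temp = sal_pred + d_sal - d_nonsal
    s_temp = (s_temp - s_temp.min()) / (s_temp.max() - s_temp.min() + 1e-8)
    return s_temp  # CRF post-processing would follow here
```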

5. Confidence-Aware Distillation for Cross-Modal Tasks

In domains where RGB-derived pseudo-labels are transferred to a distinct input modality (e.g., thermal, depth) with limited labeled data, confidence-aware distillation controls pseudo-label reliability. The MonoTher-Depth pipeline (Zuo et al., 2025) transfers DepthAnything's RGB pseudo-depth to thermal frames via geometric warping, then predicts a pixel-wise confidence $\alpha_i$ with a U-Net classifier that takes feature similarity and residual magnitude as input:

$$\alpha_i = f_{\rm conf}\big(I^{\rm RGB}, \mathbf{F}_r, \mathbf{F}_t, \hat D_r, \breve D_{tr}, \dots\big)_i, \qquad \alpha_i \in (0,1)$$

The distillation loss is weighted by these confidences, with outlier exclusion and feature masking:

$$\mathcal{L}_{\rm distill} = \frac{1}{M} \sum_{i \in \Omega} \alpha_i\, \big|\hat D^i_r - \breve D^i_{tr}\big|$$

This mechanism limits negative transfer effects and preserves accuracy for self-supervised adaptation in cross-modal scenarios.
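
A hedged sketch of this confidence-weighted loss is given below; the outlier threshold and masking interface are illustrative, not MonoTher-Depth's exact implementation.

```python
import torch

def confidence_weighted_distill(d_thermal, d_warped, alpha, valid_mask, outlier_thresh=None):
    """Confidence-weighted distillation loss.

    d_thermal:  thermal-branch depth prediction (D_hat_r)
    d_warped:   RGB pseudo-depth warped into the thermal frame (D_breve_tr)
    alpha:      per-pixel confidence in (0, 1) from the confidence head
    valid_mask: bool mask Omega of pixels with a valid warped pseudo-label
    """
    residual = (d_thermal - d_warped).abs()
    mask = valid_mask.float()
    if outlier_thresh is not None:
        mask = mask * (residual < outlier_thresh).float()   # exclude gross outliers
    m = mask.sum().clamp_min(1.0)                           # number of pixels in Omega
    return (alpha * residual * mask).sum() / m
```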

6. Quantitative Evaluation and Practical Considerations

Pseudo-label distillation via RGB-D demonstrates consistent quantitative gains and robust performance in complex annotation regimes. LABELMAKER (Weder et al., 2023) achieves 50.7% 2D mIoU and 41.3% 3D mIoU on ScanNet v2 in a fully automatic setting, outperforming human-labeled baselines by 5.7 (2D) and 4.0 (3D) mIoU points, while improving long-tail class coverage over 186 synsets by 1–2 mIoU. DS-Net (Wang et al., 2022) reports $S_m = 0.950$, $F_\beta^{\max} = 0.965$, and MAE $= 0.024$ across seven RGB-D saliency benchmarks, with ablations confirming the necessity of multi-branch depth and attention consistency.

Practical deployment considerations include label-space harmonization (translation lookup tables), the coverage–noise trade-off controlled by the $\tau$ threshold, the computational cost of neural rendering, and downstream semi-supervised student training. Cross-modal systems require calibrated intrinsics/extrinsics and careful feature alignment for warping. Iterative refinement and sample weighting are vital in unsupervised scenarios.
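
As an illustration of label-space harmonization, the sketch below maps one expert's class ids into a common label space through a lookup table before voting; the table contents and names are invented for the example, not LABELMAKER's actual mapping.

```python
import numpy as np

# Hypothetical fragment of a translation table: expert class id -> common synset id.
ADE20K_TO_COMMON = {5: 12, 8: 12, 19: 47}
UNMAPPED = -1

def translate_labels(label_map, table, unmapped=UNMAPPED):
    """Map an expert's (H, W) label map into the common label space via a lookup table."""
    size = max(int(label_map.max()), max(table)) + 1
    lut = np.full(size, unmapped, dtype=np.int64)
    for src, dst in table.items():
        lut[src] = dst
    return lut[label_map]   # vectorized per-pixel lookup; unmapped classes become -1
```

Classes left unmapped receive the sentinel value and can be masked out so that they contribute no vote in the consensus step.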

7. Extensions and Generalization Potential

Label distillation frameworks for RGB-D generalize to tasks such as cross-modality (thermal, stereo, audio–visual), unsupervised object segmentation, and camouflaged or co-saliency detection using disentangled update and attentive strategies. Prospective enhancements include continuous embedding heads for open-vocabulary distillation (Weder et al., 2023), integration of edge-aware or contrastive cues, and uncertainty masking via MC-Dropout for "high-noise" pixel suppression (Ji et al., 2022). The iterative, modular pipelines inherent to RGB-D pseudo-label distillation form a substrate for scaling semantic scene understanding to vast, weakly-labeled environments.
