Limited Additional Knowledge from Naive Self-Training with Pseudo Labels
Ascertain whether naive self-training for monocular depth estimation, implemented by directly combining labeled images and pseudo-labeled unlabeled images for joint training, yields only limited additional visual knowledge when sufficient labeled data and a strong pre-trained encoder are already available, and therefore fails to improve over training solely on labeled images.
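
For concreteness, the "naive self-training" recipe referred to here can be sketched as: pseudo-label the unlabeled images with a teacher trained on the labeled set, then jointly train a student on the union of both sets with no distinction between real and pseudo labels. The following is a minimal PyTorch sketch, not the authors' implementation; the `pseudo_label` and `naive_self_training` helpers, the optimizer settings, and the generic `loss_fn` are illustrative assumptions.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

# Assumed components for illustration: `teacher` is a depth model already
# trained on the labeled set, `student` is the model being self-trained,
# `labeled_set` yields (image, depth) pairs, and `unlabeled_set` yields images.

@torch.no_grad()
def pseudo_label(teacher, unlabeled_set):
    """Annotate every unlabeled image with the teacher's depth prediction."""
    teacher.eval()
    pairs = []
    for image in unlabeled_set:
        pairs.append((image, teacher(image.unsqueeze(0)).squeeze(0)))
    return pairs  # a plain list works as a map-style dataset

def naive_self_training(student, labeled_set, pseudo_set, loss_fn, epochs=10):
    """Jointly train on real and pseudo labels, treating both sources
    identically -- the 'naive' recipe the paper found unhelpful."""
    loader = DataLoader(ConcatDataset([labeled_set, pseudo_set]),
                        batch_size=16, shuffle=True)
    opt = torch.optim.AdamW(student.parameters(), lr=5e-6)
    for _ in range(epochs):
        for image, depth in loader:
            opt.zero_grad()
            loss = loss_fn(student(image), depth)
            loss.backward()
            opt.step()
    return student
```

The key property of this baseline is that pseudo-labeled samples are mixed into the same loader and optimized with the same loss as ground-truth samples, so the student receives no signal beyond what the teacher already encodes; this is the setting in which the paper reports no gain over labeled-only training.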
References
In our preliminary attempts, directly combining labeled and pseudo labeled images failed to improve the baseline of solely using labeled images. We conjecture that, the additional knowledge acquired in such a naive self-teaching manner is rather limited.
— Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
(arXiv:2401.10891, Yang et al., 19 Jan 2024), Section 1 (Introduction)