Conjectured causes of SpatialVID underperformance relative to DL3DV

Ascertain whether the inferior performance of E-RayZer when trained on the SpatialVID video dataset, relative to training on DL3DV, is caused by the noisy nature of in-the-wild data, by the use of coarse dynamic-ratio labels to select training subsets, or by the prevalence of simple or near-static camera motions in SpatialVID. Establish the extent to which these factors, individually and jointly, account for the observed performance gap, in order to inform data curation strategies for self-supervised 3D visual pre-training.
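
One way to separate the three conjectured factors is a factorial ablation: build matched-size training subsets that toggle each factor on and off, train a model on each, and compare downstream performance. The sketch below illustrates such a subset-construction step. It assumes hypothetical per-sequence metadata fields (dynamic_ratio, motion_magnitude, quality_score) and placeholder thresholds; it is not the authors' pipeline.

```python
from dataclasses import dataclass
from itertools import product
import random

# All field names below (dynamic_ratio, motion_magnitude, quality_score) are
# hypothetical per-sequence metadata; the paper does not define this interface.
@dataclass
class SequenceMeta:
    seq_id: str
    dynamic_ratio: float     # coarse dynamic-ratio label shipped with SpatialVID
    motion_magnitude: float  # e.g. total camera translation estimated from poses
    quality_score: float     # proxy for in-the-wild noise (blur, exposure, etc.)

def build_factorial_subsets(pool, subset_size, seed=0):
    """Build eight equal-size training subsets, one per on/off combination of
    the three controls, so each factor's effect can be read off individually
    and jointly (a 2x2x2 factorial design). Thresholds are placeholders."""
    rng = random.Random(seed)
    subsets = {}
    for clean, strict_dynamic, diverse_motion in product([False, True], repeat=3):
        kept = [
            s for s in pool
            if (not clean or s.quality_score >= 0.7)                # drop noisy clips
            and (not strict_dynamic or s.dynamic_ratio <= 0.1)      # stricter dynamic filter
            and (not diverse_motion or s.motion_magnitude >= 1.0)   # drop near-static motion
        ]
        rng.shuffle(kept)
        subsets[(clean, strict_dynamic, diverse_motion)] = kept[:subset_size]
    return subsets
```

Comparing the eight resulting models would attribute the performance gap to individual factors (main effects) and to their combinations (interactions), while holding subset size fixed.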

Background

In the supplementary analysis of training data, the authors compare E-RayZer models trained on different datasets, including RealEstate10K, SpatialVID, DL3DV, and a multi-dataset mixture. Although SpatialVID contains more sequences, the model trained on it underperforms the model trained on the smaller, curated DL3DV dataset.

The authors explicitly conjecture that the performance gap is due to characteristics of SpatialVID: noisy in-the-wild videos, selection using coarse dynamic-ratio labels, and simple or near-static camera motions. They suggest that data quality and diversity, rather than sheer quantity, drive scalability in self-supervised 3D vision models, motivating a rigorous investigation of these factors.
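
Testing the near-static-motion conjecture quantitatively requires a per-sequence measure of camera motion. The following is a minimal sketch, assuming each sequence comes with per-frame 4x4 camera-to-world pose matrices; the two statistics (translation path length and maximum rotation from the first frame) are illustrative choices, not a metric from the paper.

```python
import numpy as np

def camera_motion_stats(c2w_poses):
    """c2w_poses: (N, 4, 4) array of camera-to-world matrices for one sequence.
    Returns (path_length, max_rotation_deg): total distance travelled by the
    camera center and the largest angular deviation from the first frame."""
    centers = c2w_poses[:, :3, 3]
    path_length = float(np.linalg.norm(np.diff(centers, axis=0), axis=1).sum())

    R0 = c2w_poses[0, :3, :3]
    max_angle = 0.0
    for R in c2w_poses[1:, :3, :3]:
        rel = R0.T @ R  # rotation of this frame relative to the first frame
        cos_theta = np.clip((np.trace(rel) - 1.0) / 2.0, -1.0, 1.0)
        max_angle = max(max_angle, float(np.degrees(np.arccos(cos_theta))))
    return path_length, max_angle
```

Sequences with both a short path length and a small rotation range could then be flagged as near-static, and their share compared between SpatialVID and DL3DV to see how much of the gap the motion-diversity factor plausibly explains.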

References

We conjecture that this gap stems from the noisy nature of in-the-wild data: SpatialVID sequences originate primarily from internet videos, and our training subsets are selected using their coarse dynamic-ratio labels. Also, SpatialVID often features simple or near-static camera motions.

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training (2512.10950 - Zhao et al., 11 Dec 2025) in Supplementary, Section 7: Further Analysis of Training Data