Conjectured causes of SpatialVID underperformance relative to DL3DV
Ascertain whether the inferior performance of E-RayZer when trained on the SpatialVID video dataset, compared to training on DL3DV, is caused by three conjectured factors: the noisy nature of in-the-wild data, the use of coarse dynamic-ratio labels for selecting training subsets, and the prevalence of simple or near-static camera motions in SpatialVID. Establish the extent to which these factors, individually and jointly, account for the observed performance gap, in order to inform data curation strategies for self-supervised 3D visual pre-training.
References
We conjecture that this gap stems from the noisy nature of in-the-wild data: SpatialVID sequences originate primarily from internet videos, and our training subsets are selected using their coarse dynamic-ratio labels. Also, SpatialVID often features simple or near-static camera motions.