- The paper shows that scaling Masked Autoencoders up to 22B parameters substantially improves self-supervised video representations on 4D tasks such as camera pose estimation, depth estimation, and tracking.
- It trains Vision Transformers with MAE on 170 million video clips to capture spatio-temporal dynamics.
- Empirical results indicate that self-supervised video models outperform traditional image-based and language-guided approaches in physical representation tasks.
Evaluating Scaled 4D Representations in Self-Supervised Learning
The paper "Scaling 4D Representations" addresses the challenge of enhancing self-supervised learning approaches for video data, particularly in the context of vision tasks that extend beyond semantic interpretations to those that are spatio-temporal, i.e., "4D". The authors focus on tasks such as camera pose estimation, depth estimation, and point and object tracking, where temporal and spatial reasoning are pivotal. The core of the research lies in determining whether large-scale video data, processed via Masked Autoencoders (MAE) based on transformer architectures, can effectively scale performance in these complex tasks without requiring semantic labels.
Methodology
The research investigates a series of Vision Transformer (ViT) models scaling from 20M to 22B parameters, among the largest self-supervised video models reported to date. The framework applies MAE: the video input is partially masked, compelling the model to reconstruct the original sequence from the visible portion. Training is performed on a large dataset of 170 million video clips. The paper aims to counter the notion, reflecting prior skepticism in the field, that MAE does not scale effectively.
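To make the training objective concrete, the sketch below shows a minimal video-MAE step in PyTorch: patches are randomly split into visible and masked sets, only the visible tokens are encoded, and the reconstruction loss is applied to the masked ones. The layer sizes, the 90% mask ratio, and the 2x16x16 tubelet dimensions are illustrative assumptions rather than the paper's actual configuration, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class ToyVideoMAE(nn.Module):
    """Minimal masked-autoencoding step over pre-extracted video patches.

    Positional embeddings and restoring the original token order before
    decoding are omitted for brevity; a real implementation needs both.
    """
    def __init__(self, patch_dim=2 * 16 * 16 * 3, embed_dim=256, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)   # tokenize visible patches
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.to_pixels = nn.Linear(embed_dim, patch_dim)     # reconstruct raw patch values

    def forward(self, patches):                              # patches: (B, N, patch_dim)
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))

        # Randomly split tokens into a small visible set and a large masked set.
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep_idx, drop_idx = perm[:, :n_keep], perm[:, n_keep:]
        visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # The encoder only ever sees the visible tokens, which keeps it cheap.
        latent = self.encoder(self.patch_embed(visible))

        # The decoder receives encoded visible tokens plus learned mask tokens.
        mask_tokens = self.mask_token.expand(B, N - n_keep, -1)
        recon = self.to_pixels(self.decoder(torch.cat([latent, mask_tokens], dim=1)))

        # The reconstruction loss is computed only on the masked patches.
        target = torch.gather(patches, 1, drop_idx.unsqueeze(-1).expand(-1, -1, D))
        return ((recon[:, n_keep:] - target) ** 2).mean()

# Usage: a batch of 2 clips, each flattened into 1024 tubelet patches.
loss = ToyVideoMAE()(torch.randn(2, 1024, 2 * 16 * 16 * 3))
```

The high mask ratio is the key design choice: because the encoder processes only the small visible subset, most of the compute scales with roughly 10% of the tokens, which is what makes pretraining very large ViTs on video tractable.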
Evaluation and Results
The evaluation employs a suite of tasks that demand 4D reasoning capabilities:
- Something-Something V2 Dataset for Action Classification: Although commonly used as a benchmark for temporal understanding, it serves here to test temporal sensitivity more than semantic precision.
- Camera Pose Estimation using RealEstate10k: This task tests the prediction of 6DoF relative poses between frames, essential in navigation and augmented-reality applications (a short sketch follows the list).
- Point Tracking from the Perception Test: This measures a model's ability to continuously track a point through sequences, indicative of its dynamic scene understanding.
- Object Tracking in the Waymo Open Dataset: Evaluates the model's ability to follow objects through occlusions and diverse environmental conditions.
- Depth Estimation in ScanNet: Provides a testbed for spatial reasoning and monocular depth prediction.
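As a concrete picture of the camera-pose target, the sketch below derives a 6DoF relative pose between two frames from their 4x4 camera-to-world matrices, encoding rotation as an axis-angle vector; the function name and the use of SciPy's rotation utilities are illustrative choices, not details taken from the paper or the RealEstate10k tooling.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_pose_6dof(cam_to_world_a: np.ndarray, cam_to_world_b: np.ndarray) -> np.ndarray:
    """Return [tx, ty, tz, rx, ry, rz]: the transform taking frame A's camera to frame B's.

    Both inputs are 4x4 camera-to-world matrices; the rotational part of the
    relative transform is encoded as an axis-angle (rotation) vector.
    """
    # Relative transform expressed in frame A's camera coordinate system.
    rel = np.linalg.inv(cam_to_world_a) @ cam_to_world_b
    translation = rel[:3, 3]
    rotvec = Rotation.from_matrix(rel[:3, :3]).as_rotvec()
    return np.concatenate([translation, rotvec])

# Usage: identical poses yield the zero 6-vector (no relative motion).
eye = np.eye(4)
assert np.allclose(relative_pose_6dof(eye, eye), np.zeros(6))
```

A model evaluated on this task regresses such 6-vectors (or an equivalent rotation parameterization) from pairs or sequences of frames.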
The paper establishes that video-centric models, particularly those not guided by language labels, surpass traditional image models (e.g., DINOv2, SigLIP) in these settings. Interestingly, while language-supervised models such as VideoPrism shine on semantic tasks, they falter on the physical representation tasks emphasized in this paper. Through frozen-feature evaluations and finetuning, the authors show that performance improves consistently as model size grows, particularly in the range of 2B to 22B parameters, evidencing robust scaling.
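The frozen-feature protocol mentioned above can be pictured with a short PyTorch sketch: the pretrained backbone's weights are kept fixed and only a lightweight readout head is trained on top. The backbone interface, the head architecture, the mean pooling, and the 6-dimensional pose target are assumptions made for illustration, not the paper's exact readout design.

```python
import torch
import torch.nn as nn

def build_frozen_readout(backbone: nn.Module, feat_dim: int, out_dim: int = 6):
    """Freeze a pretrained video backbone and attach a small trainable readout head.

    out_dim=6 matches a 6DoF relative-pose regression target; other tasks
    (depth, tracking) would swap in a different head and loss.
    """
    for p in backbone.parameters():
        p.requires_grad = False              # backbone features stay fixed
    backbone.eval()

    readout = nn.Sequential(                 # lightweight head trained from scratch
        nn.LayerNorm(feat_dim),
        nn.Linear(feat_dim, feat_dim),
        nn.GELU(),
        nn.Linear(feat_dim, out_dim),
    )
    optimizer = torch.optim.AdamW(readout.parameters(), lr=1e-3)
    return readout, optimizer

def readout_step(backbone, readout, optimizer, clips, targets):
    """One training step: features come from the frozen backbone under no_grad."""
    with torch.no_grad():
        feats = backbone(clips)              # assumed shape: (B, num_tokens, feat_dim)
    pred = readout(feats.mean(dim=1))        # pool tokens, then regress the target vector
    loss = nn.functional.mse_loss(pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The contrast with finetuning is simply whether the backbone parameters are left trainable; the frozen variant isolates what the pretrained representation already encodes.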
Contributions and Implications
The paper makes several contributions:
- It challenges the prevailing assumption that MAE does not scale efficiently by providing empirical evidence to the contrary with trained models up to 22B parameters.
- It introduces the 4DS model family as a new resource for researchers focused on video representation learning.
- It positions video self-supervised learning as a significant frontier, with the potential to shape future AI applications in navigation, robotics, and video analytics.
From a theoretical perspective, these findings raise the question of whether the field should re-evaluate the centrality of LLMs in settings where spatio-temporal understanding is key. Practically, the implications concern enhancing autonomous systems' ability to interpret and react to real-time environments without exhaustive semantic labeling.
Future Directions
Future work may iterate on the simple MAE framework to incorporate more advanced masking strategies, noise robustness, and training efficiencies that leverage vast video data even further. Deeper exploration of optimal encoder-decoder configurations, or of external memory mechanisms, could also add depth to this promising line of research. Such advances could close the performance gap between self-supervised video learning and language- and semantics-centered approaches, marking a transition toward genuinely intuitive machine vision.