Identifying optimal pretext tasks and efficient video processing for self-supervised video learning
Identify optimal self-supervised pretext tasks for video representation learning and develop efficient video processing methods that enable effective and scalable learning from videos.
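As one concrete illustration of the kind of pretext task this problem concerns, the sketch below implements temporal order verification (in the spirit of shuffle-and-learn style objectives): a clip is either kept in order or frame-shuffled, and a model would be trained to predict which. The function name, shapes, and shuffle probability are illustrative assumptions, not a method from the cited paper.

```python
import numpy as np

def make_order_prediction_sample(clip, rng, shuffle_prob=0.5):
    """Build one training sample for a temporal-order pretext task.

    clip: array of shape (T, H, W, C) -- frames in their true order.
    Returns (frames, label): label = 1 if temporal order is intact,
    0 if the frames were shuffled. A model trained to predict this
    binary target must pick up on temporal structure in the video.
    """
    frames = clip.copy()
    if rng.random() < shuffle_prob:
        perm = rng.permutation(len(frames))
        # Re-draw if the permutation happens to be the identity,
        # so label 0 always corresponds to a genuinely shuffled clip.
        while np.array_equal(perm, np.arange(len(frames))):
            perm = rng.permutation(len(frames))
        return frames[perm], 0
    return frames, 1

# Hypothetical usage: build a small batch of (clip, label) pairs.
rng = np.random.default_rng(0)
clip = np.arange(8)[:, None, None, None] * np.ones((1, 4, 4, 3))
batch = [make_order_prediction_sample(clip, rng) for _ in range(4)]
```

Efficiency considerations (e.g., operating on sparse frame subsets rather than full clips) are exactly the open part of the problem; this sketch only fixes the supervisory signal, not the processing pipeline.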
References
Despite progress, defining optimal pretext tasks and efficient video processing remain open challenges.
— V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
(2603.14482 - Mur-Labadia et al., 15 Mar 2026) in Related work, Video Models (Section 7)