Identifying optimal pretext tasks and efficient video processing for self-supervised video learning

Identify optimal self-supervised pretext tasks for video representation learning, and develop efficient video processing methods that make such learning effective and scalable.

Background

The related work section surveys early and recent approaches to video self-supervision, including temporal order verification, contrastive learning, predictive modeling, and masked video modeling. It notes continued progress alongside unresolved design choices.
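To make one of these surveyed pretext tasks concrete, the sketch below shows the data-generation side of temporal order verification: a clip's frames are either kept in order or shuffled, and the model's task is to predict which. This is a minimal illustration assuming frames are represented by their indices; the function name and defaults are hypothetical, not from the paper.

```python
import random

def make_order_verification_sample(num_frames=5, seed=None):
    """Build one training example for temporal order verification.

    Returns a tuple (frame_order, label) where label is 1 if the
    frames are in correct temporal order and 0 if they were shuffled.
    Frames are stood in for by their indices; a real pipeline would
    use these indices to fetch decoded video frames.
    """
    rng = random.Random(seed)
    frames = list(range(num_frames))
    if rng.random() < 0.5:
        shuffled = frames[:]
        while shuffled == frames:  # make sure the order actually changes
            rng.shuffle(shuffled)
        return shuffled, 0  # wrong temporal order -> negative example
    return frames, 1  # correct temporal order -> positive example

clip, label = make_order_verification_sample(seed=42)
```

A classifier trained on such pairs must pick up on temporal cues (motion, causality) to succeed, which is the premise behind order-based pretext tasks.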

The authors explicitly state that selecting the best pretext tasks and designing efficient video processing pipelines remain unresolved, emphasizing the need for principled objectives and scalable processing strategies.

Despite this progress, defining optimal pretext tasks and efficient video processing pipelines remain open challenges.

References

Mur-Labadia et al., "V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning," arXiv:2603.14482, 15 Mar 2026. Discussed in Related work, Video Models (Section 7).