- The paper presents V-JEPA, a self-supervised model that predicts masked video representations to develop an understanding of intuitive physics.
- The study uses the violation-of-expectation paradigm and a surprise metric to quantitatively evaluate concepts like object permanence and gravity.
- Robust performance is demonstrated across benchmarks, including 98% zero-shot accuracy on IntPhys, outperforming pixel-based generative models and multimodal LLMs.
Intuitive Physics Understanding Emerges from Self-Supervised Pretraining on Natural Videos
The paper "Intuitive physics understanding emerges from self-supervised pretraining on natural videos" explores the capacity of general-purpose deep neural network models to develop an understanding of intuitive physics through self-supervised video prediction. Utilizing the violation-of-expectation paradigm, the authors evaluated how well these models comprehend concepts such as object permanence and shape consistency, comparing their performance to pixel-based generative models and multimodal LLMs (MLLMs).
Methodology and Approach
The paper builds on Joint Embedding Predictive Architectures (JEPAs), focusing on a video instantiation named V-JEPA. V-JEPA is trained to predict the representations of masked portions of a video: an encoder extracts representations from the visible content, and a predictor forecasts the representations of the unobserved portions in this learned space. A surprise metric is central to evaluating the models: it measures the prediction error between the predicted and the actually observed video representations.
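A minimal sketch of this training objective is given below, under toy assumptions: `encoder` and `predictor` are stand-in MLPs rather than the paper's Vision Transformers, and pooling the visible context into one vector is a simplification of the actual masked-token predictor.

```python
import torch
import torch.nn as nn

# Toy dimensions; the real V-JEPA operates on spatio-temporal video patches.
D_IN, D_REP, N_PATCHES = 128, 64, 16

encoder = nn.Sequential(nn.Linear(D_IN, D_REP), nn.GELU(), nn.Linear(D_REP, D_REP))
predictor = nn.Sequential(nn.Linear(D_REP, D_REP), nn.GELU(), nn.Linear(D_REP, D_REP))

def jepa_loss(video_patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Predict representations of masked patches from the visible ones.

    video_patches: (N_PATCHES, D_IN) patch features; mask: (N_PATCHES,) bool,
    True marks patches whose representations must be predicted.
    """
    # Targets are representations of the full video. Gradients are stopped so
    # the encoder cannot trivially collapse by chasing its own targets (the
    # real training uses a stop-gradient / EMA target encoder for this).
    with torch.no_grad():
        targets = encoder(video_patches)

    # Encode only the visible patches, pool them into a context vector, and
    # predict each masked patch's representation from that context.
    context = encoder(video_patches[~mask]).mean(dim=0, keepdim=True)
    predictions = predictor(context.expand(int(mask.sum()), -1))

    # Prediction error is computed in representation space, not pixel space.
    return (predictions - targets[mask]).abs().mean()

video = torch.randn(N_PATCHES, D_IN)
mask = torch.zeros(N_PATCHES, dtype=torch.bool)
mask[N_PATCHES // 2:] = True  # e.g. mask the later half (future content)
loss = jepa_loss(video, mask)
loss.backward()
```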
Figure 1: Video prediction in representation space (V-JEPA) achieves an understanding of intuitive physics.
The authors detail how V-JEPA uses the learned representation space to infer future states of a video and to quantify surprise. This approach outperforms models that predict directly in pixel space or reason over text.
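Continuing the toy setup above, a surprise score might be computed roughly as follows; the `surprise` helper and the past/future split are illustrative assumptions, not the paper's exact procedure.

```python
def surprise(past_patches: torch.Tensor, future_patches: torch.Tensor) -> float:
    """Surprise = error between the representations the model predicts for the
    future and the representations of what is actually observed."""
    with torch.no_grad():
        context = encoder(past_patches).mean(dim=0, keepdim=True)
        predicted = predictor(context.expand(future_patches.shape[0], -1))
        observed = encoder(future_patches)
        # High surprise means the observed outcome deviates from expectation.
        return (predicted - observed).abs().mean().item()

# For a trained model, a physically impossible continuation should score
# higher than a plausible one; with random weights this is just noise.
s = surprise(video[:N_PATCHES // 2], video[N_PATCHES // 2:])
```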
Evaluation of Intuitive Physics Understanding
The paper evaluates intuitive physics understanding on three datasets: IntPhys, GRASP, and InfLevel. These benchmarks test physical properties including object permanence, continuity, and gravity. V-JEPA consistently performs above chance across all three, reaching 98% zero-shot accuracy on the IntPhys benchmark.
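On paired benchmarks such as IntPhys, zero-shot accuracy can be scored by relative surprise: a matched pair of clips is classified correctly when the physically impossible clip elicits higher surprise than the possible one. A hedged sketch, reusing the toy `surprise` helper above:

```python
def pairwise_accuracy(pairs) -> float:
    """Relative-surprise classification over matched video pairs.

    pairs: list of (possible, impossible) clips, each a (frames, D_IN) tensor.
    A pair counts as correct when the impossible clip is more surprising.
    """
    correct = 0
    for possible, impossible in pairs:
        half = possible.shape[0] // 2  # observed context vs. continuation
        s_pos = surprise(possible[:half], possible[half:])
        s_imp = surprise(impossible[:half], impossible[half:])
        correct += s_imp > s_pos
    return correct / len(pairs)

# Toy usage with random tensors; the real evaluation uses IntPhys clip pairs.
pairs = [(torch.randn(N_PATCHES, D_IN), torch.randn(N_PATCHES, D_IN))
         for _ in range(10)]
print(pairwise_accuracy(pairs))  # ~0.5 for an untrained model
```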
Figure 2: V-JEPA accuracy increase relative to randomly-initialized models and humans across different physical properties and benchmarks.
V-JEPA's advantage over other models stems from its ability to encode high-level, abstract representations rather than relying on low-level pixel details, which makes it more robust and reliable at detecting physical violations.
Ablation studies reveal that several design choices influence the emergence of intuitive physics understanding: the type and amount of training data, the model size, and the choice of prediction task (masking strategy) are all significant factors.
Figure 3: Influence of type of mask, type and amount of training data, and model size on V-JEPA IntPhys scores.
V-JEPA achieves non-trivial performance even with smaller models or reduced training data, underscoring that video prediction in a learned representation space is sufficient for acquiring intuitive physics comprehension.
Implications and Future Directions
The results show that V-JEPA captures a broad range of intuitive physics concepts without requiring predefined abstractions, pointing to a path toward AI systems with a stronger grasp of the real world. Notably, current multimodal LLMs and pixel-based generative models fall short in comparison, highlighting room for improvement on intuitive physics tasks.

Figure 4: Different surprise measures are better suited for different tasks.
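To illustrate why the choice of surprise measure matters, per-frame prediction errors can be aggregated in different ways; the two aggregations below are illustrative assumptions built on the toy helpers above, not the paper's exact set of measures.

```python
def per_frame_surprise(past_patches, future_patches):
    """Surprise per future frame/patch rather than one pooled scalar."""
    with torch.no_grad():
        context = encoder(past_patches).mean(dim=0, keepdim=True)
        predicted = predictor(context.expand(future_patches.shape[0], -1))
        observed = encoder(future_patches)
        return (predicted - observed).abs().mean(dim=1)

errors = per_frame_surprise(video[:N_PATCHES // 2], video[N_PATCHES // 2:])
max_surprise = errors.max().item()    # sensitive to a single brief violation
mean_surprise = errors.mean().item()  # reflects sustained deviation instead
```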
Future studies could focus on adding memory to the model or leveraging hierarchical architectures to capture complex object interactions and longer-range contextual dependencies. Additionally, exploring video data similar to what infants see could yield insights into intuitive physics learning analogous to human development.
Conclusion
The paper demonstrates that intuitive physics understanding can emerge from self-supervised learning without hardwired knowledge structures. V-JEPA's performance across benchmarks confirms the effectiveness of predicting in representation space, marking a significant step toward bridging the gap between artificial systems and human-level perception. This work lays the groundwork for further research into enriching AI models' comprehension of the physical world, paving the way for intelligent systems that understand and interact with their environment in more human-like ways.