How Far is Video Generation from World Model: A Physical Law Perspective (2411.02385v2)
Abstract: OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io
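The testbed and evaluation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' simulator. It assumes a 1D setting with uniform motion and a perfectly elastic collision, and every function name, the velocity-based error metric, and the velocity ranges for the two splits are hypothetical choices made only for illustration.

```python
import random


def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities for a 1D perfectly elastic collision.

    Derived from conservation of momentum and kinetic energy.
    """
    total = m1 + m2
    u1 = ((m1 - m2) * v1 + 2.0 * m2 * v2) / total
    u2 = ((m2 - m1) * v2 + 2.0 * m1 * v1) / total
    return u1, u2


def uniform_motion(x0, v, dt, steps):
    """Ground-truth positions of an object moving at constant velocity v."""
    return [x0 + v * dt * k for k in range(steps)]


def velocity_error(pred_positions, true_positions, dt):
    """Mean absolute gap between finite-difference velocities.

    Hypothetical metric: a stand-in for a quantitative check of whether a
    predicted trajectory follows the same law as the simulated one.
    """
    pred_v = [(b - a) / dt for a, b in zip(pred_positions, pred_positions[1:])]
    true_v = [(b - a) / dt for a, b in zip(true_positions, true_positions[1:])]
    return sum(abs(p - t) for p, t in zip(pred_v, true_v)) / len(true_v)


def sample_velocity(split, rng):
    """Draw an initial velocity; the ranges are illustrative assumptions."""
    if split == "in_distribution":
        return rng.uniform(1.0, 4.0)   # range seen during training
    return rng.uniform(4.0, 6.0)       # held-out, out-of-distribution range


if __name__ == "__main__":
    rng = random.Random(0)
    v_ood = sample_velocity("out_of_distribution", rng)
    truth = uniform_motion(0.0, v_ood, dt=0.1, steps=10)
    # A "case-based" predictor that copies the closest training velocity (4.0)
    # instead of applying the law to the unseen velocity.
    copied = uniform_motion(0.0, 4.0, dt=0.1, steps=10)
    print(velocity_error(copied, truth, dt=0.1))
```

Under these assumptions, a predictor that has learned the law yields an error near zero on both splits, while one that mimics the nearest training case shows an error that grows with the distance between the test velocity and the training range.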