
Learning Physical Dynamics for Object-centric Visual Prediction (2403.10079v1)

Published 15 Mar 2024 in cs.CV and cs.AI

Abstract: The ability to model the underlying dynamics of visual scenes and reason about the future is central to human intelligence. Many attempts have been made to empower intelligent systems with such physical understanding and prediction abilities. However, most existing methods focus on pixel-to-pixel prediction, which incurs heavy computational costs while lacking a deep understanding of the physical dynamics behind videos. Recently, object-centric prediction methods have emerged and attracted increasing interest. Inspired by this line of work, this paper proposes an unsupervised object-centric prediction model that makes future predictions by learning the visual dynamics between objects. Our model consists of two modules: a perceptual module and a dynamic module. The perceptual module decomposes images into several objects and synthesizes images from a set of object-centric representations. The dynamic module fuses contextual information, takes environment-object and object-object interactions into account, and predicts the future trajectories of objects. Extensive experiments are conducted to validate the effectiveness of the proposed method. Both quantitative and qualitative results demonstrate that our model generates predictions with higher visual quality and greater physical reliability than state-of-the-art methods.
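To make the two-module design described in the abstract concrete, below is a minimal PyTorch sketch of an object-centric predictor: a perceptual module that encodes a frame into a fixed number of object vectors and decodes them back to an image, and a dynamic module that combines object-object messages with an environment-object term before a recurrent state update. All class names, layer sizes, the sum-pooled pairwise interactions, and the placeholder context vector are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a two-module object-centric video predictor (assumed design).
import torch
import torch.nn as nn


class PerceptualModule(nn.Module):
    """Decomposes a frame into K object vectors and reconstructs frames from them."""

    def __init__(self, num_objects=3, obj_dim=32, img_channels=3):
        super().__init__()
        self.num_objects, self.obj_dim = num_objects, obj_dim
        self.encoder = nn.Sequential(              # image -> K object vectors
            nn.Conv2d(img_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(num_objects * obj_dim),
        )
        self.decoder = nn.Sequential(              # object vectors -> image
            nn.Linear(num_objects * obj_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, frame):                       # (B, C, 64, 64) -> (B, K, D)
        return self.encoder(frame).view(-1, self.num_objects, self.obj_dim)

    def decode(self, objects):                     # (B, K, D) -> (B, C, 64, 64)
        return self.decoder(objects.flatten(1))


class DynamicsModule(nn.Module):
    """Predicts next object states from object-object and environment-object terms."""

    def __init__(self, obj_dim=32, ctx_dim=16):
        super().__init__()
        self.pairwise = nn.Sequential(nn.Linear(2 * obj_dim, 64), nn.ReLU(),
                                      nn.Linear(64, obj_dim))
        self.environment = nn.Sequential(nn.Linear(obj_dim + ctx_dim, 64), nn.ReLU(),
                                         nn.Linear(64, obj_dim))
        self.update = nn.GRUCell(obj_dim, obj_dim)

    def forward(self, objects, context):           # objects: (B, K, D), context: (B, ctx)
        B, K, D = objects.shape
        # Object-object interactions: sum messages over all object pairs.
        oi = objects.unsqueeze(2).expand(B, K, K, D)
        oj = objects.unsqueeze(1).expand(B, K, K, D)
        messages = self.pairwise(torch.cat([oi, oj], dim=-1)).sum(dim=2)
        # Environment-object interaction via a shared context vector.
        ctx = context.unsqueeze(1).expand(B, K, context.shape[-1])
        env = self.environment(torch.cat([objects, ctx], dim=-1))
        # Recurrent per-object state update.
        effect = (messages + env).reshape(B * K, D)
        return self.update(effect, objects.reshape(B * K, D)).view(B, K, D)


if __name__ == "__main__":
    perceptual, dynamics = PerceptualModule(), DynamicsModule()
    frames = torch.rand(2, 3, 64, 64)              # a batch of observed frames
    context = torch.rand(2, 16)                    # placeholder environment context
    objects = perceptual.encode(frames)            # decompose into object vectors
    rollout = []
    for _ in range(5):                             # autoregressive future rollout
        objects = dynamics(objects, context)
        rollout.append(perceptual.decode(objects))
    print(rollout[-1].shape)                       # torch.Size([2, 3, 64, 64])
```

In this kind of design, prediction happens entirely in the object-state space: frames are decoded only when pixel outputs are needed, which is what lets object-centric methods avoid the cost of pixel-to-pixel rollouts.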

