A Critical View of Vision-Based Long-Term Dynamics Prediction Under Environment Misalignment (2305.07648v2)
Abstract: Dynamics prediction, the problem of predicting the future states of scene objects from their current and prior states, is drawing increasing attention as an instance of learning physics. To address this problem, the Region Proposal Convolutional Interaction Network (RPCIN), a vision-based model, was proposed and achieved state-of-the-art performance in long-term prediction. RPCIN takes only raw images and simple object descriptions, such as the bounding box and segmentation mask of each object, as input. Despite its success, however, the model's capability can be compromised under environment misalignment. In this paper, we investigate two challenging conditions of environment misalignment, Cross-Domain and Cross-Context, and propose four datasets designed for these challenges: SimB-Border, SimB-Split, BlenB-Border, and BlenB-Split. The datasets cover two domains and two contexts. Using RPCIN as a probe, experiments on combinations of the proposed datasets reveal potential weaknesses of the vision-based long-term dynamics prediction model. Furthermore, we propose a promising direction for mitigating the Cross-Domain challenge and provide concrete evidence that it dramatically alleviates the challenge on the proposed datasets.