State-space Decomposition Model for Video Prediction Considering Long-term Motion Trend (2404.11576v1)
Abstract: Stochastic video prediction models the uncertainty in future motion, thereby better reflecting the dynamic nature of the environment. Stochastic video prediction methods based on image auto-regressive recurrent models must feed their own predictions back into the latent space. Conversely, state-space models, which decouple frame synthesis from temporal prediction, prove to be more efficient. However, inferring long-term temporal information about motion and generalizing to dynamic scenarios under non-stationary assumptions remain unresolved challenges. In this paper, we propose a state-space decomposition stochastic video prediction model that decomposes overall video frame generation into deterministic appearance prediction and stochastic motion prediction. This adaptive decomposition enhances the model's generalization to dynamic scenarios. For motion prediction, obtaining a prior on the long-term trend of future motion is crucial; in the stochastic motion prediction branch, we therefore infer the long-term motion trend from the conditional frames to guide the generation of future frames that remain highly consistent with them. Experimental results demonstrate that our model outperforms baselines on multiple datasets.
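To make the described decomposition concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): one deterministic branch carries appearance forward, one stochastic branch samples motion latents conditioned on a long-term trend vector inferred from the conditional frames, and a decoder combines the two per time step. All module names, dimensions, and the use of GRU cells are illustrative assumptions.

```python
# Illustrative sketch only; module names, sizes, and cell types are assumptions,
# not the paper's architecture.
import torch
import torch.nn as nn

class AppearanceBranch(nn.Module):
    """Deterministic appearance prediction rolled forward from the last conditional frame."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRUCell(dim, dim)

    def forward(self, h_app, step_input):
        return self.rnn(step_input, h_app)

class TrendEncoder(nn.Module):
    """Infers a long-term motion trend latent from all conditional frame encodings."""
    def __init__(self, dim, z_dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, z_dim)

    def forward(self, cond_encodings):            # (B, T_cond, dim)
        _, h_last = self.rnn(cond_encodings)      # (1, B, dim)
        return self.head(h_last.squeeze(0))       # (B, z_dim)

class MotionBranch(nn.Module):
    """Stochastic motion prediction; the prior is conditioned on the trend latent."""
    def __init__(self, dim, z_dim):
        super().__init__()
        self.prior = nn.Linear(dim + z_dim, 2 * z_dim)  # outputs mean and log-variance
        self.rnn = nn.GRUCell(z_dim, dim)

    def forward(self, h_mot, trend):
        mu, logvar = self.prior(torch.cat([h_mot, trend], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised sample
        return self.rnn(z, h_mot)

class DecomposedPredictor(nn.Module):
    def __init__(self, dim=128, z_dim=32):
        super().__init__()
        self.trend = TrendEncoder(dim, z_dim)
        self.appearance = AppearanceBranch(dim)
        self.motion = MotionBranch(dim, z_dim)
        self.decoder = nn.Linear(2 * dim, dim)    # stand-in for a frame decoder

    def forward(self, cond_encodings, horizon):
        trend = self.trend(cond_encodings)        # long-term trend from conditional frames
        h_app = cond_encodings[:, -1]
        h_mot = torch.zeros_like(h_app)
        frames = []
        for _ in range(horizon):
            h_app = self.appearance(h_app, h_app)       # deterministic appearance update
            h_mot = self.motion(h_mot, trend)           # stochastic motion update
            frames.append(self.decoder(torch.cat([h_app, h_mot], dim=-1)))
        return torch.stack(frames, dim=1)         # (B, horizon, dim)

if __name__ == "__main__":
    model = DecomposedPredictor()
    cond = torch.randn(2, 5, 128)                 # 5 conditional frame encodings
    print(model(cond, horizon=10).shape)          # torch.Size([2, 10, 128])
```

The separation keeps frame synthesis (the decoder) outside the recurrence, which is the efficiency argument made for state-space style predictors in the abstract; the trend latent is what ties the sampled motion back to the conditional frames.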