Action-conditioned video data improves predictability (2404.05439v1)

Published 8 Apr 2024 in cs.CV

Abstract: Long-term video generation and prediction remain challenging tasks in computer vision, particularly in partially observable scenarios where cameras are mounted on moving platforms. The interaction between observed image frames and the motion of the recording agent introduces additional complexities. To address these issues, we introduce the Action-Conditioned Video Generation (ACVG) framework, a novel approach that investigates the relationship between actions and generated image frames through a deep dual Generator-Actor architecture. ACVG generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another in dynamic environments. We evaluate the framework's effectiveness on an indoor robot motion dataset consisting of sequences of image frames paired with the sequences of actions taken by the robotic agent. The evaluation comprises a comprehensive empirical study comparing ACVG to other state-of-the-art frameworks and a detailed ablation study.

