Action-conditioned Deep Visual Prediction with RoAM, a new Indoor Human Motion Dataset for Autonomous Robots (2306.15852v1)
Abstract: With the increasing adoption of robots across industries, it is crucial to develop advanced algorithms that enable robots to anticipate, comprehend, and plan their actions effectively in collaboration with humans. We introduce the Robot Autonomous Motion (RoAM) video dataset, collected with a custom-made TurtleBot3 Burger robot in a variety of indoor environments, recording various human motions from the robot's ego-vision. The dataset also includes synchronized records of the LiDAR scans and all control actions taken by the robot as it navigates around static and moving human agents. This unique dataset provides an opportunity to develop and benchmark new visual prediction frameworks that predict future image frames conditioned on the actions taken by the recording agent, in partially observable scenarios or in cases where the imaging sensor is mounted on a moving platform. We benchmark the dataset on our novel deep visual prediction framework, ACPNet, in which the predicted future image frames are also conditioned on the actions taken by the robot, and demonstrate its potential for incorporating robot dynamics into the video prediction paradigm for mobile robotics and autonomous navigation research.
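The abstract does not spell out ACPNet's architecture, so the following is only a minimal PyTorch sketch of the core idea it describes: conditioning the predicted next frame on the robot's control action. The class name, layer sizes, the 64x64 frame resolution, and the 2-D action format (linear and angular velocity, typical of a TurtleBot3 Burger) are all illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of action-conditioned next-frame prediction in the
# spirit of ACPNet; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):  # illustrative name
    def __init__(self, action_dim: int = 2, hidden: int = 128):
        super().__init__()
        # Encode a 3x64x64 RGB frame into an 8x8 feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),       # 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),      # 16x16
            nn.Conv2d(64, hidden, 4, stride=2, padding=1), nn.ReLU(),  # 8x8
        )
        # Tile the robot action over the spatial grid and fuse it with the
        # image features; this is where the prediction becomes conditioned
        # on the action taken by the recording agent.
        self.fuse = nn.Conv2d(hidden + action_dim, hidden, 3, padding=1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(frame)                          # (B, hidden, 8, 8)
        b, _, h, w = feat.shape
        a = action.view(b, -1, 1, 1).expand(-1, -1, h, w)   # broadcast action
        fused = torch.relu(self.fuse(torch.cat([feat, a], dim=1)))
        return self.decoder(fused)                          # predicted frame

# Usage: predict the next ego-vision frame from the current frame and action.
model = ActionConditionedPredictor()
frame = torch.rand(4, 3, 64, 64)     # batch of ego-vision RGB frames
action = torch.rand(4, 2)            # (linear velocity, angular velocity)
next_frame = model(frame, action)    # (4, 3, 64, 64)
```

Rolled out over multiple steps, such a model can be fed its own predictions together with a planned action sequence, which is what makes action-conditioned prediction useful for navigation planning around moving human agents.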