Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset (2307.00818v2)
Abstract: In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with manually labeled textual descriptions, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is high-precision, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from a wide variety of scenes. In addition, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.
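To make the annotation layout concrete, below is a minimal, hypothetical Python sketch of how one Motion-X-style sequence could be represented: per-frame SMPL-X parameters (body, hands, jaw, expression) together with a sequence-level semantic label and frame-level pose descriptions. The dataclass, field names, and array shapes are illustrative assumptions based on the standard SMPL-X parameterization, not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class WholeBodyMotionSequence:
    """One motion sequence with per-frame SMPL-X parameters and text labels.

    Hypothetical layout: follows the standard SMPL-X parameterization;
    Motion-X's real storage format may differ.
    """
    semantic_label: str           # sequence-level label, e.g. "a person waves both hands"
    pose_descriptions: List[str]  # one fine-grained whole-body description per frame
    transl: np.ndarray            # (T, 3)  global translation
    global_orient: np.ndarray     # (T, 3)  root orientation, axis-angle
    body_pose: np.ndarray         # (T, 63) 21 body joints x 3 axis-angle values
    left_hand_pose: np.ndarray    # (T, 45) 15 hand joints x 3
    right_hand_pose: np.ndarray   # (T, 45)
    jaw_pose: np.ndarray          # (T, 3)
    expression: np.ndarray        # (T, 10) facial expression coefficients
    betas: np.ndarray             # (10,)   body shape, assumed shared across frames

    def __post_init__(self) -> None:
        # Frame-level text and per-frame parameters must agree in length.
        num_frames = self.transl.shape[0]
        assert len(self.pose_descriptions) == num_frames
        for name in ("global_orient", "body_pose", "left_hand_pose",
                     "right_hand_pose", "jaw_pose", "expression"):
            assert getattr(self, name).shape[0] == num_frames, f"{name}: wrong frame count"

# Example: a 2-frame dummy sequence illustrating the expected shapes.
seq = WholeBodyMotionSequence(
    semantic_label="a person raises the right hand and smiles",
    pose_descriptions=["right arm lifted, mouth neutral", "right arm raised, smiling"],
    transl=np.zeros((2, 3)),
    global_orient=np.zeros((2, 3)),
    body_pose=np.zeros((2, 63)),
    left_hand_pose=np.zeros((2, 45)),
    right_hand_pose=np.zeros((2, 45)),
    jaw_pose=np.zeros((2, 3)),
    expression=np.zeros((2, 10)),
    betas=np.zeros(10),
)
print(seq.body_pose.shape)  # (2, 63)
```

In this sketch, the sequence-level label corresponds to the 81.1K semantic labels and the per-frame descriptions to the 15.6M pose descriptions mentioned in the abstract; the SMPL-X parameter dimensions (63 body, 45 per hand, 10 expression) are the commonly used defaults.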