Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset (2307.00818v2)

Published 3 Jul 2023 in cs.CV

Abstract: In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses and lack facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected in limited laboratory scenes with manually labeled textual descriptions, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline that can automatically annotate motion from either single- or multi-view videos, providing a comprehensive semantic label for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is highly precise, cost-effective, and scalable for further research. Using it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from a wide variety of scenes. In addition, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X for expressive, diverse, and natural motion generation, as well as for 3D whole-body human mesh recovery.
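The abstract describes each Motion-X sequence as per-frame SMPL-X pose annotations paired with frame-level pose descriptions and a sequence-level semantic label. The sketch below illustrates one plausible in-memory layout for such a sample in Python; the dataclass, field names, and dummy text are illustrative assumptions that follow the standard SMPL-X parameterization, not the dataset's official file format.

# A minimal sketch of how one whole-body motion sample of this kind could be
# represented. Field names and dimensions follow the standard SMPL-X
# parameterization; the overall layout is an assumption for illustration only.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class WholeBodyMotionSample:
    """One motion sequence with per-frame SMPL-X poses and text annotations."""

    # Per-frame SMPL-X parameters (T = number of frames).
    root_orient: np.ndarray      # (T, 3)   global orientation, axis-angle
    body_pose: np.ndarray        # (T, 63)  21 body joints x 3 axis-angle params
    left_hand_pose: np.ndarray   # (T, 45)  15 hand joints x 3
    right_hand_pose: np.ndarray  # (T, 45)
    jaw_pose: np.ndarray         # (T, 3)
    expression: np.ndarray       # (T, 10)  facial expression coefficients
    transl: np.ndarray           # (T, 3)   global translation
    betas: np.ndarray            # (10,)    body shape, shared across the sequence

    # Text annotations described in the paper.
    sequence_label: str                                           # sequence-level semantic label
    frame_descriptions: List[str] = field(default_factory=list)   # one description per frame


def make_dummy_sample(num_frames: int = 8) -> WholeBodyMotionSample:
    """Build a zero-filled placeholder sample, purely for illustration."""
    def zeros(*shape):
        return np.zeros(shape, dtype=np.float32)

    return WholeBodyMotionSample(
        root_orient=zeros(num_frames, 3),
        body_pose=zeros(num_frames, 63),
        left_hand_pose=zeros(num_frames, 45),
        right_hand_pose=zeros(num_frames, 45),
        jaw_pose=zeros(num_frames, 3),
        expression=zeros(num_frames, 10),
        transl=zeros(num_frames, 3),
        betas=zeros(10),
        sequence_label="a person waves with the right hand",
        frame_descriptions=["right arm raised, fingers spread"] * num_frames,
    )


if __name__ == "__main__":
    sample = make_dummy_sample()
    print(sample.body_pose.shape, sample.sequence_label)

In practice the released files may pack these parameters differently (for example, as a single concatenated vector per frame), so a real loader would need to be adapted to the actual on-disk layout.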

