Inter-X: Towards Versatile Human-Human Interaction Analysis

Published 26 Dec 2023 in cs.CV | (2312.16051v1)

Abstract: The analysis of ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions and lack hand gestures and fine-grained textual descriptions. To better perceive and generate human-human interactions, we propose Inter-X, currently the largest human-human interaction dataset, with accurate body movements, diverse interaction patterns, and detailed hand gestures. The dataset includes ~11K interaction sequences and more than 8.1M frames. We also equip Inter-X with versatile annotations: more than 34K fine-grained human part-level textual descriptions, semantic interaction categories, interaction order, and the relationship and personality of the subjects. Based on these elaborate annotations, we propose a unified benchmark composed of 4 categories of downstream tasks covering both perceptual and generative directions. Extensive experiments and comprehensive analysis show that Inter-X serves as a testbed for promoting the development of versatile human-human interaction analysis. Our dataset and benchmark will be publicly available for research purposes.
