Inter-X: Towards Versatile Human-Human Interaction Analysis
Abstract: The analysis of ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions and lack hand gestures and fine-grained textual descriptions. To better perceive and generate human-human interactions, we propose Inter-X, currently the largest human-human interaction dataset, with accurate body movements, diverse interaction patterns, and detailed hand gestures. The dataset includes ~11K interaction sequences and more than 8.1M frames. We also equip Inter-X with versatile annotations: more than 34K fine-grained human part-level textual descriptions, semantic interaction categories, interaction order, and the relationship and personality of the subjects. Based on these annotations, we propose a unified benchmark composed of 4 categories of downstream tasks from both the perceptual and generative directions. Extensive experiments and comprehensive analysis show that Inter-X serves as a testbed for promoting the development of versatile human-human interaction analysis. Our dataset and benchmark will be publicly available for research purposes.