ReGenNet: Towards Human Action-Reaction Synthesis (2403.11882v1)

Published 18 Mar 2024 in cs.CV and cs.AI

Abstract: Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines, and can generalize to unseen actor motions and viewpoint changes.

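The abstract mentions an explicit distance-based interaction loss used alongside the diffusion-based generation of reactor motion. The snippet below is a minimal, hypothetical sketch of what such a loss could look like, assuming actor and reactor motions are represented as 3D joint positions; the tensor shapes, function name, and the choice of pairwise joint distances are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a distance-based interaction loss (not the authors' code).
# Assumes actor/reactor motions are 3D joint positions of shape (B, T, J, 3).
import torch
import torch.nn.functional as F

def interaction_distance_loss(actor_joints, reactor_pred, reactor_gt):
    """Penalize mismatch in actor-reactor pairwise joint distances.

    actor_joints : (B, T, J, 3) conditioning actor joint positions
    reactor_pred : (B, T, J, 3) reactor joints predicted by the generative model
    reactor_gt   : (B, T, J, 3) ground-truth reactor joints
    """
    # Flatten batch and time so cdist sees (B*T, J, 3) point sets.
    a = actor_joints.flatten(0, 1)
    # Pairwise Euclidean distances between every actor joint and every reactor joint.
    d_pred = torch.cdist(a, reactor_pred.flatten(0, 1))  # (B*T, J, J)
    d_gt   = torch.cdist(a, reactor_gt.flatten(0, 1))    # (B*T, J, J)
    # Matching the predicted relative-distance map to the ground-truth one
    # encourages the generated reaction to preserve proximity/contact patterns.
    return F.l1_loss(d_pred, d_gt)
```

In practice such a term would typically be added to the standard denoising objective with a weighting factor; the exact formulation used in ReGenNet may differ from this sketch.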