ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions (2311.17057v3)
Abstract: Current approaches for 3D human motion synthesis generate high-quality animations of digital humans performing a wide variety of actions and gestures. However, a notable technological gap exists in addressing the complex dynamics of multi-human interactions within this paradigm. In this work, we present ReMoS, a denoising-diffusion-based model that synthesizes the full-body reactive motion of a person in a two-person interaction scenario. Given the motion of one person, we employ a combined spatio-temporal cross-attention mechanism to synthesize the reactive body and hand motion of the second person, thereby completing the interaction between the two. We demonstrate ReMoS across challenging two-person scenarios such as pair dancing, Ninjutsu, kickboxing, and acrobatics, where one person's movements have complex and diverse influences on the other. We also contribute the ReMoCap dataset for two-person interactions, containing full-body and finger motions. We evaluate ReMoS through multiple quantitative metrics, qualitative visualizations, and a user study, and also indicate its usability in interactive motion editing applications.