
in2IN: Leveraging individual Information to Generate Human INteractions (2404.09988v1)

Published 15 Apr 2024 in cs.CV

Abstract: Generating human-human motion interactions conditioned on textual descriptions is useful in many areas, such as robotics, gaming, animation, and the metaverse. Alongside this utility comes the difficulty of modeling high-dimensional inter-personal dynamics; properly capturing the intra-personal diversity of interactions also poses significant challenges. Current methods generate interactions with limited diversity of intra-person dynamics due to limitations in the available datasets and conditioning strategies. To address this, we introduce in2IN, a novel diffusion model for human-human motion generation that is conditioned not only on the textual description of the overall interaction but also on individual descriptions of the actions performed by each person involved. To train this model, we use an LLM to extend the InterHuman dataset with individual descriptions. As a result, in2IN achieves state-of-the-art performance on the InterHuman dataset. Furthermore, to increase intra-personal diversity on existing interaction datasets, we propose DualMDM, a model composition technique that combines motions generated by in2IN with motions generated by a single-person motion prior pre-trained on HumanML3D. DualMDM generates motions with higher individual diversity and improves control over intra-person dynamics while maintaining inter-personal coherence.
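The abstract describes two mechanisms: conditioning a diffusion denoiser on both an interaction-level text description and per-person descriptions, and composing an interaction model with a single-person motion prior. The sketch below illustrates one plausible way such combinations are often implemented in diffusion models (classifier-free guidance extended to two conditions, and a simple interpolation of two denoiser predictions). The function names, weights, and the blending rule are illustrative assumptions, not the formulas from the paper.

```python
import numpy as np


def dual_condition_guidance(eps_uncond, eps_inter, eps_indiv,
                            w_inter=2.0, w_indiv=1.0):
    """Combine noise predictions in the style of classifier-free
    guidance, extended to two text conditions: the overall
    interaction description and the individual descriptions.
    The guidance weights here are illustrative only."""
    return (eps_uncond
            + w_inter * (eps_inter - eps_uncond)
            + w_indiv * (eps_indiv - eps_uncond))


def blend_priors(eps_interaction, eps_single, alpha=0.5):
    """DualMDM-style model composition sketch: interpolate between
    the interaction model's prediction and a single-person motion
    prior's prediction. The actual composition schedule in the
    paper may differ; alpha=0.5 is an arbitrary example."""
    return alpha * eps_interaction + (1.0 - alpha) * eps_single
```

In a sampling loop, `dual_condition_guidance` would be applied at every denoising step to steer the motion toward both condition signals, and `blend_priors` would trade off inter-personal coherence (interaction model) against individual diversity (single-person prior).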
