
Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models (2310.02242v1)

Published 3 Oct 2023 in cs.CV and cs.GR

Abstract: This paper presents a novel approach to generating the 3D motion of a human interacting with a target object, with a focus on solving the challenge of synthesizing long-range and diverse motions, which could not be fulfilled by existing auto-regressive models or path planning-based methods. We propose a hierarchical generation framework to solve this challenge. Specifically, our framework first generates a set of milestones and then synthesizes the motion along them. Therefore, the long-range motion generation could be reduced to synthesizing several short motion sequences guided by milestones. The experiments on the NSM, COUCH, and SAMP datasets show that our approach outperforms previous methods by a large margin in both quality and diversity. The source code is available on our project page https://zju3dv.github.io/hghoi.
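The two-stage idea in the abstract (first generate milestones, then synthesize short motion segments between them) can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses diffusion models for both stages, whereas here `sample_milestones` and `infill_segment` are hypothetical stand-ins that use jittered waypoints and linear interpolation on 3D points, only to show how a long-range sequence decomposes into milestone-guided short segments.

```python
import random


def lerp(a, b, t):
    """Linearly interpolate between two points given as lists of floats."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def sample_milestones(start, goal, num_milestones, rng):
    """Hypothetical stage 1: place noisy waypoints between start and goal.

    A real system would sample these from a learned generative model.
    """
    milestones = [list(start)]
    for i in range(1, num_milestones + 1):
        t = i / (num_milestones + 1)
        point = lerp(start, goal, t)
        milestones.append([p + rng.gauss(0.0, 0.1) for p in point])
    milestones.append(list(goal))
    return milestones


def infill_segment(a, b, frames_per_segment):
    """Hypothetical stage 2: synthesize a short segment from a towards b.

    Linear interpolation stands in for a learned short-range motion model.
    The segment starts at a and stops one frame before b.
    """
    return [lerp(a, b, k / frames_per_segment) for k in range(frames_per_segment)]


def generate_long_motion(start, goal, num_milestones=3, frames_per_segment=10, seed=0):
    """Generate milestones first, then fill in each consecutive pair."""
    rng = random.Random(seed)
    milestones = sample_milestones(start, goal, num_milestones, rng)
    motion = []
    for a, b in zip(milestones[:-1], milestones[1:]):
        motion.extend(infill_segment(a, b, frames_per_segment))
    motion.append(milestones[-1])  # close the sequence on the final milestone
    return milestones, motion


milestones, motion = generate_long_motion([0.0, 0.0, 0.0], [2.0, 0.0, 1.0])
```

Because each segment depends only on its two bounding milestones, the length of the full sequence grows with the number of milestones while each synthesis call stays short, which is the reduction the abstract describes.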
