
Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction (2404.00562v2)

Published 31 Mar 2024 in cs.CV

Abstract: This paper introduces the first text-guided method for generating 3D hand-object interaction sequences. The main challenge arises from the lack of labeled data: existing ground-truth datasets do not generalize across interaction types and object categories, which inhibits modeling diverse 3D hand-object interactions with correct physical implications (e.g., contacts and semantics) from text prompts. To address this challenge, we propose to decompose the interaction generation task into two subtasks: hand-object contact generation and hand-object motion generation. For contact generation, a VAE-based network takes a text prompt and an object mesh as input and generates the probability of contact between the surfaces of the hands and the object during the interaction. The network learns a variety of local geometric structures across diverse objects, independent of object category, and is therefore applicable to general objects. For motion generation, a Transformer-based diffusion model uses this 3D contact map as a strong prior to generate physically plausible hand-object motion conditioned on text prompts, learning from an augmented labeled dataset in which we annotate text labels for many existing 3D hand and object motion data. Finally, we further introduce a hand refiner module that minimizes the distance between the object surface and the hand joints to improve the temporal stability of hand-object contacts and to suppress penetration artifacts. In the experiments, we demonstrate that our method generates more realistic and diverse interactions than baseline methods and is applicable to unseen objects. We will release our model and newly labeled data as a strong foundation for future research. Code and data are available at: https://github.com/JunukCha/Text2HOI.
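
The hand refiner described in the abstract pulls predicted hand joints toward the object surface while suppressing penetration. The sketch below illustrates what such a refinement objective might look like; the tensor shapes, the `refine_loss` function, and the margin-based penetration heuristic are illustrative assumptions, not the authors' implementation (which operates on mesh surfaces and the paper's contact maps).

```python
# Minimal sketch of a contact-aware hand-refinement objective,
# assuming point-cloud object geometry and per-joint contact probabilities.
import torch

def refine_loss(hand_joints, object_points, contact_prob, eps=0.005):
    """
    hand_joints:   (T, J, 3)  hand joint trajectory over T frames
    object_points: (T, P, 3)  sampled object surface points per frame
    contact_prob:  (J,)       predicted probability that each joint is in contact
    """
    # Distance from every joint to every surface point: (T, J, P)
    d = torch.cdist(hand_joints, object_points)
    nearest = d.min(dim=-1).values  # (T, J) distance to nearest surface point

    # Attraction: joints predicted to be in contact should touch the surface.
    attract = (contact_prob * nearest).mean()

    # Penetration (assumed heuristic): penalize joints closer than a small
    # margin, standing in for a proper signed-distance test against the mesh.
    penetrate = torch.relu(eps - nearest).mean()

    return attract + penetrate

# Toy usage: optimize a random trajectory against a random point cloud.
T, J, P = 8, 21, 512
joints = torch.randn(T, J, 3, requires_grad=True)
points = torch.randn(T, P, 3)
prob = torch.rand(J)

opt = torch.optim.Adam([joints], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = refine_loss(joints, points, prob)
    loss.backward()
    opt.step()
```

Weighting the attraction term by the contact probability means joints the contact network marks as free (probability near zero) are left alone, which is what keeps refinement from gluing the whole hand to the object.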

Authors (4)
  1. Junuk Cha (10 papers)
  2. Jihyeon Kim (7 papers)
  3. Jae Shin Yoon (22 papers)
  4. Seungryul Baek (32 papers)
Citations (7)
