DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions (2403.17827v2)

Published 26 Mar 2024 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract: Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. In this paper, we propose a novel method, dubbed DiffH2O, which can synthesize realistic one- or two-handed object interactions from a provided text prompt and the geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based manipulation stage and use separate diffusion models for each. In the grasping stage, the model generates only hand motions, whereas in the manipulation stage both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses and helps in generating realistic hand-object interactions. Third, we propose two different guidance schemes to allow more control over the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the manipulation stage. For the textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to achieve more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions.
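The abstract describes grasp guidance as steering the diffusion sampler so that the grasping stage ends at a given target hand pose. The sketch below illustrates this idea in a generic DDPM-style sampling loop: at each denoising step, the model's predicted clean motion is nudged so that its final frame moves toward the target grasp. It is a minimal illustration, not the authors' released code; the `model` and `diffusion` objects, their attributes (`num_frames`, `pose_dim`, `num_steps`), and the `q_posterior_sample` helper are hypothetical assumptions standing in for whatever denoiser and scheduler are actually used.

```python
import torch
import torch.nn.functional as F


def grasp_guided_sample(model, diffusion, text_emb, target_grasp, guidance_scale=0.1):
    """Illustrative sketch of grasp guidance during diffusion sampling.

    `model` is assumed to predict the clean motion x0_hat from a noisy sample,
    a timestep, and a text embedding; `diffusion` is assumed to expose the
    usual DDPM quantities. All names are hypothetical.
    """
    # Start from Gaussian noise over the whole motion sequence.
    x = torch.randn(1, diffusion.num_frames, diffusion.pose_dim)

    for t in reversed(range(diffusion.num_steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)

        # Predict the clean motion for the current noisy sample.
        x0_hat = model(x, t_batch, text_emb)

        # Grasp guidance: compute a gradient that pulls the final frame of the
        # predicted motion toward the target grasp pose, and apply it.
        with torch.enable_grad():
            x0_guided = x0_hat.detach().requires_grad_(True)
            loss = F.mse_loss(x0_guided[:, -1], target_grasp)
            grad = torch.autograd.grad(loss, x0_guided)[0]
        x0_hat = (x0_hat - guidance_scale * grad).detach()

        # Standard reverse-diffusion step using the guided prediction
        # (hypothetical helper for sampling from q(x_{t-1} | x_t, x0_hat)).
        x = diffusion.q_posterior_sample(x0_hat, x, t)

    return x
```

Because the guidance loss only touches the last frame, earlier frames remain free to vary, which matches the paper's framing that a single target grasp constrains how the grasping stage ends while the motion leading up to it stays diverse.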

Authors (8)
  1. Sammy Christen (21 papers)
  2. Shreyas Hampali (13 papers)
  3. Fadime Sener (21 papers)
  4. Edoardo Remelli (14 papers)
  5. Tomas Hodan (22 papers)
  6. Eric Sauser (3 papers)
  7. Shugao Ma (19 papers)
  8. Bugra Tekin (22 papers)
Citations (8)
