Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation (2312.07063v3)

Published 12 Dec 2023 in cs.CV

Abstract: Reconstructing human-object interaction in 3D from a single RGB image is a challenging task, and existing data-driven methods do not generalize beyond the objects present in carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both plausible interaction and diverse object variation. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting humans and unseen objects without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that require template meshes, and that our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data are released.
