
DECO: Dense Estimation of 3D Human-Scene Contact In The Wild (2309.15273v1)

Published 26 Sep 2023 in cs.CV

Abstract: Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de.
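To make "vertex-level contact on the SMPL body" concrete, the sketch below treats a contact estimate as one binary label per SMPL mesh vertex (the SMPL template has 6890 vertices) and compares a prediction against an annotation. This is illustrative only and is not the authors' code. The helper names `binarize` and `contact_scores`, the 0.5 threshold, the toy data, and the choice of precision, recall, and F1 as scores are all assumptions, not details from the abstract. The vertex indices used are arbitrary and do not correspond to real body parts.

```python
NUM_SMPL_VERTICES = 6890  # vertex count of the SMPL body mesh


def binarize(probabilities, threshold=0.5):
    """Turn per-vertex contact probabilities into 0/1 contact labels.

    The 0.5 threshold is an assumed default, not a value from the paper.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def contact_scores(predicted, annotated):
    """Return (precision, recall, F1) of predicted vs. annotated contact labels."""
    if len(predicted) != len(annotated):
        raise ValueError("both label vectors must cover the same vertices")
    pairs = list(zip(predicted, annotated))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # contact found correctly
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # contact hallucinated
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # contact missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy annotation (made up): the first 200 vertices are marked as in contact.
annotated = [1 if i < 200 else 0 for i in range(NUM_SMPL_VERTICES)]

# Toy detector output (made up): high contact probability on vertices 100..299.
probabilities = [0.9 if 100 <= i < 300 else 0.1 for i in range(NUM_SMPL_VERTICES)]

predicted = binarize(probabilities)
precision, recall, f1 = contact_scores(predicted, annotated)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the prediction overlaps half of the annotated region and adds an equally large false region, so all three scores come out at 0.5. A real evaluation would load the annotated labels from a dataset such as DAMON rather than constructing them by index.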

Authors (6)
  1. Shashank Tripathi
  2. Agniv Chatterjee
  3. Jean-Claude Passy
  4. Hongwei Yi
  5. Dimitrios Tzionas
  6. Michael J. Black
