RHOBIN Challenge: Reconstruction of Human Object Interaction (2401.04143v1)
Abstract: Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is, however, a very challenging task due to heavy occlusion and complex dynamics: it requires understanding not only 3D human pose and object pose but also the interaction between them. Reconstruction of 3D humans and reconstruction of objects have long been two separate research fields in computer vision. We hence proposed the first RHOBIN challenge, on the reconstruction of human-object interactions, in conjunction with the RHOBIN workshop. It aimed to bring together the research communities of human reconstruction, object reconstruction, and interaction modeling to discuss techniques and exchange ideas. Our challenge consists of three tracks of 3D reconstruction from monocular RGB images, with a focus on challenging interaction scenarios. It attracted more than 100 participants and more than 300 submissions, indicating broad interest across the research communities. This paper describes the settings of our challenge and discusses the winning methods of each track in more detail. We observe that the human reconstruction task is becoming mature even under heavy occlusion, while object pose estimation and joint reconstruction remain challenging. With the growing interest in interaction modeling, we hope this report can provide useful insights and foster future research in this direction. Our workshop website can be found at https://rhobin-challenge.github.io/.