Monocular Human-Object Reconstruction in the Wild (2407.20566v2)
Abstract: Learning prior knowledge of the 3D human-object spatial relation is crucial for reconstructing human-object interaction from images and for understanding how humans interact with objects in 3D space. Previous works learn this prior from datasets collected in controlled environments, but because real-world domains are so diverse, they struggle to generalize to real-world scenarios. To overcome this limitation, we present a 2D-supervised method that learns the 3D human-object spatial relation prior purely from 2D images in the wild. Our method uses a flow-based neural network to learn the prior distribution of the 2D human-object keypoint layout and viewports for each image in the dataset. We demonstrate the effectiveness of the prior learned from 2D images on the human-object reconstruction task, where the prior is applied to tune the relative pose between the human and the object during the post-optimization stage. To validate and benchmark our method on in-the-wild images, we collect the WildHOI dataset from YouTube; it covers a variety of interactions with 8 objects in real-world scenarios. We conduct experiments on the indoor BEHAVE dataset and the outdoor WildHOI dataset. The results show that our method achieves performance nearly comparable to fully 3D-supervised methods on the BEHAVE dataset, even though it uses only 2D layout information, and that it outperforms previous methods in generality and interaction diversity on in-the-wild images.