HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image (2309.07891v5)
Abstract: This paper presents a method to learn a hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image. Both inference and training-data generation for 3D hand-object scene reconstruction are challenging due to the depth ambiguity of a single image and occlusions by the hand and object. We turn this challenge into an opportunity by utilizing the hand shape to constrain the possible relative configuration of the hand and object geometry. We design a generalizable implicit function, HandNeRF, that explicitly encodes the correlation between 3D hand shape features and 2D object features to predict the hand and object scene geometry. With experiments on real-world datasets, we show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods. Moreover, we demonstrate that object reconstruction from HandNeRF ensures more accurate execution of downstream tasks, such as grasping and motion planning for robotic hand-over and manipulation. Homepage: https://samsunglabs.github.io/HandNeRF-project-page/