Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models (2401.12978v3)
Abstract: Understanding the inherent human knowledge in interacting with a given environment (e.g., affordance) is essential for improving AI to better assist humans. While existing approaches primarily focus on human-object contacts during interactions, such affordance representation cannot fully address other important aspects of human-object interactions (HOIs), i.e., patterns of relative positions and orientations. In this paper, we introduce a novel affordance representation, named Comprehensive Affordance (ComA). Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes, capturing plausible patterns of contact, relative orientations, and spatial relationships. To construct the distribution, we present a novel pipeline that synthesizes diverse and realistic 3D HOI samples given any 3D object mesh. The pipeline leverages a pre-trained 2D inpainting diffusion model to generate HOI images from object renderings and lifts them into 3D. To avoid the generation of false affordances, we propose a new inpainting framework, Adaptive Mask Inpainting. Since ComA is built on synthetic samples, it can extend to any object in an unbounded manner. Through extensive experiments, we demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance. Importantly, we also showcase the potential of ComA to reconstruct human-object interactions in 3D through an optimization framework, highlighting its advantage in incorporating both contact and non-contact properties.
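To make the idea of a distribution over proximity more concrete, the sketch below accumulates, for every pair of object point and human vertex, a histogram of their distance across a set of 3D HOI samples, and then reads off a simple contact score per object point. This is only an illustrative sketch, not the authors' implementation: the function name `proximity_distribution`, the distance bins, the "first bin means contact" rule, and the random toy data are all assumptions, and the relative-orientation part of ComA is left out entirely.

```python
import numpy as np

def proximity_distribution(object_points, human_samples, bin_edges):
    """Histogram object-point/human-vertex distances across HOI samples.

    object_points: (P, 3) points sampled on the object (hypothetical input).
    human_samples: (S, V, 3) human vertices for S samples in the object frame.
    bin_edges:     (B + 1,) increasing distance bin edges.
    Returns a (P, V, B) array; each [p, v, :] slice sums to 1.
    """
    num_points = object_points.shape[0]
    num_samples, num_vertices, _ = human_samples.shape
    num_bins = len(bin_edges) - 1
    counts = np.zeros((num_points, num_vertices, num_bins))
    p_idx, v_idx = np.meshgrid(
        np.arange(num_points), np.arange(num_vertices), indexing="ij"
    )
    for human in human_samples:
        # Pairwise Euclidean distances between object points and human vertices.
        dist = np.linalg.norm(object_points[:, None, :] - human[None, :, :], axis=-1)
        # Map each distance to a bin index, clamping out-of-range values.
        bins = np.clip(np.digitize(dist, bin_edges) - 1, 0, num_bins - 1)
        counts[p_idx, v_idx, bins] += 1
    return counts / num_samples

# Toy data (assumption): random points stand in for real meshes and samples.
rng = np.random.default_rng(0)
object_points = rng.uniform(-0.5, 0.5, size=(5, 3))
human_samples = rng.normal(scale=0.5, size=(20, 8, 3))
bin_edges = np.array([0.0, 0.05, 0.2, 0.5, 1.0, np.inf])

hist = proximity_distribution(object_points, human_samples, bin_edges)

# Assumed contact rule: probability mass in the closest distance bin,
# maximized over human vertices, gives one score per object point.
contact_score = hist[..., 0].max(axis=1)
print(hist.shape, contact_score.shape)
```

Keeping the full histogram rather than only the contact score is what lets such a representation describe non-contact patterns, for example a body part that tends to stay at a characteristic distance from the object.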
Authors: Hyeonwoo Kim, Sookwan Han, Patrick Kwon, Hanbyul Joo