AffordanceLLM: Grounding Affordance from Vision Language Models (2401.06341v2)
Abstract: Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires comprehensive understanding of a scene in multiple aspects: detection, localization, and recognition of objects and their parts; the geo-spatial configuration/layout of the scene; 3D shapes and physics; and the functionality and potential interactions between objects and humans. Much of this knowledge is hidden and lies beyond the image content and the supervised labels of a limited training set. In this paper, we attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge of pretrained large-scale vision language models. On the AGD20K benchmark, our proposed model demonstrates a significant performance gain over competing methods for in-the-wild object affordance grounding. We further demonstrate that it can ground affordances for objects in random Internet images, even if both the objects and actions are unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/
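To make the task setup described in the abstract concrete, here is a minimal sketch of the affordance-grounding interface: the input is an RGB image together with an action label (e.g., "hold"), and the output is a per-pixel heatmap indicating where that action can be performed. The function name, array shapes, and the random stand-in prediction are illustrative assumptions, not the paper's actual model or implementation; a real system would query a pretrained vision language model at the marked step.

```python
import numpy as np

def ground_affordance(image: np.ndarray, action: str) -> np.ndarray:
    """Hypothetical affordance-grounding interface.

    Given an RGB image of shape (H, W, 3) and an action such as "hold"
    or "cut", return a heatmap of shape (H, W) in [0, 1] indicating
    where the action can be performed. A real system would query a
    pretrained vision language model here; this stub returns a random
    map purely to illustrate the input/output contract.
    """
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    heatmap = rng.random((h, w))      # stand-in prediction
    return heatmap / heatmap.max()    # normalize to [0, 1]

# Toy usage: a blank image and the action "hold".
image = np.zeros((224, 224, 3), dtype=np.uint8)
heatmap = ground_affordance(image, "hold")
region = heatmap > 0.5                # binarize to an interaction region
print(heatmap.shape, float(region.mean()))
```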