What does CLIP know about peeling a banana? (2404.12015v1)
Abstract: Humans show an innate capability to identify tools that support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts according to the tasks they afford is crucial for enabling intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support only a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language Models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning in models.
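The abstract's core claim is that a frozen CLIP already encodes affordance-relevant information that can be queried with free-form action prompts. The snippet below is a minimal sketch of that probing idea only, not the authors' AffordanceCLIP architecture (which adds a small set of trained parameters on top of frozen CLIP). The checkpoint name, the example image path, and the MaskCLIP-style trick of projecting patch tokens into the joint embedding space are assumptions made for illustration.

```python
# Minimal sketch (assumptions noted above): probe a frozen CLIP with an
# open-vocabulary action prompt and read out a coarse patch-level response map.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("banana.jpg").convert("RGB")   # placeholder image path
prompt = "peel"                                   # free-form action prompt

inputs = processor(text=[prompt], images=image,
                   return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    # Action prompt embedding in CLIP's joint space, shape [1, 512]
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

    # Vision transformer tokens; index 0 is the CLS token, the rest are 7x7 patches
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patch_tokens = vision_out.last_hidden_state[:, 1:, :]        # [1, 49, 768]

    # Project patch tokens into the joint space. CLIP only trains this projection
    # on the pooled CLS token, so treating patches this way is a heuristic probe.
    patch_emb = model.visual_projection(
        model.vision_model.post_layernorm(patch_tokens))         # [1, 49, 512]

    # Cosine similarity between each patch and the prompt -> coarse 7x7 heatmap
    patch_emb = patch_emb / patch_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    heatmap = (patch_emb @ text_emb.unsqueeze(-1)).reshape(7, 7)

print(heatmap)  # zero-shot response of frozen CLIP to the action "peel"
```

Such a raw similarity map is coarse and noisy; the paper's contribution is a lightweight, trainable decoding path that turns this implicit signal into affordance segmentations without pixel-level or action-object supervision.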