TEXT2AFFORD: Probing Object Affordance Prediction abilities of Language Models solely from Text (2402.12881v2)
Abstract: We investigate the knowledge of object affordances in pre-trained language models (PTLMs) and pre-trained Vision-Language Models (VLMs). A growing body of literature shows that PTLMs fail inconsistently and non-intuitively, demonstrating a lack of reasoning and grounding. As a first step toward quantifying the effect of grounding (or the lack thereof), we curate a novel and comprehensive dataset of object affordances -- Text2Afford -- characterized by 15 affordance classes. Unlike affordance datasets collected in the vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improved affordance knowledge in both PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks and presents insights into LM capabilities, advancing the understanding of object affordances. Code and data are available at https://github.com/sayantan11995/Affordance
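To illustrate how such text-only affordance probing can be framed, the sketch below converts a (sentence, object, affordance) triple into NLI-style premise–hypothesis pairs, one per candidate affordance class. The class names, templates, and helper function are hypothetical illustrations, not the dataset's actual schema or the authors' exact prompting setup.

```python
# Hypothetical sketch: framing object-affordance probing as NLI-style
# premise/hypothesis pairs that could be scored by a pre-trained LM.
# The affordance labels below are an illustrative subset, not the
# paper's actual 15-class inventory.

AFFORDANCE_CLASSES = ["grasp", "lift", "push", "sit on", "pour from"]

def to_nli_pairs(sentence: str, obj: str, classes=AFFORDANCE_CLASSES):
    """Build one premise/hypothesis pair per candidate affordance class."""
    return [
        {
            "premise": sentence,
            "hypothesis": f"A person can {aff} the {obj}.",
            "candidate": aff,
        }
        for aff in classes
    ]

pairs = to_nli_pairs("The ceramic vase stood on the shelf.", "vase")
for p in pairs:
    print(p["hypothesis"])
```

Each pair would then be fed to an entailment model (or a cloze-style prompt for masked LMs), and the per-class entailment scores interpreted as the model's affordance predictions.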