Implicit Affordance Acquisition via Causal Action-Effect Modeling in the Video Domain (2312.11345v1)

Published 18 Dec 2023 in cs.CL and cs.AI

Abstract: Affordance knowledge is a fundamental aspect of commonsense knowledge. Recent findings indicate that world knowledge emerges through large-scale self-supervised pretraining, motivating our exploration of acquiring affordance knowledge from the visual domain. To this end, we augment an existing instructional video resource to create the new Causal Action-Effect (CAE) dataset and design two novel pretraining tasks -- Masked Action Modeling (MAM) and Masked Effect Modeling (MEM) -- promoting the acquisition of two affordance properties in models: behavior and entity equivalence, respectively. We empirically demonstrate the effectiveness of our proposed methods in learning affordance properties. Furthermore, we show that a model pretrained on both tasks outperforms a strong image-based visual-linguistic foundation model (FLAVA) as well as pure linguistic models on a zero-shot physical reasoning probing task.
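The abstract does not spell out how the two pretraining objectives work, so the following is a rough conceptual sketch of masked segment modeling, not the authors' implementation: a clip is split into action and effect segments, one segment is masked, and the model must recover it from context via a contrastive objective. Masking the action segment corresponds to MAM; masking the effect segment to MEM. All names, dimensions, and the mean-pooling "model" below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8               # embedding dimension (hypothetical)
num_segments = 5    # action/effect segments from one toy clip
segments = rng.normal(size=(num_segments, D))

def masked_segment_loss(segments, mask_idx):
    """Predict a masked segment embedding from the mean of its context,
    scored against all segments with an in-batch contrastive loss."""
    context = np.delete(segments, mask_idx, axis=0)
    prediction = context.mean(axis=0)  # stand-in for a real encoder
    # Cosine similarity between the prediction and every candidate.
    sims = segments @ prediction / (
        np.linalg.norm(segments, axis=1) * np.linalg.norm(prediction) + 1e-9
    )
    # Cross-entropy against the index of the true (masked) segment.
    log_probs = sims - np.log(np.exp(sims).sum())
    return -log_probs[mask_idx]

# MAM masks an action segment; MEM masks an effect segment --
# the same objective applied to a different target.
mam_loss = masked_segment_loss(segments, mask_idx=2)
```

The contrastive formulation is one plausible reading; a reconstruction (regression) loss over the masked embedding would be an equally valid sketch of the same masking idea.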

References (68)
  1. Building affordance relations for robotic agents - a review.
  2. PROST: Physical reasoning about objects through space and time. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4597–4608, Online. Association for Computational Linguistics.
  3. Affordances from human videos as a versatile representation for robotics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 1–13. IEEE.
  4. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, Montreal, Quebec, Canada. Association for Computational Linguistics.
  5. Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain. Association for Computational Linguistics.
  6. Experience grounds language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8718–8735, Online. Association for Computational Linguistics.
  7. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432–7439. AAAI Press.
  8. Simulating action dynamics with neural process networks.
  9. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods.
  10. Mining semantic affordances of visual object categories. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4259–4267.
  11. Learning affordances for categorizing objects and their properties. In 2010 20th International Conference on Pattern Recognition, pages 3089–3092.
  12. Learning the effects of physical actions in a multi-modal environment. In Findings of the Association for Computational Linguistics: EACL 2023, pages 133–148, Dubrovnik, Croatia. Association for Computational Linguistics.
  13. Denoising auto-encoders for learning of objects and tools affordances in continuous space.
  14. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
  15. Stepformer: Self-supervised step discovery and localization in instructional videos. CoRR, abs/2304.13265.
  16. Comparing trajectory and vision modalities for verb representation. CoRR, abs/2303.12737.
  17. Katrin Erk. 2007. A simple, similarity-based model for selectional preferences. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 216–223.
  18. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211.
  19. Ronald A Fisher. 1949. The design of experiments.
  20. Physical causality of action verbs in grounded language understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1814–1824.
  21. What action causes this? towards naive physical action-effect prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 934–945.
  22. James J Gibson. 1977. The theory of affordances. Hilldale, USA, 1(2):67–82.
  23. ACT-thor: A controlled benchmark for embodied action understanding in simulated environments. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5597–5612, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  24. Visual affordance and function understanding: A survey. ACM Comput. Surv., 54(3):47:1–47:35.
  26. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society.
  27. Finding "it": Weakly-supervised reference-aware visual grounding in instructional videos. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 5948–5957. Computer Vision Foundation / IEEE Computer Society.
  28. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pages 1769–1782. PMLR.
  29. Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pages 287–318. PMLR.
  30. Enhancing object, action, and effect recognition using probabilistic affordances. Adapt. Behav., 27(5):295–306.
  31. Exploring the limits of language modeling. CoRR, abs/1602.02410.
  32. The kinetics human action video dataset.
  33. Extending VerbNet with novel verb classes. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
  34. AI2-THOR: an interactive 3d environment for visual AI. CoRR, abs/1712.05474.
  35. Mining youtube - A dataset for learning fine-grained action concepts from webly supervised video data. CoRR, abs/1906.01012.
  36. Beth Levin and Malka Rappaport Hovav. 2010. Lexicalized scales and verbs of scalar change.
  37. HERO: Hierarchical encoder for Video+Language omni-representation pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2046–2065, Online. Association for Computational Linguistics.
  38. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  39. Daniel Loureiro and Alípio Jorge. 2018. Affordance extraction and inference based on semantic role labeling. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 91–96, Brussels, Belgium. Association for Computational Linguistics.
  40. Phrase-based affordance detection via cyclic bilateral interaction. CoRR, abs/2202.12076.
  41. One-shot affordance detection. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 895–901. ijcai.org.
  42. Pretraining on interactions for learning grounded affordance representations. In Proceedings of the 11th Joint Conference on Lexical and Computational Semantics, pages 258–277, Seattle, Washington. Association for Computational Linguistics.
  43. End-to-end learning of visual representations from uncurated instructional videos. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 9876–9886. Computer Vision Foundation / IEEE.
  44. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 2630–2640. IEEE.
  45. George A. Miller. 1994. WordNet: A lexical database for English. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
  46. Tushar Nagarajan and Kristen Grauman. 2020. Learning affordance landscapes for interaction exploration in 3d environments. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  47. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 1–9. Association for Computational Linguistics.
  48. The World of an Octopus: How Reporting Bias Influences a Language Model’s Perception of Color. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 823–835, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  49. ISP: Learning inferential selectional preferences. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 564–571.
  50. Michele Persiani and Thomas Hellström. 2019. Unsupervised inference of object affordance from text corpora. In 22nd Nordic Conference on Computational Linguistics (NoDaLiDa’19), September 30 – October 2, 2019, Turku, Finland. Association for Computational Linguistics.
  51. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
  52. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.
  53. To afford or not to afford: A new formalization of affordances toward Affordance-Based robot control. Adapt. Behav., 15(4):447–472.
  54. Learning action-effect dynamics for hypothetical vision-language reasoning task. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5914–5924, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  55. Reasoning about actions over visual and linguistic modalities: A survey. CoRR, abs/2207.07568.
  56. Vered Shwartz and Yejin Choi. 2020. Do neural language models overcome reporting bias? In Proceedings of the 28th International Conference on Computational Linguistics, pages 6863–6870, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  57. FLAVA: A foundational language and vision alignment model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15617–15629. IEEE.
  58. COIN: A large-scale dataset for comprehensive instructional video analysis. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 1207–1216. Computer Vision Foundation / IEEE.
  59. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  60. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4582–4591.
  61. Partafford: Part-level affordance discovery from 3d objects. CoRR, abs/2202.13519.
  62. Situation recognition: Visual semantic role labeling for image understanding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5534–5542. IEEE Computer Society.
  63. PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2040–2050, Online. Association for Computational Linguistics.
  64. One-shot object affordance detection in the wild. Int. J. Comput. Vis., 130(10):2472–2500.
  65. Actionformer: Localizing moments of actions with transformers. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IV, volume 13664 of Lecture Notes in Computer Science, pages 492–510. Springer.
  66. Procedure-aware pretraining for instructional video understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 10727–10738. IEEE.
  67. Understanding tools: Task-oriented object modeling, learning and recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 2855–2864. IEEE Computer Society.
  68. Cross-task weakly supervised learning from instructional videos. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 3537–3545. Computer Vision Foundation / IEEE.
