Resilience through Scene Context in Visual Referring Expression Generation (2404.12289v2)
Abstract: Scene context is well known to facilitate humans' perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.
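The abstract does not specify the exact perturbation scheme, so as a rough illustration only, the sketch below assumes the obscuring is a linear interpolation between the target's visual feature vector and Gaussian noise, with a `noise_level` parameter controlling the degree of degradation (the function name `obscure_target` and the 2048-dimensional embedding are hypothetical, not from the paper):

```python
import numpy as np

def obscure_target(features: np.ndarray, noise_level: float, rng=None) -> np.ndarray:
    """Interpolate a target feature vector toward random noise.

    noise_level=0.0 returns the original features unchanged;
    noise_level=1.0 replaces them entirely with noise, mimicking
    a target whose visual information is completely missing.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Scale the noise to the feature vector's own spread so that
    # intermediate mixtures stay in a comparable range.
    noise = rng.standard_normal(features.shape) * features.std()
    return (1.0 - noise_level) * features + noise_level * noise

# Example: degrade a (hypothetical) 2048-d target embedding in steps
# and check how much of the original signal survives.
target = np.random.default_rng(0).standard_normal(2048)
for level in (0.0, 0.5, 1.0):
    perturbed = obscure_target(target, level, rng=np.random.default_rng(1))
    print(level, np.corrcoef(target, perturbed)[0, 1])
```

Under this assumed scheme, the correlation with the original features falls from 1.0 at `noise_level=0.0` to roughly 0 at `noise_level=1.0`, which is the regime the paper probes when asking whether scene context alone suffices to recover the referent's type.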