Resolving References in Visually-Grounded Dialogue via Text Generation (2309.13430v1)

Published 23 Sep 2023 in cs.CL, cs.AI, and cs.CV

Abstract: Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal language model to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.
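At inference time, the pipeline described in the abstract reduces to a ranking problem: the fine-tuned language model emits a definite description, and a frozen VLM scores that description against each candidate image, picking the best match zero-shot. Below is a minimal sketch of the scoring step using cosine similarity between embeddings; the toy vectors and the function name are illustrative stand-ins, not the paper's implementation (a real system would use e.g. a CLIP-style text tower and image tower to produce the embeddings):

```python
import numpy as np

def rank_referents(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Rank candidate images by cosine similarity to a description embedding.

    text_emb:   (d,) embedding of the generated definite description
    image_embs: (n, d) embeddings of the n candidate images
    Returns candidate indices, best match first.
    """
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ t  # cosine similarity of each candidate to the description
    return np.argsort(-sims)

# Toy example with hand-made 3-d "embeddings":
desc = np.array([1.0, 0.0, 0.2])
candidates = np.array([
    [0.9, 0.1, 0.3],   # close to the description
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.1],  # opposite
])
order = rank_referents(desc, candidates)
print(order)  # best-matching candidate index first
```

The zero-shot property comes from the VLM staying frozen: only the description generator is fine-tuned, so the retrieval step is plain nearest-neighbor search in the VLM's joint embedding space.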

Authors (3)
  1. Bram Willemsen
  2. Livia Qian
  3. Gabriel Skantze
Citations (2)