Listener Model for the PhotoBook Referential Game with CLIPScores as Implicit Reference Chain (2306.09607v1)
Abstract: PhotoBook is a collaborative dialogue game in which two players receive private, partially overlapping sets of images and must resolve which images they have in common. The game challenges machines to learn how people build common ground around a multimodal context to communicate effectively. Methods developed in the literature, however, cannot be deployed in real gameplay: they tackle only some subtasks of the game, and they require additional reference-chain inputs whose extraction process is imperfect. We therefore propose a reference chain-free listener model that directly addresses the game's predictive task, i.e., deciding whether an image is shared with the partner. Our DeBERTa-based listener model reads the full dialogue and uses CLIPScore features to assess utterance-image relevance. It achieves >77% accuracy on unseen sets of images/game themes, outperforming the baseline by >17 points.
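The utterance-image relevance feature named in the abstract follows the CLIPScore definition of Hessel et al. (2021): w · max(cos(text, image), 0) with w = 2.5. A minimal sketch, using toy vectors in place of real CLIP embeddings (which would come from a CLIP text/image encoder):

```python
import numpy as np

def clipscore(text_emb: np.ndarray, image_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore: w * max(cosine_similarity(text, image), 0)."""
    cos = float(np.dot(text_emb, image_emb) /
                (np.linalg.norm(text_emb) * np.linalg.norm(image_emb)))
    return w * max(cos, 0.0)

# Toy 2-D embeddings standing in for real CLIP outputs.
utterance = np.array([0.6, 0.8])   # hypothetical text embedding
image_a   = np.array([0.6, 0.8])   # perfectly aligned image
image_b   = np.array([0.8, -0.6])  # orthogonal image

print(clipscore(utterance, image_a))  # → 2.5 (maximum relevance)
print(clipscore(utterance, image_b))  # → 0.0 (no relevance)
```

In the listener model, one such score per (utterance, image) pair can be fed alongside the dialogue text, letting the text encoder condition on how strongly each utterance describes each candidate image.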
- Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Proc. ECCV.
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
- Susan E Brennan and Herbert H Clark. 1996. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition.
- Herbert H Clark and Deanna Wilkes-Gibbs. 1986. Referring as a collaborative process. Cognition.
- Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proc. CVPR.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proc. NAACL.
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. ICLR.
- Michael C Frank and Noah D Goodman. 2012. Predicting pragmatic reasoning in language games. Science.
- Daniel Fried, Justin T. Chiu, and Dan Klein. 2021. Reference-centric models for grounded collaborative dialogue. In Proc. EMNLP.
- Janosch Haber, Tim Baumgärtner, Ece Takmaz, Lieke Gelderloos, Elia Bruni, and Raquel Fernández. 2019. The PhotoBook dataset: Building common ground through visually-grounded dialogue. In Proc. ACL.
- He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. 2017. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proc. ACL.
- Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a unified view of parameter-efficient transfer learning. In Proc. ICLR.
- Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proc. ICLR.
- Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A reference-free evaluation metric for image captioning. In Proc. EMNLP.
- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proc. ICML.
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In Proc. ICLR.
- Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. ECCV.
- Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In Proc. ICLR.
- Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proc. NeurIPS.
- Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual reasoning with a general conditioning layer. In Proc. AAAI.
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proc. ICML.
- Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. SuperGlue: Learning feature matching with graph neural networks. In Proc. CVPR.
- Ece Takmaz, Mario Giulianelli, Sandro Pezzelle, Arabella Sinclair, and Raquel Fernández. 2020. Refer, reuse, reduce: Generating subsequent references in visual and conversational contexts. In Proc. EMNLP.
- Tao Tu, Qing Ping, Govindarajan Thattai, Gokhan Tur, and Prem Natarajan. 2021. Learning better visual dialog agents with pretrained visual-linguistic representation. In Proc. CVPR.
- Takuma Udagawa and Akiko Aizawa. 2019. A natural language corpus of common grounding under continuous and partially-observable context. In Proc. AAAI.
- Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Proc. ICML.
- Shih-Lun Wu and Yi-Hsuan Yang. 2022. MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE. IEEE/ACM TASLP.
- Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. 2021. SegFormer: Simple and efficient design for semantic segmentation with Transformers. In Proc. NeurIPS.
- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proc. ICLR.
- Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2017. Scene parsing through ADE20K dataset. In Proc. CVPR.