
Pragmatic Inference with a CLIP Listener for Contrastive Captioning (2306.08818v1)

Published 15 Jun 2023 in cs.CL

Abstract: We propose a simple yet effective and robust method for contrastive captioning: generating discriminative captions that distinguish target images from very similar alternative distractor images. Our approach is built on a pragmatic inference procedure that formulates captioning as a reference game between a speaker, which produces possible captions describing the target, and a listener, which selects the target given the caption. Unlike previous methods that derive both speaker and listener distributions from a single captioning model, we leverage an off-the-shelf CLIP model to parameterize the listener. Compared with captioner-only pragmatic models, our method benefits from CLIP's rich vision-language alignment representations when reasoning over distractors. Like previous methods for discriminative captioning, our method uses a hyperparameter to control the tradeoff between informativity (how likely captions are to allow a human listener to discriminate the target image) and fluency. However, we find that our method is substantially more robust to the value of this hyperparameter than past methods, which allows us to automatically optimize captions for informativity, outperforming past methods for discriminative captioning by 11% to 15% accuracy in human evaluations.
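The reference-game reranking the abstract describes can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it assumes speaker log-probabilities for each candidate caption and CLIP caption-image similarity logits have already been computed, and it combines them with an interpolation weight `lam` (the informativity/fluency hyperparameter). All function and variable names here are illustrative.

```python
import numpy as np

def listener_logprobs(clip_sims):
    """Listener distribution over images for each caption.

    clip_sims: array of shape (num_captions, num_images) holding
    CLIP caption-image similarity logits. The listener is a softmax
    over the candidate images for each caption.
    """
    z = clip_sims - clip_sims.max(axis=1, keepdims=True)  # stabilize exp
    p = np.exp(z)
    return np.log(p / p.sum(axis=1, keepdims=True))

def rerank(speaker_logprobs, clip_sims, target_idx, lam):
    """Pick the caption maximizing a fluency/informativity mixture.

    speaker_logprobs: shape (num_captions,), captioner log-probabilities.
    target_idx: index of the target image among the distractors.
    lam: tradeoff weight; lam=0 is pure fluency, lam=1 pure informativity.
    """
    informativity = listener_logprobs(clip_sims)[:, target_idx]
    scores = (1.0 - lam) * speaker_logprobs + lam * informativity
    return int(np.argmax(scores))
```

For example, with two candidate captions and two images, a caption that the captioner finds slightly less fluent but that CLIP strongly associates with the target wins once `lam` is large enough, which is the informativity/fluency tradeoff the hyperparameter controls.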

Authors (3)
  1. Jiefu Ou (9 papers)
  2. Benno Krojer (8 papers)
  3. Daniel Fried (69 papers)
Citations (4)
