Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment (arXiv:2312.03766v2)

Published 5 Dec 2023 in cs.CL and cs.CV

Abstract: While existing image-text alignment models achieve high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method that provides detailed textual and visual explanations of detected misalignments between text-image pairs. We leverage LLMs and visual grounding models to automatically construct a training set that contains plausible misaligned captions for a given image, along with corresponding textual explanations and visual indicators. We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines on both the binary alignment classification and the explanation generation tasks. Our method's code and human-curated test set are available at: https://mismatch-quest.github.io/
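The pipeline described in the abstract has two automated stages: an LLM perturbs a ground-truth caption to produce a plausible misaligned caption plus a textual explanation of the mismatch, and a visual grounding model localizes the mismatched phrase in the image to produce the visual indicator. Below is a minimal sketch of that data-construction loop under stated assumptions: `perturb_caption` and `ground_phrase` are hypothetical stubs standing in for the real LLM prompt and grounding-model calls, and the bounding box is a dummy value; this is not the authors' implementation.

```python
# Illustrative sketch of the training-set construction described in the
# abstract. All function bodies are stubs; a real pipeline would call an
# LLM (caption perturbation + explanation) and a visual grounding model.
from dataclasses import dataclass

@dataclass
class MisalignmentExample:
    image_path: str
    misaligned_caption: str                      # plausible but wrong caption
    textual_feedback: str                        # explanation of the mismatch
    visual_feedback: tuple[int, int, int, int]   # (x0, y0, x1, y1) box

def perturb_caption(caption: str) -> tuple[str, str]:
    """Stand-in for the LLM stage: swap one entity/attribute and return
    the misaligned caption plus a textual explanation."""
    misaligned = caption.replace("dog", "cat")   # hard-coded for illustration
    feedback = "The caption mentions a cat, but the image shows a dog."
    return misaligned, feedback

def ground_phrase(image_path: str, phrase: str) -> tuple[int, int, int, int]:
    """Stand-in for the grounding stage: an open-vocabulary detector would
    return a box around the region the mismatched phrase refers to."""
    return (42, 17, 180, 150)                    # dummy coordinates

def build_example(image_path: str, caption: str) -> MisalignmentExample:
    misaligned, feedback = perturb_caption(caption)
    box = ground_phrase(image_path, "dog")       # localize the contradicted entity
    return MisalignmentExample(image_path, misaligned, feedback, box)

print(build_example("coco_000001.jpg", "A dog jumping over a fence."))
```

Tuples assembled this way (image, misaligned caption, textual feedback, bounding box) would then serve as supervision for fine-tuning a vision-language model to emit both the explanation and the visual indicator, per the abstract.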
