
See, Say, and Segment: Teaching LMMs to Overcome False Premises (2312.08366v1)

Published 13 Dec 2023 in cs.CV

Abstract: Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the desired objects if they exist. Additionally, we introduce a novel False Premise Correction benchmark dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches, but under false premise conditions produces relative cIoU improvements of more than 31% over baselines, and produces natural language feedback judged helpful up to 67% of the time.
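
The abstract describes a three-stage cascade: verify the query's premise ("see"), respond in natural language ("say"), and only then produce a mask ("segment"). The sketch below illustrates that control flow under stated assumptions; the method names used on the model objects (object_present, correct_premise, predict) and the CascadeResult container are hypothetical placeholders for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CascadeResult:
    present: bool        # "see": is the referred object actually in the image?
    feedback: str        # "say": natural-language reply to the user
    mask: Optional[Any]  # "segment": binary mask, or None under a false premise

def see_say_segment(image: Any, query: str, lmm: Any, mask_head: Any) -> CascadeResult:
    """Cascaded inference sketch: check the premise before segmenting.

    `lmm` stands in for an instruction-tuned multimodal model and
    `mask_head` for a segmentation decoder; the interfaces assumed
    here are placeholders, not the paper's actual implementation.
    """
    # "See": ask the LMM whether the queried object exists, instead of
    # segmenting blindly as prior fine-tuned segmentation LMMs do.
    if not lmm.object_present(image, query):
        # "Say": surface the false premise -- deny it, propose an
        # alternative query, or correct a semantic error in the query.
        return CascadeResult(
            present=False,
            feedback=lmm.correct_premise(image, query),
            mask=None,
        )

    # "Segment": the premise holds, so emit a mask for the referent.
    return CascadeResult(
        present=True,
        feedback=f"Found and segmented: {query}",
        mask=mask_head.predict(image, query),
    )
```

Gating segmentation on an explicit presence check is what lets the model abstain under a false premise rather than hallucinate a mask for an absent object. On the metric: in the referring-segmentation literature, cIoU conventionally denotes cumulative IoU over the evaluation set, cIoU = (Σ_i |P_i ∩ G_i|) / (Σ_i |P_i ∪ G_i|) for predicted masks P_i and ground-truth masks G_i; assuming that convention, the reported relative improvement compares this quantity between methods.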
