DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception (2405.15232v4)

Published 24 May 2024 in cs.CV and cs.CL

Abstract: The development of LLMs has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data, such as inputs whose orientation, quantity, color, or structure they can hardly distinguish. This is primarily due to their reliance on image encoders trained to encode images into task-relevant features, which may lead them to discard details that are irrelevant to the training task. Delving into the modeling capabilities of diffusion models for images naturally prompts the question: can diffusion models serve as the eyes of LLMs for image perception? In this paper, we propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. This addresses the drawbacks of previous methods that solely relied on image encoders like CLIP-ViT, thereby enhancing the model's resilience against out-of-distribution samples and reducing visual hallucinations. Importantly, this is achieved without requiring additional training modules and with fewer training parameters. We extensively evaluated DEEM on our newly constructed RobustVQA benchmark as well as on POPE and MMVP, two well-known benchmarks for visual hallucination and perception. In particular, DEEM improves the LMM's visual perception performance by a large margin (e.g., 4% higher on RobustVQA, 6.5% higher on MMVP, and 12.8% higher on POPE). Compared to state-of-the-art interleaved content generation models, DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while using fewer trainable parameters, less pre-training data (10%), and a smaller base model size.
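
To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of how a diffusion model's denoising loss, conditioned on image-encoder features, can act as generative feedback that regularizes the encoder. Everything here (the TinyImageEncoder and TinyConditionalDenoiser modules, the noising schedule, and the 0.1 loss weight) is an illustrative assumption, not the paper's architecture or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyImageEncoder(nn.Module):
    """Stand-in for a CLIP-ViT-style image encoder (assumed, toy-sized)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)


class TinyConditionalDenoiser(nn.Module):
    """Predicts the noise added to an image, conditioned on encoder features."""
    def __init__(self, dim=256):
        super().__init__()
        self.cond = nn.Linear(dim, 32)
        self.net = nn.Sequential(
            nn.Conv2d(3 + 32, 64, 3, padding=1), nn.GELU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy, feat):
        b, _, h, w = noisy.shape
        # Broadcast the conditioning vector over the spatial grid.
        c = self.cond(feat).view(b, -1, 1, 1).expand(b, -1, h, w)
        return self.net(torch.cat([noisy, c], dim=1))


def diffusion_feedback_loss(encoder, denoiser, images):
    """Noise-prediction loss used as generative feedback for the encoder.

    The denoiser can only recover the noise if the encoder features retain
    enough image detail, so gradients of this loss push the encoder to keep
    that detail -- the intuition stated in the abstract, not the exact method.
    """
    feat = encoder(images)
    t = torch.rand(images.size(0), 1, 1, 1, device=images.device)
    noise = torch.randn_like(images)
    noisy = (1 - t) * images + t * noise  # simple interpolation schedule (assumption)
    pred = denoiser(noisy, feat)
    return F.mse_loss(pred, noise)


if __name__ == "__main__":
    enc, den = TinyImageEncoder(), TinyConditionalDenoiser()
    imgs = torch.randn(4, 3, 32, 32)      # toy image batch
    task_loss = torch.zeros(())           # placeholder for the usual LMM objective
    loss = task_loss + 0.1 * diffusion_feedback_loss(enc, den, imgs)
    loss.backward()                       # gradients reach the image encoder
    print("combined loss:", float(loss))
```

The sketch only illustrates the direction of the gradient signal; the robustness and hallucination results on RobustVQA, MMVP, and POPE reported in the abstract come from the paper's full training setup, not from this toy example.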
