
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models (2402.13607v3)

Published 21 Feb 2024 in cs.CV and cs.CL

Abstract: Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, conducting comprehensive evaluations of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpreted within a broader context. In this work, we introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. Further analysis confirms that these models struggle to effectively extract and utilize contextual information to improve their understanding of images. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner. View our project website at https://thunlp-mt.github.io/CODIS.
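
To make the evaluation setting concrete, the sketch below shows one hypothetical way a context-dependent comprehension check could be scored: each image is paired with two free-form contexts that change the correct answer, and a model is credited only when it answers both variants correctly, so context-insensitive guessing scores nothing. The CodisExample fields, the query-model callable, and the pairwise exact-match metric are illustrative assumptions, not the paper's official data format or protocol.

```python
# Minimal sketch of a context-dependent VQA evaluation in the spirit of CODIS.
# Assumptions (not from the paper): a user-supplied model callable
# model(image_path, context, question) -> answer string, and a simple
# exact-match scorer. The authors' actual data format and judging may differ.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CodisExample:
    """One image paired with two contexts that flip the correct answer."""
    image_path: str
    question: str
    context_a: str
    answer_a: str
    context_b: str
    answer_b: str


def evaluate(model: Callable[[str, str, str], str],
             examples: List[CodisExample]) -> float:
    """Pairwise accuracy: an example counts only if both context variants
    are answered correctly, penalizing models that ignore the context."""
    correct_pairs = 0
    for ex in examples:
        pred_a = model(ex.image_path, ex.context_a, ex.question).strip().lower()
        pred_b = model(ex.image_path, ex.context_b, ex.question).strip().lower()
        if pred_a == ex.answer_a.lower() and pred_b == ex.answer_b.lower():
            correct_pairs += 1
    return correct_pairs / len(examples) if examples else 0.0


if __name__ == "__main__":
    # Toy example: the same breakfast photo is read differently under each context.
    demo = [CodisExample(
        image_path="breakfast.jpg",
        question="Is it likely morning or evening in this scene?",
        context_a="The photo was taken right after the family woke up.",
        answer_a="morning",
        context_b="The family is having breakfast food for dinner.",
        answer_b="evening",
    )]
    dummy_model = lambda img, ctx, q: "morning" if "woke up" in ctx else "evening"
    print(f"pairwise accuracy: {evaluate(dummy_model, demo):.2f}")
```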

Authors (14)
  1. Fuwen Luo
  2. Chi Chen
  3. Zihao Wan
  4. Zhaolu Kang
  5. Qidong Yan
  6. Yingjie Li
  7. Xiaolong Wang
  8. Siyu Wang
  9. Ziyue Wang
  10. Xiaoyue Mi
  11. Peng Li
  12. Ning Ma
  13. Maosong Sun
  14. Yang Liu
Citations (4)