Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space (2402.16832v2)

Published 26 Feb 2024 in cs.CL, cs.AI, and cs.CV

Abstract: Multimodal LLMs (MLLMs) like LLaVA and GPT-4(V) enable general-purpose conversations about images with the language modality. As off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and an LLM. It is desirable to understand the roles of these two modules in modeling domain-specific visual attributes, both to inform the design of future models and to streamline interpretability efforts on current models. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Project webpage: https://claws-lab.github.io/projection-in-MLLMs/
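To make the two-module layout described in the abstract concrete, below is a minimal PyTorch sketch of a LLaVA-style architecture: a cross-modal projection mapping vision-encoder features into the LLM's embedding space, plus a helper that toggles between the two fine-tuning settings (projection-only vs. full fine-tuning). The dimensions, module names, and the `set_finetuning_mode` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of the two-module MLLM architecture:
# a cross-modal projection network feeding a language model.
import torch
import torch.nn as nn


class CrossModalProjection(nn.Module):
    """Two-layer MLP projecting image features into the LLM embedding space,
    in the style of LLaVA-like open-source MLLMs. Dimensions are assumed."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(image_features)  # (batch, num_patches, llm_dim)


def set_finetuning_mode(projection: nn.Module, llm: nn.Module, mode: str) -> None:
    """Toggle the two fine-tuning settings studied in the paper:
    'projection-only' updates just the projection; 'full' also updates the LLM."""
    for p in projection.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (mode == "full")


# Example usage with a placeholder stand-in for the LLM (illustration only):
# proj = CrossModalProjection()
# llm = nn.TransformerDecoderLayer(d_model=4096, nhead=32)
# set_finetuning_mode(proj, llm, mode="projection-only")
```

The paper's finding is that even in the projection-only setting, the domain-specific visual attributes end up being modeled by the LLM rather than extracted by the projection.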

Authors (6)
  1. Gaurav Verma (34 papers)
  2. Minje Choi (13 papers)
  3. Kartik Sharma (18 papers)
  4. Jamelle Watson-Daniels (6 papers)
  5. Sejoon Oh (12 papers)
  6. Srijan Kumar (61 papers)
Citations (7)