Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space (2402.16832v2)
Abstract: Multimodal LLMs (MLLMs) like LLaVA and GPT-4(V) enable general-purpose conversations about images using the language modality. Because off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and an LLM. Understanding the roles these two modules play in modeling domain-specific visual attributes can inform the design of future models and streamline interpretability efforts on current ones. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Project webpage: https://claws-lab.github.io/projection-in-MLLMs/
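To make the architecture and the two fine-tuning settings concrete, here is a minimal PyTorch-style sketch. It is not the authors' code: the class name, dimensions, and the `set_finetuning_mode` helper are illustrative assumptions about a LLaVA-style design, where a projection maps vision-encoder features into the LLM's embedding space and fine-tuning either updates only that projection or the projection together with the LLM.

```python
import torch
import torch.nn as nn

class CrossModalProjection(nn.Module):
    """Maps vision-encoder patch features into the LLM's token-embedding space.
    Dimensions are illustrative (e.g., CLIP ViT-L features -> a 4096-d LLM)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        # returns:      (batch, num_patches, llm_dim), consumed by the LLM
        return self.mlp(vision_feats)

def set_finetuning_mode(projection: nn.Module, llm: nn.Module,
                        mode: str = "projection_only") -> None:
    """Hypothetical helper contrasting the two fine-tuning settings:
    update only the projection, or update both the projection and the LLM."""
    for p in projection.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (mode == "projection_and_llm")
```

The paper's central observation can be read against this sketch: even when `set_finetuning_mode(..., mode="projection_only")` is used and only the projection's weights change, the domain-specific visual attributes end up being modeled by the LLM rather than extracted by the projection.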
- Gaurav Verma
- Minje Choi
- Kartik Sharma
- Jamelle Watson-Daniels
- Sejoon Oh
- Srijan Kumar