Improved Alignment of Modalities in Large Vision Language Models (2503.19508v1)

Published 25 Mar 2025 in cs.CV and cs.LG

Abstract: Recent advances in vision-language models have achieved remarkable results in enabling large language models (LLMs) to understand visual inputs. However, a unified approach to aligning these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large LLMs or very large datasets, making inefficient use of existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models to unify vision-language tasks such as image captioning and visual question answering. We propose four training stages for aligning the vision model with the LLM; in other words, the LLM is given the ability to process visual inputs. We also devise different attention masks for training transformer-based LLMs that improve the quality of visual features. Further, we report several findings: 1) the attention mask should not be applied to visual inputs, 2) the LLM converges faster on AI-generated data, 3) more work should be done in the alignment stage during pre-training, and 4) the model easily adapts to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After training for one epoch across all stages, the model outperforms larger models such as the 13B-parameter VILA on common benchmarks, achieving higher CIDEr scores on the COCO and Flickr30k datasets, and comes very close to GIT-2 on the same datasets despite being a much smaller model trained on a much smaller dataset. All training follows available best practices, including multi-GPU parallel training, lower-precision training with 16-bit floats, faster attention (SDPA), and gradient accumulation, and completes within 12 hours.
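
The first finding, that the attention mask should not be applied to visual inputs, reads as a prefix-style scheme: image tokens attend to each other bidirectionally while text tokens remain causal. Below is a minimal PyTorch sketch of such a mask, assuming the sequence is laid out as [visual tokens | text tokens]; the function name, layout, and dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_attention_mask(num_visual: int, num_text: int, device=None) -> torch.Tensor:
    """Boolean attention mask for a sequence laid out as [visual | text]:
    visual tokens attend bidirectionally (no causal constraint), text tokens
    attend causally to everything before them. True = may be attended to.
    (Assumed layout; the paper may arrange tokens differently.)"""
    total = num_visual + num_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool, device=device))
    # Lift the causal constraint within the visual prefix so image
    # patches can attend to each other in both directions.
    mask[:num_visual, :num_visual] = True
    return mask

# Illustrative usage; batch, heads, and token counts are placeholders.
B, H, V, T, D = 2, 8, 196, 32, 64           # batch, heads, visual, text, head dim
q = k = v = torch.randn(B, H, V + T, D)
attn_mask = hybrid_attention_mask(V, T)      # (V+T, V+T), broadcast over B and H
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

Because the mask is a boolean attn_mask, the same construction drops directly into PyTorch's scaled_dot_product_attention (the SDPA the abstract mentions) without custom kernels.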
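
The closing sentence names the training optimizations; a generic sketch of a 16-bit mixed-precision step with gradient accumulation is shown below. The HuggingFace-style model(inputs, labels=...).loss call and all hyperparameters are assumptions for illustration, and multi-GPU data parallelism (e.g., wrapping the model in DistributedDataParallel) is omitted for brevity.

```python
import torch
from torch import nn

def train_one_epoch(model: nn.Module, loader, optimizer,
                    accum_steps: int = 8, device: str = "cuda"):
    # GradScaler handles fp16 loss scaling to avoid gradient underflow.
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad(set_to_none=True)
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        # Run the forward pass in 16-bit floats ("lower-precision training").
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            # Assumed HF-style interface returning a .loss attribute.
            loss = model(inputs, labels=targets).loss / accum_steps
        scaler.scale(loss).backward()
        # Gradient accumulation: update weights every accum_steps batches,
        # simulating a larger effective batch size.
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```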
