VILA: On Pre-training for Visual Language Models (2312.07533v4)

Published 12 Dec 2023 in cs.CV

Abstract: Visual language models (VLMs) have progressed rapidly with the recent success of LLMs. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but these efforts lack an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing the LLM during pre-training can achieve decent zero-shot performance but sacrifices in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data with image-text data during instruction fine-tuning not only remedies the degradation on text-only tasks, but also boosts VLM task accuracy. With this enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.

Visual Language Model Pre-training

Introduction and Context

Recent AI research has achieved considerable gains by extending LLMs to incorporate visual inputs, producing visual language models (VLMs). These models show promising results in comprehending and generating content that combines text and visual information, a setting commonly referred to as multimodal learning. A critical component in the development of VLMs is the pre-training stage, where the model is trained on a large dataset containing both text and images. However, the specifics of augmenting an LLM with visual capabilities, known as visual language pre-training, have not been deeply explored. This work aims to fill that gap by examining design options for visual language pre-training through controlled, step-by-step comparisons.

Pre-training Factors and Findings

The paper reports three key findings from the augmentation process. First, while freezing the LLM during pre-training can produce acceptable zero-shot results (where the model makes predictions without seeing similar examples), it falls short on tasks that require in-context learning; unfreezing, i.e., updating, the LLM proves crucial. Second, interleaved pre-training data, in which text segments are interspersed with images within the same documents, offers substantial benefits: it provides more accurate gradient updates and helps maintain text-only capabilities. Third, adding text-only instruction data to the image-text data during supervised fine-tuning (SFT) not only remedies the degradation on text-only tasks but also improves accuracy on visual language tasks. These insights inform the design of pre-training recipes for future VLMs.
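
To make these findings concrete, the sketch below expresses them as training-configuration choices in PyTorch-style code. The decomposition into a vision encoder, projector, and LLM, the stage names, and the dataset handles are illustrative assumptions for this summary, not the authors' implementation.

```python
# Minimal sketch of the three findings as configuration choices (assumed
# module split: vision encoder + projector + LLM; not the authors' code).
import torch.nn as nn
from torch.utils.data import ConcatDataset, Dataset


def set_trainable(vision_encoder: nn.Module, projector: nn.Module,
                  llm: nn.Module, stage: str) -> None:
    """Finding 1: a frozen LLM keeps decent zero-shot accuracy but loses
    in-context learning, so the LLM is unfrozen once joint pre-training starts."""
    for p in vision_encoder.parameters():
        p.requires_grad = False                 # vision tower kept frozen here
    for p in projector.parameters():
        p.requires_grad = True                  # projector is always trained
    for p in llm.parameters():
        p.requires_grad = stage in ("pretrain", "sft")  # unfrozen after init


def build_pretrain_mixture(interleaved: Dataset, pairs: Dataset) -> Dataset:
    """Finding 2: interleaved image-text documents form the core corpus;
    caption pairs alone are suboptimal."""
    return ConcatDataset([interleaved, pairs])


def build_sft_mixture(visual_instr: Dataset, text_only_instr: Dataset) -> Dataset:
    """Finding 3: re-blending text-only instruction data during SFT recovers
    text-only performance and also improves visual-language accuracy."""
    return ConcatDataset([visual_instr, text_only_instr])
```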

Training Strategies and Outcomes

The resulting model family, VILA, built with this enhanced pre-training recipe, consistently surpasses state-of-the-art models such as LLaVA-1.5 across various benchmarks. Moreover, VILA exhibits additional capabilities, such as multi-image reasoning and robust in-context learning, even when presented with inputs it has not been explicitly trained on.
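
To illustrate how the multi-image, in-context capability might be exercised at inference time, the following sketch assembles a few-shot prompt that interleaves demonstration images with their questions and answers before the query. The `<image:...>` placeholder convention and the `Example` structure are hypothetical and do not reflect VILA's actual input format.

```python
# Hypothetical construction of an interleaved, multi-image few-shot prompt.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    image_path: str
    question: str
    answer: str


def build_few_shot_prompt(demos: List[Example], query_image: str,
                          query_question: str) -> str:
    """Interleave (image, question, answer) demonstrations before the query so
    the model can reason across multiple images in context."""
    parts = [f"<image:{d.image_path}>\nQ: {d.question}\nA: {d.answer}"
             for d in demos]
    parts.append(f"<image:{query_image}>\nQ: {query_question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [Example("cat.jpg", "What animal is shown?", "A cat."),
             Example("dog.jpg", "What animal is shown?", "A dog.")]
    print(build_few_shot_prompt(demos, "bird.jpg", "What animal is shown?"))
```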

Model Training and Evaluation

VILA is trained in multiple stages: projector initialization, followed by pre-training on visual language corpora, and finally fine-tuning on visual instruction datasets with dataset-specific prompts. Evaluation covers a variety of visual language tasks in both zero-shot and few-shot settings, the latter probing the model's in-context learning capabilities.
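
Read as a pipeline, the stages above can be summarized in a short configuration sketch. The stage names, trainable-component assignments, and data descriptions are paraphrased from this summary and should be treated as assumptions rather than the paper's exact recipe or hyperparameters.

```python
# Hypothetical three-stage training pipeline, paraphrasing the summary above.
STAGES = [
    {   # Stage 0: initialize the projector while the LLM stays frozen
        "name": "projector_init",
        "trainable": ["projector"],
        "data": "image-text caption pairs",
    },
    {   # Stage 1: joint visual-language pre-training with the LLM unfrozen
        "name": "pretrain",
        "trainable": ["projector", "llm"],
        "data": "interleaved image-text corpora",
    },
    {   # Stage 2: supervised fine-tuning with dataset-specific prompts
        "name": "sft",
        "trainable": ["projector", "llm"],
        "data": "visual instruction data blended with text-only instructions",
    },
]

if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage['name']}: trains {', '.join(stage['trainable'])} "
              f"on {stage['data']}")
```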

Conclusion and Future Considerations

The findings of this paper offer a clear pathway toward more effective VLMs by identifying the aspects of visual language pre-training that matter most. The resulting VILA models improve performance across numerous visual language tasks without compromising text-only abilities. Future work could build on these results by exploring additional pre-training datasets, optimizing training throughput, and scaling up the pre-training corpus.

Authors (10)
  1. Ji Lin
  2. Hongxu Yin
  3. Wei Ping
  4. Yao Lu
  5. Pavlo Molchanov
  6. Andrew Tao
  7. Huizi Mao
  8. Jan Kautz
  9. Mohammad Shoeybi
  10. Song Han