Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models (2402.10884v2)

Published 16 Feb 2024 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Multi-modal LLMs (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production. However, the current MLLMs trained with visual-question-answering (VQA) datasets could suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets with which the underlying LLM was trained. To address this degradation, we first collect a lightweight, 5k-sample VQA preference dataset where answers were annotated by Gemini for five quality metrics in a granular fashion and investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO) and SteerLM algorithms. Our findings indicate that with DPO, we can surpass the instruction-following capabilities of the LLM, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99. This enhancement in textual instruction-following capability correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts MLLM's language capability after visual instruction tuning.
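
The abstract's central alignment method is DPO on a small, Gemini-annotated preference set. As a minimal sketch of that objective (not the authors' training code), the snippet below computes the standard DPO loss from precomputed sequence log-probabilities; the function name, tensor shapes, and the beta=0.1 default are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over a batch of preference pairs.

    Each argument has shape (batch,) and holds the summed log-probability
    of the preferred / dispreferred answer under the policy being trained
    or under a frozen reference model (in this paper's setting, presumably
    the visually instruction-tuned MLLM before preference alignment).
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred answer's implicit reward above the dispreferred one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(f"DPO loss: {loss.item():.4f}")
```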

References (40)
  1. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface, 2023. URL https://arxiv.org/abs/2303.17580.
  2. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  3. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  4. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023a.
  5. Mm-react: Prompting chatgpt for multimodal reasoning and action. ArXiv preprint, abs/2303.11381, 2023. URL https://arxiv.org/abs/2303.11381.
  6. Visual chatgpt: Talking, drawing and editing with visual foundation models. ArXiv preprint, abs/2303.04671, 2023. URL https://arxiv.org/abs/2303.04671.
  7. Multimodal-gpt: A vision and language model for dialogue with humans. ArXiv preprint, abs/2305.04790, 2023. URL https://arxiv.org/abs/2305.04790.
  8. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html.
  9. Llama: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971.
  10. Visual instruction tuning, 2023a.
  11. mplug-owl: Modularization empowers large language models with multimodality. ArXiv preprint, abs/2304.14178, 2023a. URL https://arxiv.org/abs/2304.14178.
  12. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  13. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
  14. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  15. Otter: A multi-modal model with in-context instruction tuning. ArXiv preprint, abs/2305.03726, 2023b. URL https://arxiv.org/abs/2305.03726.
  16. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  17. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023b.
  18. Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023. URL https://huggingface.co/Open-Orca/SlimOrca.
  19. Sharegpt4v: Improving large multi-modal models with better captions, 2023.
  20. The false promise of imitating proprietary llms, 2023.
  21. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  22. Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf, 2023.
  23. Aligning large multimodal models with factually augmented rlhf, 2023.
  24. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023b. URL https://arxiv.org/abs/2307.09288.
  25. Zephyr: Direct distillation of lm alignment, 2023.
  26. Ultrafeedback: Boosting language models with high-quality feedback, 2023.
  27. HelpSteer: Multi-attribute helpfulness dataset for SteerLM. ArXiv preprint, abs/2311.09528, 2023. URL https://arxiv.org/abs/2311.09528.
  28. SciGraphQA: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. ArXiv preprint, abs/2308.03349, 2023. URL https://arxiv.org/abs/2308.03349.
  29. Aligning large multi-modal model with robust instruction tuning. ArXiv preprint, abs/2306.14565, 2023b. URL https://arxiv.org/abs/2306.14565.
  30. Gemini: a family of highly capable multimodal models. ArXiv preprint, abs/2312.11805, 2023. URL https://arxiv.org/abs/2312.11805.
  31. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  32. OpenCompass, 2023. URL https://opencompass.org.cn/leaderboard-multimodal. [Online; accessed 24. Jan. 2024].
  33. Constitutional ai: Harmlessness from ai feedback, 2022.
  34. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
  35. Evaluating object hallucination in large vision-language models, 2023c.
  36. Mmbench: Is your multi-modal model an all-around player?, 2023c.
  37. Rome: Evaluating pre-trained vision-language models on reasoning beyond visual common sense. ArXiv preprint, abs/2310.19301, 2023a. URL https://arxiv.org/abs/2310.19301.
  38. LIMA: Less is more for alignment. ArXiv preprint, abs/2305.11206, 2023b. URL https://arxiv.org/abs/2305.11206.
  39. A few more examples may be worth billions of parameters. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1017–1029, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.findings-emnlp.72.
  40. Mitigating hallucination in large multi-modal models via robust instruction tuning. ArXiv preprint, abs/2306.14565, 2023d. URL https://arxiv.org/abs/2306.14565.
Authors (3)
  1. Shengzhi Li (4 papers)
  2. Rongyu Lin (4 papers)
  3. Shichao Pei (14 papers)
Citations (19)