RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness (2405.17220v1)
Abstract: Learning from feedback reduces the hallucination of multimodal LLMs (MLLMs) by aligning them with human preferences. While traditional methods rely on labor-intensive and time-consuming manual labeling, recent approaches employing models as automatic labelers have shown promising results without human intervention. However, these methods heavily rely on costly proprietary models like GPT-4V, resulting in scalability issues. Moreover, this paradigm essentially distills proprietary models, offering only a temporary solution for quickly bridging the performance gap. As this gap continues to shrink, the community will soon face the essential challenge of aligning MLLMs using labeler models of comparable capability. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness. RLAIF-V maximally exploits open-source feedback from two perspectives: high-quality feedback data and an online feedback learning algorithm. Extensive experiments on seven benchmarks, in both automatic and human evaluation, show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks. Using a 34B model as labeler, the RLAIF-V 7B model reduces object hallucination by 82.9% and overall hallucination by 42.1%, outperforming the labeler model. Remarkably, RLAIF-V also reveals the self-alignment potential of open-source MLLMs, where a 12B model can learn from its own feedback to achieve an overall hallucination rate below 29.5%, surpassing GPT-4V (45.9%) by a large margin. The results shed light on a promising route to enhance the efficacy of leading-edge MLLMs.
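The abstract mentions an "online feedback learning algorithm" that trains on preference pairs labeled by an open-source MLLM. A common objective for this kind of preference learning is direct preference optimization (DPO); the sketch below is a minimal, hedged illustration of a DPO-style loss over AI-labeled (chosen, rejected) response pairs, not the paper's actual implementation. The function name, the toy inputs, and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss over (chosen, rejected) response pairs.

    Inputs are 1-D tensors of summed response log-probabilities, one entry
    per preference pair; beta controls how far the policy may drift from
    the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy vs. reference model for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: push the AI-preferred response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random numbers standing in for real model log-probabilities.
pairs = 4
print(dpo_loss(torch.randn(pairs), torch.randn(pairs),
               torch.randn(pairs), torch.randn(pairs)))
```

In an online variant, the preference pairs would be regenerated from the current policy and re-labeled by the open-source feedback model at each iteration rather than fixed in advance.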
Authors: Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun