
FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback (2404.05046v1)

Published 7 Apr 2024 in cs.CV and cs.CL

Abstract: Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between the text and image modalities, which causes three kinds of hallucination problems: object existence, object attribute, and object relationship hallucinations. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) general feedback cannot indicate the type of hallucination contained in the response; (2) sparse rewards only give a sequence-level reward for the whole response; and (3) annotation is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through Fine-Grained Artificial Intelligence Feedback (FGAIF), which consists of three steps: AI-based Feedback Collection, Fine-grained Reward Model Training, and Reinforcement Learning with Fine-grained Rewards. Specifically, we first utilize AI tools to predict the type of hallucination in each segment of the response and obtain a collection of fine-grained feedback. Then, based on the collected reward data, three specialized reward models are trained to produce dense rewards. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm. Extensive experiments on hallucination and general benchmarks demonstrate the superior performance of our proposed method. Notably, compared with previous models trained with RL-based alignment methods, our method is effective even with fewer parameters.
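
The core mechanism is the shift from one scalar reward for the whole response to segment-level, type-specific rewards that PPO can consume as a dense signal. Below is a minimal Python sketch of that idea, assuming the response has already been split into segments and each segment scored by the three reward models; the `SegmentScores` container, the equal weighting, and the ±1 score convention are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of FGAIF-style dense reward assignment (illustrative, not the
# paper's API): per-segment scores from three hypothetical reward models
# (object existence, attribute, relationship) are spread onto tokens so
# PPO receives a dense rather than sequence-level reward.

from dataclasses import dataclass
from typing import List


@dataclass
class SegmentScores:
    """Per-segment outputs of the three fine-grained reward models.

    By the assumed convention, a score is +1.0 if the segment is judged
    faithful for that hallucination type and -1.0 if hallucinated.
    """
    token_span: range   # token indices covered by this segment
    existence: float    # object-existence reward
    attribute: float    # object-attribute reward
    relationship: float # object-relationship reward


def dense_rewards(
    segments: List[SegmentScores],
    num_tokens: int,
    weights: tuple = (1.0, 1.0, 1.0),  # assumed equal weighting
) -> List[float]:
    """Spread segment-level rewards onto tokens for PPO.

    Every token inside a hallucinated segment is penalized, localizing
    the learning signal instead of diluting it over the whole response.
    """
    w_e, w_a, w_r = weights
    rewards = [0.0] * num_tokens
    for seg in segments:
        r = w_e * seg.existence + w_a * seg.attribute + w_r * seg.relationship
        for t in seg.token_span:
            rewards[t] = r
    return rewards


if __name__ == "__main__":
    # Two segments: the first faithful, the second with an attribute error.
    segs = [
        SegmentScores(range(0, 5), existence=1.0, attribute=1.0, relationship=1.0),
        SegmentScores(range(5, 9), existence=1.0, attribute=-1.0, relationship=1.0),
    ]
    print(dense_rewards(segs, num_tokens=9))
    # -> [3.0, 3.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0]
```

In the paper's pipeline, rewards of this dense, per-segment form replace the single sequence-level reward inside PPO, which is exactly the sparsity limitation the abstract calls out.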

Authors (2)
  1. Liqiang Jing (21 papers)
  2. Xinya Du (41 papers)
Citations (9)