
Aligning Large Multimodal Models with Factually Augmented RLHF

(arXiv 2309.14525)
Published Sep 25, 2023 in cs.CV and cs.CL

Abstract

Large Multimodal Models (LMMs) are built across modalities, and misalignment between the two modalities can result in "hallucination": generating textual output that is not grounded in the multimodal information in context. To address this multimodal misalignment issue, we adapt Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment: human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF, which augments the reward model with additional factual information such as image captions and ground-truth multi-choice options; this alleviates reward hacking in RLHF and further improves performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark, MMHAL-BENCH, with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves a remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4 (whereas previous best methods reach only 87%), and a 60% improvement on MMHAL-BENCH over other baselines. We open-source our code, model, and data at https://llava-rlhf.github.io.
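The core of the approach is a reward model trained on pairwise human preferences, with the twist that its input is augmented with factual context about the image. Below is a minimal sketch in PyTorch of what that training objective can look like; it assumes a `reward_model` callable that maps a formatted (image, question, facts, response) input to a scalar score, and the input template and names are illustrative assumptions, not the paper's released implementation.

    import torch.nn.functional as F

    def build_reward_input(image_tokens, question, response, facts):
        # Append factual context (e.g., a ground-truth caption or the
        # correct multi-choice option) to the standard reward-model input,
        # so the reward model can check the response against facts rather
        # than scoring on fluency alone. The template is a hypothetical
        # illustration of the factual augmentation.
        return (f"{image_tokens}\n"
                f"Question: {question}\n"
                f"Facts: {facts}\n"
                f"Response: {response}")

    def preference_loss(reward_model, batch):
        # Standard Bradley-Terry pairwise objective: the response that
        # annotators judged less hallucinated should receive the higher
        # scalar reward.
        r_chosen = reward_model(build_reward_input(
            batch["image"], batch["question"], batch["chosen"], batch["facts"]))
        r_rejected = reward_model(build_reward_input(
            batch["image"], batch["question"], batch["rejected"], batch["facts"]))
        return -F.logsigmoid(r_chosen - r_rejected).mean()

The policy model is then fine-tuned with PPO to maximize the scores this reward model assigns, which is what the abstract means by maximizing simulated human rewards.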

