Aligning Large Multimodal Models with Factually Augmented RLHF (2309.14525)
Published 25 Sep 2023 in cs.CV and cs.CL

Overview

  • The paper addresses multimodal misalignment in Large Multimodal Models (LMMs) by adapting Reinforcement Learning from Human Feedback (RLHF) to enhance vision-language alignment.

  • The proposed Factually Augmented RLHF improves alignment by integrating factual data and human feedback into the LMM training process, thereby reducing hallucinated outputs.

  • The effectiveness of this approach is confirmed on a new evaluation benchmark, MMHal-Bench, with the model showing significant improvements over existing methods and reaching 94% of the performance of text-only GPT-4 on the LLaVA-Bench dataset.

Aligning Large Multimodal Models with Factually Augmented RLHF: A Technical Perspective

The paper "Aligning Large Multimodal Models with Factually Augmented RLHF" addresses the critical issue of multimodal misalignment in Large Multimodal Models° (LMMs°) that can lead to hallucinatory outputs—this refers to generating text that is not grounded in the corresponding multimodal data. The research adapts Reinforcement Learning from Human Feedback° (RLHF°), traditionally used in text domains, to enhance vision-language alignment° in LMMs.

Problem Context and Significance

Large Multimodal Models have shown promise in integrating and interpreting data across different modalities, such as text and images. However, a significant challenge lies in the alignment of these modalities, as misalignment can lead to incorrect or "hallucinated" responses. Unlike the abundance of high-quality data available for training text-only models, the multimodal setting lacks comparably comprehensive datasets, leaving a gap that conventional supervised learning struggles to bridge.

Methodological Innovations

  1. Factually Augmented RLHF: The paper introduces an alignment algorithm called Factually Augmented RLHF. This method strengthens the reward model by giving it additional factual data, such as image captions and ground-truth multiple-choice answers, thereby mitigating reward hacking, where a model receives high scores for outputs that do not actually reflect human judgments (a sketch of this input construction follows the list).
  2. Human Feedback Integration: Human annotators compare model responses and prefer the less hallucinated one, and the model is trained to optimize these preferences. This feedback loop incrementally steers the model away from purely synthetic supervision and toward responses grounded in real-world perception.
  3. GPT-4 Enhanced Training Data: Another contribution is the augmentation of GPT-4-generated training data with previously available human-written image-text pairs. This hybrid training improves the model's baseline capabilities, combining the breadth of GPT-4's generations with the nuance of human-authored data.
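
To make the factual augmentation in item 1 concrete, the sketch below shows one plausible way to assemble the reward model's input; the prompt template and function name are hypothetical illustrations of the idea, not the paper's exact format.

```python
# Sketch of the factual-augmentation idea: the reward model is shown
# ground-truth facts about the image (e.g., captions or ground-truth
# multiple-choice answers) alongside the response it must score, so that
# truthfulness, not surface plausibility, drives the reward. The template
# below is an assumed format capturing the spirit of the method.
def build_reward_input(question: str, response: str, facts: list[str]) -> str:
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"Ground-truth information about the image:\n{fact_block}\n\n"
        f"Question: {question}\n"
        f"Response to evaluate: {response}"
    )

# The augmented text (together with the image) is then scored by the reward
# model, which is trained with the same pairwise preference loss as above.
print(build_reward_input(
    "What is the man holding?",
    "The man is holding a red umbrella.",
    facts=["A man holds a black umbrella on a rainy street."],
))
```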

Evaluation and Results

The paper also introduces a new evaluation benchmark, MMHal-Bench, which specifically assesses and penalizes hallucinations, ensuring that models are evaluated on their ability to stay grounded in the image. The RLHF-trained LMM demonstrated considerable improvements, reaching 94% of the performance of text-only GPT-4 on the LLaVA-Bench dataset and surpassing prior methods, which reached only 87%. The approach also marked a 60% improvement over other baselines on MMHal-Bench, demonstrating its effectiveness in hallucination-prone scenarios.
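
In MMHal-Bench, each model response is rated by a strong LLM judge, and responses judged untruthful count as hallucinations. The following is a minimal sketch of turning such per-question ratings into the benchmark's two summary numbers; the exact hallucination cutoff is a stated assumption.

```python
# Minimal sketch of aggregating per-question judge ratings into two
# MMHal-Bench-style summary metrics: an average score and a hallucination
# rate. A 0-6 rating scale is assumed here; treating ratings below 3 as
# hallucinated is likewise an assumption about the exact cutoff.
def summarize(ratings: list[int]) -> tuple[float, float]:
    average = sum(ratings) / len(ratings)
    hallucination_rate = sum(r < 3 for r in ratings) / len(ratings)
    return average, hallucination_rate

avg, rate = summarize([6, 5, 2, 0, 4, 3])
print(f"average score = {avg:.2f}, hallucination rate = {rate:.2f}")
```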

Implications and Future Directions

Practically, factually consistent LMMs could benefit fields requiring precise multimodal interaction, such as autonomous driving, medical imaging, and human-computer interaction. Theoretically, this work deepens our understanding of cross-modal alignment, facilitating machine learning models that behave more reliably across diverse data types.

For future work, scaling the RLHF paradigm to more sophisticated architectures and to more dynamic, environment-adaptive multimodal interactions could be pivotal. Investigations could also extend beyond images and text, integrating auditory or other sensory inputs to build more comprehensive models of perception.

Overall, this work represents a rigorous advance in the alignment of LMMs, providing a foundation for future research on bridging multimodal AI with human-grounded reality. The open-sourcing of the model and data further underscores a commitment to collaborative progress in AI research.

Authors (12)
  1. Zhiqing Sun
  2. Sheng Shen
  3. Shengcao Cao
  4. Haotian Liu
  5. Chunyuan Li
  6. Yikang Shen
  7. Chuang Gan
  8. Liang-Yan Gui
  9. Yu-Xiong Wang
  10. Yiming Yang
  11. Kurt Keutzer
  12. Trevor Darrell