Assessment of Multimodal Large Language Models in Alignment with Human Values (2403.17830v1)
Abstract: Large language models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, despite their commendable performance on perception and reasoning tasks, the alignment of Multimodal LLMs (MLLMs) with human values remains largely unexplored, given the complexity of defining hhh dimensions in the visual world and the difficulty of collecting relevant data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. The Ch3Ef dataset contains 1,002 human-annotated samples covering 12 domains and 46 tasks grounded in the hhh principle. We also present a unified evaluation strategy that supports assessment across various scenarios and from different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities and limitations, as well as the dynamic relationships between evaluation levels, guiding future advances in the field.
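To make the shape of such an evaluation concrete, below is a minimal sketch of how a human-annotated benchmark of this kind is typically scored: iterate over the samples, query the model, and aggregate accuracy per domain. The file layout, the field names (`image`, `question`, `options`, `answer`, `domain`), the `model.predict` interface, and the exact-match scoring rule are all illustrative assumptions, not the paper's actual unified evaluation strategy.

```python
# Minimal sketch of a per-domain scoring loop for an hhh-style benchmark.
# Field names, the model interface, and the exact-match rule are assumed
# for illustration; they are not Ch3Ef's actual evaluation protocol.
import json
from collections import defaultdict

def evaluate(model, samples_path: str) -> dict:
    """Score a model on annotated samples, reporting accuracy per domain."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(samples_path) as f:
        samples = json.load(f)
    for sample in samples:
        # Query the model with the image and the question plus its options;
        # `model` is assumed to expose a simple `predict` method.
        prediction = model.predict(
            image=sample["image"],
            question=sample["question"],
            options=sample["options"],
        )
        total[sample["domain"]] += 1
        if prediction.strip() == sample["answer"].strip():
            correct[sample["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}
```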
Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao