Overview
The survey "Aligning Multimodal LLM with Human Preference: A Survey" (Yu et al., 18 Mar 2025 ) provides a systematic review of current alignment strategies for Multimodal LLMs (MLLMs). The work explores the intricacies of aligning multimodal architectures with human preferences by evaluating a broad spectrum of alignment algorithms, dataset construction paradigms, benchmark frameworks, and emerging challenges. The analysis primarily addresses three dimensions: the alignment algorithms tailored to varying application scenarios, the underpinning loss functions and training regimes, and comprehensive evaluations using domain-specific benchmarks.
Methodologies and Alignment Algorithms
The survey emphasizes a multi-stage training pipeline for MLLMs comprising the following stages (a minimal data-flow sketch follows the list):
- Pre-Training: Visual and textual representations are brought into a shared space using large-scale image-caption data. Simple image-caption pairs predominate, but the quality of this cross-modal alignment remains critical for later stages.
- Instruction Tuning (SFT): Heterogeneous dialogue and instruction datasets are used to instill the instruction-following behavior on which subsequent human-facing interaction depends.
- Human Preference Alignment: A specialized stage addressing safety, hallucination mitigation, and preference consistency, often employing Reinforcement Learning (RL) strategies and Direct Preference Optimization (DPO) variants.
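As a rough illustration of how these stages fit together, the sketch below lays out the kind of data each stage consumes and the order in which the stages run. The record and function names are illustrative assumptions, not artifacts defined in the survey.

```python
# Illustrative sketch of the three-stage MLLM training pipeline.
# All names below are assumptions for exposition, not survey-defined APIs.
from dataclasses import dataclass

@dataclass
class CaptionPair:            # Pre-training: cross-modal alignment data
    image_path: str
    caption: str

@dataclass
class InstructionExample:     # SFT: instruction-following dialogues
    image_path: str
    instruction: str
    response: str

@dataclass
class PreferencePair:         # Preference alignment: chosen vs. rejected answers
    image_path: str
    prompt: str
    chosen: str               # response preferred by annotators or a reward model
    rejected: str             # dispreferred response (e.g., hallucinated)

def pretrain(model, caption_data):            # placeholder for a captioning objective
    return model

def supervised_finetune(model, instruction_data):  # placeholder for instruction-tuning loss
    return model

def align_preferences(model, preference_data):     # placeholder for RLHF / DPO-style training
    return model

def training_pipeline(model, captions, instructions, preferences):
    """Run the three stages in order; each stage consumes a different data schema."""
    model = pretrain(model, captions)
    model = supervised_finetune(model, instructions)
    model = align_preferences(model, preferences)
    return model
```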
Within the alignment phase, the survey categorizes algorithms into three primary frameworks:
- General Image Understanding: Techniques such as Fact-RLHF, mDPO, and variants like HA-DPO are detailed. These methods emphasize hallucination mitigation and more coherent, integrated responses across conversational and reasoning tasks. Their loss functions are predominantly adaptations of the standard DPO loss augmented with modality-specific penalties.
- Multi-Image, Video, and Audio Modalities: This category extends single-image alignment methods to more complex structures. Notable methods include MIA-DPO for multi-image contexts and approaches like LLaVA-NEXT-Interleave and PPLLaVA for video data. Audio-visual tasks are addressed by methodologies such as Video-SALMONN 2, while audio-text interactions are managed via frameworks like SQuBa.
- Extended Multimodal Applications: The survey also presents alignment strategies tailored to domain-specific tasks (e.g., medical imaging with 3D-CT-GPT++, mathematical problem solving via MAVIS, and embodied intelligence with INTERACTIVECOT). These approaches often require custom loss functions and domain-adapted preference annotations.
The survey summarizes a diverse set of alignment loss functions (e.g., visual DPO, sentence-level DPO) in tabular form, enabling comparative analysis of performance and optimization trade-offs. The formulation of these losses is central to both the clarity of the reward signal and stable convergence during training.
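Since many of these losses are adaptations of the standard DPO objective (augmented, per the survey, with modality-specific penalties or applied at different granularities), a minimal PyTorch sketch of that base objective is given below; the tensor shapes and the default beta value are assumptions.

```python
# Minimal sketch of the standard DPO objective that the surveyed multimodal
# variants build on by modifying how the log-probabilities are computed.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each input is the summed log-probability of a full response (chosen or
    rejected) under the trained policy or the frozen reference model, shape (batch,).
    """
    # Implicit reward margins: how much more the policy favors each response
    # than the reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(margin)): push the chosen reward above the rejected reward.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Usage with dummy summed log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
```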
Benchmarking and Evaluation Metrics
The evaluation framework is a critical component of the survey. The benchmarks are organized into six distinct dimensions:
- General Knowledge: Datasets such as MME-RealWorld, MMStar, and MMBench assess fundamental understanding and knowledge-grounding capabilities. Quantitative performance on these benchmarks is essential for analyzing how well alignment methods scale.
- Hallucination Detection: With benchmarks including Object HalBench, VideoHallucer, and HallusionBench, the survey quantifies mismatch rates between generated outputs and factual information (an illustrative hallucination-rate computation is sketched below). Results on this dimension illustrate the persistent challenge of hallucination in multimodal contexts.
- Safety: Evaluation with datasets such as AdvDiffVLM and VLLM-safety-bench probes a model's ability to avoid harmful outputs. Models are compared on how reliably they handle sensitive content and mitigate risk.
- Conversational Adequacy: Benchmarks such as Q-Bench and LLDescribe assess the fidelity and relevance of generated responses to interactive tasks.
- Reward Modeling: Metrics derived from M-RewardBench and VL-RewardBench gauge the performance and reliability of the internal reward systems driving preference alignment.
- Overall Alignment: Specialized suites like AlignBench and MM-AlignBench give a consolidated view of how well models adhere to human preference annotations, with quantitative results highlighting areas for improvement.
These benchmarks not only provide quantitative measures but also reveal qualitative distinctions between different alignment algorithms. The extensive evaluation across multiple modalities underscores the challenges inherent to scalable alignment systems.
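To make the hallucination dimension concrete, the sketch below computes a CHAIR-style object hallucination rate: the fraction of objects mentioned in a generated response that are absent from the image's annotated object set. This is an illustrative simplification, not the exact protocol of Object HalBench or the other benchmarks named above.

```python
# Simplified, CHAIR-style object hallucination rate. Real benchmarks also
# handle synonyms, plural forms, and object extraction from free-form text.
from typing import Iterable, Set

def object_hallucination_rate(generated_objects: Iterable[str],
                              ground_truth_objects: Set[str]) -> float:
    """Fraction of mentioned objects that are absent from the annotated image."""
    mentioned = [obj.lower() for obj in generated_objects]
    if not mentioned:
        return 0.0
    truth = {obj.lower() for obj in ground_truth_objects}
    hallucinated = [obj for obj in mentioned if obj not in truth]
    return len(hallucinated) / len(mentioned)

# Example: the response mentions a "dog" that is not annotated in the image.
rate = object_hallucination_rate(["person", "bicycle", "dog"], {"person", "bicycle"})
print(f"hallucination rate: {rate:.2f}")  # 0.33
```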
Future Directions and Open Challenges
Several critical directions for future research are identified:
- Data Challenges: The scarcity of high-quality, multimodal alignment datasets remains a significant bottleneck. Future efforts must focus on balancing data diversity and annotation quality while considering the cost implications.
- Optimizing Visual Information Usage: The survey calls for methodologies that exploit visual data beyond rudimentary captioning. Current practices, such as using corrupted images as negative samples or cosine-similarity signals derived from CLIP embeddings, are highlighted alongside their limitations (see the sketch after this list).
- Comprehensive and Multi-Dimensional Evaluation: The survey advocates evaluating alignment performance across broader, more diverse task benchmarks so that results better reflect the generalizability of alignment strategies.
- Beyond Image/Text Modalities: Integrating modalities like audio, video, and even more complex sensory data will require novel loss formulations and architecture adjustments.
- Enhanced MLLM Reasoning: Drawing parallels with recent advancements in LLM reasoning, the survey emphasizes multi-stage optimization techniques, advanced RL strategies, and online sampling mechanisms. Such modifications are imperative to address overoptimization and reward-hacking issues.
- MLLMs as Autonomous Agents: The research underscores the potential of transforming MLLMs into robust agents, capable of managing multi-agent dynamics, enhanced collaboration, and adversarial robustness in open environments.
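As a concrete example of the CLIP cosine-similarity signal mentioned under visual information usage, the sketch below scores candidate responses against an input image. It assumes the Hugging Face transformers CLIP interface; the checkpoint name and the idea of treating the lowest-scoring candidate as a rejected sample are illustrative, not methods prescribed by the survey, and they inherit the limitations the survey notes.

```python
# Score candidate responses against an image with CLIP cosine similarity.
# Assumes the Hugging Face `transformers` CLIP interface; checkpoint is illustrative.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image: Image.Image, candidates: list[str]) -> torch.Tensor:
    """Cosine similarity between the image embedding and each candidate text."""
    inputs = processor(text=candidates, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = F.normalize(outputs.image_embeds, dim=-1)   # (1, d)
    text_embs = F.normalize(outputs.text_embeds, dim=-1)    # (n, d)
    return (text_embs @ image_emb.T).squeeze(-1)            # (n,)

# Example use: the lowest-scoring candidate could serve as a "rejected" response
# when constructing preference pairs, subject to CLIP's known limitations.
# image = Image.open("example.jpg")
# scores = clip_scores(image, ["a dog on a beach", "a cat on a sofa"])
```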
Concluding Remarks
The survey provides a detailed landscape of the state of the art in aligning MLLMs with human preference. By systematically categorizing current methods, evaluation benchmarks, and emerging challenges, it offers a comprehensive resource for researchers aiming to improve multimodal systems. With quantitative evaluations spanning safety, hallucination mitigation, and overall alignment quality, the work sets a clear agenda for future innovation. Researchers are encouraged to explore integrated multimodal loss frameworks and robust dataset curation practices to further narrow the gap between model outputs and human expectations.