Analyzing HermesFlow: Enhancing Multimodal LLMs
This paper introduces HermesFlow, a framework designed to close the gap between the understanding and generation capabilities of Multimodal LLMs (MLLMs). Existing MLLMs such as Show-o, Transfusion, and Emu3 are highly proficient at understanding tasks but often perform comparatively weakly at generation. The authors propose HermesFlow to address this imbalance and enhance the overall capabilities of MLLMs.
Insightful Observations
The authors identify a consistent phenomenon across several models, including VILA-U, Janus, and Show-o: understanding capability outperforms generation capability. This understanding-generation gap impedes the balanced functioning of these models. Importantly, it is not resolved simply by adding more training data; more sophisticated alignment strategies are required. The paper therefore argues for a structured approach that aligns understanding and generation using homologous preference data.
Approach and Methodology
HermesFlow implements a Pair-DPO (Direct Preference Optimization) framework that operates on homologous input data, capturing understanding and generation preferences together. The framework then runs several rounds of self-play iterative optimization, progressively refining the MLLM until the gap between understanding and generation is significantly narrowed.
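The shape of the Pair-DPO objective can be illustrated with a minimal sketch. The function names, the mixing weight `alpha`, and the scalar log-probability inputs are illustrative assumptions, not the paper's exact implementation; each term is a standard DPO loss on one preference pair, and Pair-DPO blends an understanding-side term with a generation-side term derived from the same homologous input:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def pair_dpo_loss(und_pair, gen_pair, alpha=0.5, beta=0.1):
    """Blend an understanding-side and a generation-side DPO term computed
    on preference pairs curated from the same (homologous) input.
    alpha is an assumed mixing weight, not a value from the paper."""
    return (alpha * dpo_loss(*und_pair, beta=beta)
            + (1.0 - alpha) * dpo_loss(*gen_pair, beta=beta))

# When the policy matches the reference on both pairs, each margin is 0
# and each term reduces to -log(0.5) = log 2.
baseline = pair_dpo_loss((0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
```

As in standard DPO, the loss falls below the `log 2` baseline once the policy's margin on a preferred output exceeds the reference model's margin.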
The curation of homologous preference data begins with the MLLM generating candidate captions and images, followed by selecting preferred outcomes based on predefined criteria such as BERT similarity scores and self-VQA (Visual Question Answering) scores. This process ensures that understanding and generation are optimized concurrently, with insights from each domain benefiting the other.
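The selection step described above amounts to ranking candidates by a quality score and keeping the best and worst as a preference pair. In the sketch below, the function name is an illustrative assumption and the scorer is a stand-in; an actual pipeline would plug in a BERT-similarity scorer for captions or a self-VQA scorer for generated images:

```python
def select_preference_pair(candidates, score_fn):
    """Rank candidate outputs (captions or images) by a quality score and
    return the best as 'preferred' and the worst as 'dispreferred'.
    score_fn stands in for e.g. BERT similarity or a self-VQA score."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[0], ranked[-1]

# Toy usage: score captions by length as a placeholder for BERT similarity.
captions = ["a dog", "a brown dog running on grass", "a brown dog"]
preferred, dispreferred = select_preference_pair(captions, score_fn=len)
```

The resulting (preferred, dispreferred) pairs on the understanding and generation sides of the same input are what feed the Pair-DPO training stage.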
Empirical Evidence
The efficacy of HermesFlow is supported by comprehensive experiments. The approach improves over prior systems across a range of metrics, covering understanding benchmarks such as POPE and MME and generation benchmarks such as GenEval. Quantitative comparisons show that HermesFlow substantially narrows the understanding-generation gap: measured at 0.087 for Show-o, the gap drops to 0.036 under HermesFlow.
Implications and Future Work
HermesFlow holds promise not only as a framework for enhancing current MLLMs but also as a foundational alignment strategy for future multimodal models, addressing the limitations of improving understanding or generation in isolation. By maintaining balance and fostering synergy between understanding and generation, HermesFlow could play a crucial role in the development of more holistic MLLMs.
The authors also acknowledge limitations, such as the need to validate the framework across a broader set of backbone models, and suggest directions for future research. Extending HermesFlow to more MLLMs, more diverse data types, and additional problem formulations could further improve the framework's generality and effectiveness.
In summary, HermesFlow presents an effective alignment framework that promises to alleviate the imbalance between multimodal understanding and generation, offering substantial contributions to the field of AI and multimodal technologies.