LLaVA-Critic: Learning to Evaluate Multimodal Models (2410.02712v1)

Published 3 Oct 2024 in cs.CV and cs.CL

Abstract: We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.

Overview of LLaVA-Critic: Evaluating Multimodal Models

The paper introduces LLaVA-Critic, an open-source large multimodal model (LMM) designed specifically to evaluate model performance across a wide range of multimodal tasks. Despite the growing sophistication of LMMs, the field had lacked an open, generalist evaluator until this contribution. LLaVA-Critic serves two purposes: as a judge, it provides evaluation scores comparable to proprietary models such as GPT-4V; and for preference learning, it generates reward signals used to align LMMs.

Key Contributions

  1. Critic Instruction-Following Dataset: The work presents an extensive dataset totaling 113k samples, curated to enhance the model's ability to follow evaluation instructions. The dataset spans diverse evaluation scenarios, covering both pointwise scoring and pairwise ranking; each sample pairs an image and instruction with candidate model responses, evaluation criteria, scores, and justifications (see the sketch after this list).
  2. Model Development: LLaVA-Critic builds on existing strong models from the LLaVA-OneVision suite. It demonstrates improved alignment with GPT-4o, offering reliable judgements and consistently outperforming baseline models across multiple benchmarks.
  3. Open-Source Accessibility: All datasets, codebases, and trained model checkpoints are released to facilitate further research and development of generalized visual assistants.
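
As referenced in the dataset description above, here is a minimal sketch of what individual training samples might look like. The layout and field names are illustrative assumptions, not the released schema:

```python
# Hypothetical layout of critic training samples; field names are
# illustrative assumptions, not the released dataset schema.
pointwise_sample = {
    "image": "images/000123.jpg",              # visual input
    "question": "Describe the image in detail.",
    "response": "A dog runs across a grassy field...",
    "criteria": "Rate helpfulness, accuracy, and level of detail (1-10).",
    "score": 7,
    "justification": "Accurate and fluent, but omits the background scenery.",
}

pairwise_sample = {
    "image": "images/000123.jpg",
    "question": "Describe the image in detail.",
    "response_a": "A dog runs across a grassy field...",
    "response_b": "There is an animal outside.",
    "preference": "A",                         # which response is judged better
    "justification": "Response A covers more visual detail without errors.",
}
```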

Experimental Insights

In-Domain Pointwise Scoring

In experiments comparing LLaVA-Critic with GPT-4o across seven multimodal benchmarks, LLaVA-Critic aligns closely with GPT-4o in both instance-level scoring and model-level ranking. The use of high-quality evaluation datasets accounts for its accuracy and consistency. The model demonstrates significant improvements over both LLaVA-NeXT and LLaVA-OneVision baselines.
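
As a rough illustration of how score alignment of this kind is commonly quantified (an assumption about standard practice, not code from the paper), one can correlate the critic's per-instance scores with GPT-4o's and compare the induced model-level rankings:

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

# Hypothetical scores: rows = evaluated models, columns = benchmark instances.
critic_scores = np.array([[7, 6, 8], [5, 5, 6], [9, 8, 8]], dtype=float)
gpt4o_scores  = np.array([[7, 7, 8], [4, 5, 6], [9, 9, 7]], dtype=float)

# Instance-level agreement: correlate the flattened per-sample scores.
r, _ = pearsonr(critic_scores.ravel(), gpt4o_scores.ravel())

# Model-level agreement: rank models by mean score, then compare rankings.
tau, _ = kendalltau(critic_scores.mean(axis=1), gpt4o_scores.mean(axis=1))

print(f"instance-level Pearson r = {r:.3f}, model-level Kendall tau = {tau:.3f}")
```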

In-Domain Pairwise Ranking

Utilizing human-annotated data from the WildVision Arena, LLaVA-Critic closely matches human preferences, particularly with its 72B variant. This reflects its robust capability to effectively rank model responses in complex, real-world scenarios.
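
A hedged sketch of how agreement with human preferences on such pairwise comparisons could be computed (the actual WildVision evaluation protocol may differ):

```python
# Hypothetical pairwise verdicts ("A", "B", or "tie") from the critic and
# from human annotators on the same WildVision-style comparisons.
critic_verdicts = ["A", "B", "A", "tie", "B"]
human_verdicts  = ["A", "B", "B", "tie", "B"]

agreement = sum(c == h for c, h in zip(critic_verdicts, human_verdicts)) / len(human_verdicts)
print(f"Pairwise agreement with human preferences: {agreement:.1%}")
```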

Out-of-Domain Evaluation

On the MLLM-as-a-Judge benchmark, LLaVA-Critic shows comparable performance to commercial models (e.g., GPT-4V), demonstrating its robustness and generalizability across unseen tasks and prompts. Notably, LLaVA-Critic's capacity is significantly enhanced by both data and model scaling.

Application in Preference Learning

LLaVA-Critic contributes to preference learning by serving as a reward signal generator in the iterative Direct Preference Optimization (DPO) process. It yields noteworthy improvements in the base model's visual chat capabilities, establishing it as a non-proprietary alternative for aligning LMMs with human preferences.
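
A minimal sketch of how a critic can drive one round of iterative DPO data construction, assuming hypothetical `policy.generate`, `critic.score`, and `dpo_update` interfaces (the paper's exact training recipe may differ):

```python
def build_preference_pairs(policy, critic, prompts, k=5):
    """Construct DPO preference pairs by scoring policy samples with the critic."""
    pairs = []
    for image, question in prompts:
        # Sample k candidate responses from the current policy.
        responses = [policy.generate(image, question) for _ in range(k)]
        # Score each candidate with the critic (pointwise reward signal),
        # then take the best-scored as "chosen" and worst as "rejected".
        scored = sorted(responses, key=lambda r: critic.score(image, question, r))
        pairs.append({"image": image, "question": question,
                      "chosen": scored[-1], "rejected": scored[0]})
    return pairs

# Iterative DPO: rebuild preference pairs from the improved policy each round.
# for _ in range(num_rounds):
#     pairs = build_preference_pairs(policy, critic, prompts)
#     policy = dpo_update(policy, pairs)
```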

Implications and Future Directions

The research underscores a shift towards utilizing open-source models for evaluation, thereby reducing reliance on proprietary systems. It highlights the potential for creating scalable self-critique mechanisms in AI, enabling multimodal models to offer superhuman alignment feedback. Moving forward, this opens new avenues for refining LMMs with improved interpretability and transparency, aligning them more closely with human evaluators.

In sum, this work marks a significant stride in multimodal model evaluation, backed by a strong combination of open-source accessibility, comprehensive datasets, and a robust model architecture. The implications for future development in AI alignment and multimodal model assessment are substantial, presenting both practical applications and theoretical advancements.

Authors (8)
  1. Tianyi Xiong
  2. Xiyao Wang
  3. Dong Guo
  4. Qinghao Ye
  5. Haoqi Fan
  6. Quanquan Gu
  7. Heng Huang
  8. Chunyuan Li