
Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation (2401.06591v1)

Published 12 Jan 2024 in cs.CL

Abstract: Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging. It not only requires checking whether the VLM follows the given instruction but also verifying whether the text output is properly grounded on the given image. Inspired by the recent approach of evaluating LMs with LMs, in this work, we propose to evaluate VLMs with VLMs. For this purpose, we present a new feedback dataset called the Perception Collection, encompassing 15K customized score rubrics that users might care about during assessment. Using the Perception Collection, we train Prometheus-Vision, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation. Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models, showing its effectiveness for transparent and accessible evaluation of VLMs. We open-source our code, dataset, and model at https://github.com/kaistAI/prometheus-vision

Introduction to Automated VLM Evaluation

Evaluating the performance of Vision-Language Models (VLMs) is a complex task. It stretches beyond mere text generation, demanding output that is not only fluent but also properly grounded in the given image. Because VLMs are relatively new, traditional metrics often fall short, missing nuanced aspects such as the interplay between visual content and generated text. Existing qualitative approaches, while useful, face scalability issues: they are often costly and prone to human bias.

The Concept of VLM-as-a-Judge

A common solution in the literature has been the 'LM-as-a-Judge' paradigm, in which a language model (LM) estimates the quality of another LM's output. For VLMs, however, there is a hitch: the judge needs an additional captioning model to translate visual information into text before evaluation can take place. To avoid this added complexity and the error propagation it invites, the authors propose using VLMs themselves as judges. This approach directly leverages a VLM's inherent ability to parse visual data, enabling a more streamlined and accurate assessment process.
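
As a rough illustration, a judge of this kind is prompted with the instruction, the candidate response, a reference answer, and a user-defined score rubric, and is expected to produce written feedback followed by an integer score. The template and the "[RESULT]" marker below are assumptions for illustration only, not the verbatim format used by Prometheus-Vision.

```python
import re

def build_judge_prompt(instruction, response, reference, rubric):
    """Assemble a rubric-grounded evaluation prompt.

    Illustrative template only; the real Prometheus-Vision prompt format
    may differ (see the project repository for the actual templates).
    """
    return (
        "You are an evaluator. Given an image, an instruction, a response, "
        "a reference answer, and a score rubric, write feedback and then a "
        "score from 1 to 5 in the form '[RESULT] <score>'.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response to evaluate:\n{response}\n\n"
        f"### Reference answer (score 5):\n{reference}\n\n"
        f"### Score rubric:\n{rubric}\n\n"
        "### Feedback:"
    )

def parse_score(judge_output):
    """Pull the integer score out of the judge's free-form feedback."""
    match = re.search(r"\[RESULT\]\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None
```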

Introducing Prometheus-Vision

Addressing this gap, the paper introduces Prometheus-Vision, an open-source 13B-parameter VLM designed specifically for evaluation. It is trained on a newly curated dataset, the Perception Collection, which contains 15,000 fine-grained score rubrics reflecting user-defined assessment criteria. This training sets the model apart, enabling it to judge outputs against detailed, custom criteria while providing specific written feedback on where a response falls short. The model aligns closely with human judgments and surpasses other open-source evaluators on several benchmarks.
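
For intuition, one Perception Collection training instance can be pictured as an image paired with an instruction, a candidate response, a custom rubric, a reference answer, and feedback plus a 1-5 score. The field names below are assumptions for illustration; consult the released dataset for the actual schema.

```python
# Hypothetical shape of one Perception Collection instance (field names assumed;
# see github.com/kaistAI/prometheus-vision for the real released schema).
perception_instance = {
    "image": "path/or/url/to/image.jpg",
    "instruction": "Describe what makes this street scene unusual.",
    "response": "A candidate VLM answer to be scored.",
    "reference_answer": "A high-quality answer that would earn a score of 5.",
    "rubric": {
        "criteria": "Is the description grounded in details visible in the image?",
        "score_descriptions": {
            1: "Largely ungrounded or contradicts the image.",
            5: "Every claim is supported by visible evidence.",
        },
    },
    "feedback": "Critique explaining the assigned score.",
    "score": 4,
}
```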

Empirical Results and Considerations

In testing, Prometheus-Vision exhibits a high correlation with human evaluators, particularly on benchmarks featuring diverse real-world images. It also competes well with closed-source counterparts such as GPT-4V, providing an accessible alternative for transparent VLM evaluation. Notably, it shows potential as a critique tool for assisting human assessment, producing feedback judged on par with, and in some cases superior to, that of proprietary models.
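
The reported alignment numbers are Pearson correlations between judge-assigned scores and reference scores (human annotators or GPT-4V). Reproducing such a comparison from two lists of scores is straightforward; the snippet below is a generic sketch with made-up numbers, not the paper's evaluation harness.

```python
from scipy.stats import pearsonr

# Toy example: 1-5 scores assigned by a VLM judge vs. human annotators
# on the same set of responses (values are illustrative only).
judge_scores = [4, 3, 5, 2, 4, 1, 5, 3]
human_scores = [5, 3, 4, 2, 4, 1, 5, 2]

r, p_value = pearsonr(judge_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```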

Despite its strengths, Prometheus-Vision is not without limitations. Its results indicate room for improvement on text-rich images such as charts and diagrams, suggesting that future versions built on more capable visual encoders could improve its efficacy. The paper also acknowledges that the training data skews toward real-world imagery over text-heavy graphics and identifies enriching the dataset with such content as a promising direction for future work.

Concluding Remarks

The research presents a significant contribution to the field with its open-source VLM evaluator, Prometheus-Vision. Instrumental in shaping the future trajectory of fine-grained VLM assessments, the model and its training dataset, Perception Collection, signal a shift toward more nuanced, user-centric evaluation methods. The authors encourage further exploration into multi-modal feedback datasets, aiming to broaden the scope and capabilities of VLM evaluators in various contexts, potentially even venturing into evaluations of AI-generated imagery.

Authors
  1. Seongyun Lee
  2. Seungone Kim
  3. Sue Hyun Park
  4. Geewook Kim
  5. Minjoon Seo