
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

Published 23 Oct 2024 in cs.CV, cs.AI, and cs.CL | arXiv:2410.18071v1

Abstract: Multimodal LLMs (MLLMs) have recently received much attention for their impressive capabilities, and evaluating them has become critical for analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations can lead to significant performance fluctuations. Inappropriate prompts may thus obscure a model's capabilities and understate its performance. Moreover, different models prefer different prompts, so using the same prompt for all models introduces evaluation bias. This paper analyzes this deficiency in existing benchmarks and introduces a new evaluation framework named TP-Eval, which uses prompt customization to reduce evaluation bias and tap models' potential. TP-Eval rewrites the original prompts into customized prompts for each model. In particular, we propose well-designed modules for prompt customization tailored to the scenario of MLLM evaluation. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should help the community develop more comprehensive and convincing MLLM evaluation benchmarks.
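The abstract describes per-model prompt customization only at a high level. The following is a minimal sketch of the general idea, assuming a greedy search in which an optimizer proposes rewrites of a benchmark's original prompt and the variant on which the target MLLM scores best is kept. All names here (customize_prompt, propose_rewrites, score_on_dev_split) are hypothetical illustrations, not the paper's actual modules or API.

from typing import Callable, List

def customize_prompt(
    original_prompt: str,
    propose_rewrites: Callable[[str, int], List[str]],  # e.g. an optimizer LLM (hypothetical)
    score_on_dev_split: Callable[[str], float],          # target MLLM's accuracy on a held-out split
    rounds: int = 3,
    candidates_per_round: int = 4,
) -> str:
    """Greedily search for a prompt variant the target model answers best."""
    best_prompt = original_prompt
    best_score = score_on_dev_split(original_prompt)
    for _ in range(rounds):
        for candidate in propose_rewrites(best_prompt, candidates_per_round):
            score = score_on_dev_split(candidate)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt

In a setup like the one the abstract suggests, propose_rewrites could be backed by an LLM-based prompt optimizer and score_on_dev_split by the target model's accuracy on a small development subset of the benchmark, with each model evaluated under its own customized prompt rather than a single shared one.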

