TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Abstract: Multimodal LLMs (MLLMs) have recently received much attention for their impressive capabilities, and evaluating them is becoming critical for analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations can cause significant performance fluctuations. Inappropriate prompts may therefore obscure a model's capabilities and underestimate its performance. Moreover, different models prefer different prompts, so using the same prompt for all models introduces evaluation bias. This paper analyzes this deficiency in existing benchmarks and introduces TP-Eval, a new evaluation framework with a prompt customization method that reduces evaluation bias and taps models' potential. TP-Eval rewrites the original prompts into customized prompts for each model, using modules designed specifically for the MLLM evaluation scenario. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
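The per-model customization described above can be pictured as a search over prompt rewrites scored against a validation split. The following is a minimal sketch of that general idea only, not the paper's actual modules: `evaluate` (a model's benchmark accuracy on a prompt) and `propose` (an LLM-based rewriter) are hypothetical placeholders introduced here for illustration.

```python
def customize_prompt(base_prompt, evaluate, propose, iterations=5, beam=3):
    """Greedy prompt search: keep the best-scoring rewrite each round.

    evaluate(prompt) -> float: the target model's accuracy on a
        validation split when queried with this prompt (placeholder).
    propose(prompt) -> list[str]: candidate rewrites of the prompt,
        e.g. generated by an optimizer LLM (placeholder).
    """
    best_prompt, best_score = base_prompt, evaluate(base_prompt)
    for _ in range(iterations):
        # Score up to `beam` candidate rewrites; adopt any improvement.
        for candidate in propose(best_prompt)[:beam]:
            score = evaluate(candidate)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score


if __name__ == "__main__":
    # Toy demo: "accuracy" rewards longer prompts up to 40 characters,
    # and the rewriter simply appends a formatting instruction.
    evaluate = lambda p: min(len(p), 40) / 40.0
    propose = lambda p: [p + " Answer with a single letter."]
    prompt, score = customize_prompt("Choose the correct option.", evaluate, propose)
    print(prompt, score)
```

Because the search is run separately per model, each model ends up with its own customized prompt, which is the source of the bias reduction the abstract describes.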