
Improving Automatic VQA Evaluation Using Large Language Models

(arXiv:2310.02567)
Published Oct 4, 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned LLMs to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
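
For context, the abstract contrasts the proposed LLM-based metric with the standard VQA Accuracy rule, which credits a candidate answer only in proportion to exact matches against the human reference answers (min(#matches / 3, 1)). The sketch below is a minimal illustration of both sides, not the authors' released code: the exact-match accuracy rule, and the kind of answer-rating prompt the abstract describes, in which an instruction-tuned LLM is asked to score a candidate answer given the question and reference answers. The prompt wording, the 1-5 score scale, and the `query_llm` helper are illustrative assumptions, not the paper's exact template.

```python
# Minimal sketch (assumptions noted below), contrasting the standard VQA Accuracy
# rule with an LLM answer-rating prompt of the kind the abstract describes.

def vqa_accuracy(candidate: str, references: list[str]) -> float:
    """Standard VQA Accuracy: min(# exactly matching human answers / 3, 1)."""
    matches = sum(r.strip().lower() == candidate.strip().lower() for r in references)
    return min(matches / 3.0, 1.0)


def build_rating_prompt(question: str, references: list[str], candidate: str) -> str:
    """Format an instruction asking an LLM to score a candidate answer against
    the reference answers (the answer-rating formulation). The wording and the
    1-5 scale are illustrative assumptions, not the authors' template."""
    refs = ", ".join(f"'{r}'" for r in references)
    return (
        "Rate how accurately the candidate answer responds to the question, "
        "given the reference answers. Reply with a single score from "
        "1 (wrong) to 5 (fully correct).\n"
        f"Question: {question}\n"
        f"Reference answers: {refs}\n"
        f"Candidate answer: {candidate}\n"
        "Score:"
    )


if __name__ == "__main__":
    question = "What sport is being played?"
    references = ["tennis", "tennis", "tennis", "lawn tennis"]
    candidate = "they are playing tennis"

    # Exact matching scores this correct but verbose answer 0.0, which is the
    # kind of underestimation the abstract argues an LLM-based metric avoids.
    print(vqa_accuracy(candidate, references))

    prompt = build_rating_prompt(question, references, candidate)
    print(prompt)
    # score = query_llm(prompt)  # hypothetical call to an instruction-tuned LLM
```

The key design point from the abstract is that the LLM sees the question, the full set of reference answers, and the candidate answer in context, so it can credit paraphrases and open-ended generations that string matching would reject.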

