CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering (2401.13170v4)

Published 24 Jan 2024 in cs.CL

Abstract: Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics for determining answer equivalence (AE) often do not align with human judgments, particularly for the more verbose, free-form answers produced by large language models (LLMs). There are two challenges: a lack of data, and models that are too big. LLM-based scorers can correlate better with human judges, but this task has been tested only on limited QA datasets, and even where data are available, updating the scorer is constrained because LLMs are large and often expensive. We rectify both of these issues by providing clear and consistent guidelines for evaluating AE in machine QA, adopted from professional human QA contests. We also introduce a combination of standard evaluation and a more efficient, robust, and lightweight discriminative AE classifier-based matching method (CFMatch, smaller than 1 MB), trained and validated to evaluate answer correctness more accurately, in accordance with adopted expert AE rules that are better aligned with human judgments.
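
The paper's released CFMatch classifier is not reproduced here, but the two-stage idea the abstract describes can be illustrated with a minimal sketch: apply cheap rule-based exact matching first, and fall back to a small learned answer-equivalence classifier only when exact match fails. The normalization scheme, lexical features, and toy training pairs below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' CFMatch code) of a two-stage answer-equivalence (AE)
# check: cheap rule-based exact matching first, then a lightweight learned classifier
# for the cases exact match cannot settle. Features and toy data are illustrative.
import re
import string
from sklearn.linear_model import LogisticRegression

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def features(candidate: str, reference: str) -> list[float]:
    """Cheap lexical features for the lightweight AE classifier (illustrative)."""
    c, r = set(normalize(candidate).split()), set(normalize(reference).split())
    overlap = len(c & r) / max(len(r), 1)        # recall of reference tokens
    precision = len(c & r) / max(len(c), 1)      # precision of candidate tokens
    len_ratio = min(len(c), len(r)) / max(len(c), len(r), 1)
    contains = float(normalize(reference) in normalize(candidate))
    return [overlap, precision, len_ratio, contains]

# Toy labeled pairs (candidate, reference, equivalent?) -- stand-ins for the
# expert-annotated AE data a real classifier would be trained on.
train = [
    ("Barack Obama", "Obama", 1),
    ("the 44th president, Barack Obama", "Barack Obama", 1),
    ("George W. Bush", "Barack Obama", 0),
    ("Paris, France", "Paris", 1),
    ("Lyon", "Paris", 0),
]
clf = LogisticRegression().fit(
    [features(c, r) for c, r, _ in train], [y for _, _, y in train]
)

def is_correct(candidate: str, reference: str) -> bool:
    """Exact match after normalization; otherwise fall back to the classifier."""
    if normalize(candidate) == normalize(reference):
        return True
    return bool(clf.predict([features(candidate, reference)])[0])

print(is_correct("Barack H. Obama", "Obama"))  # not an exact match; classifier decides
```

In practice the learned component would be trained on AE pairs annotated under the adopted contest guidelines, which is what allows it to stay small (under 1 MB) while correlating with expert judgments better than token-overlap metrics and more cheaply than LLM-based scorers.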

Authors (5)
  1. Zongxia Li (14 papers)
  2. Ishani Mondal (23 papers)
  3. Yijun Liang (5 papers)
  4. Huy Nghiem (9 papers)
  5. Jordan Boyd-Graber (68 papers)