Evaluating Open-Domain Question Answering in the Era of Large Language Models (2305.06984v3)

Published 11 May 2023 in cs.CL

Abstract: Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of LLMs for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistent with human judgments, although still suffering from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle in detecting hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.

This paper investigates the efficacy of lexical matching as an evaluation metric for open-domain question answering (QA) systems, particularly in the context of LLMs. The authors argue that lexical matching, the standard evaluation method, fails to accurately assess model performance because it requires the predicted answer to appear among the gold answers. This is problematic because the set of gold answers is often incomplete, and LLMs frequently generate plausible yet non-identical answers. The authors conduct a manual evaluation of several open-domain QA models, including LLMs, on a subset of the NQ-open benchmark and compare the results with lexical matching, a semantic similarity model (BEM), and a zero-shot evaluation method using InstructGPT.
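To make the failure mode concrete, here is a minimal sketch of SQuAD-style exact-match scoring in Python (an illustration of the general protocol, not the authors' evaluation code). A correct but paraphrased or long-form answer is scored as wrong whenever it does not appear verbatim, after normalization, in the gold answer set.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """A candidate counts as correct only if it equals some gold answer after normalization."""
    return any(normalize(candidate) == normalize(gold) for gold in gold_answers)

print(exact_match("William Shakespeare", ["William Shakespeare"]))   # True
# A semantically correct but longer generative answer is scored as wrong:
print(exact_match("It was written by William Shakespeare.",
                  ["William Shakespeare"]))                          # False
```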

The paper's primary contributions and findings are as follows:

  • Limitations of Lexical Matching: Lexical matching significantly underestimates the true performance of open-domain QA models. The authors observe a large performance gap between lexical matching and human evaluation, with the performance of InstructGPT (zero-shot) increasing by nearly +60% when evaluated by humans.
  • Semantic Equivalence: The majority of lexical matching failures are due to semantic equivalence, where the model's answer is semantically similar to a correct answer but not lexically identical. This includes synonymous answers, elaborations, and tokenization mismatches.
  • Human Evaluation: Human evaluation is essential for accurately assessing open-domain QA models, particularly LLMs, due to their ability to generate long-form, plausible but sometimes incorrect answers.
  • Automated Evaluation Models: Semantic similarity models like BEM show some improvement over lexical matching, particularly in cases where answers are semantically equivalent but not lexically identical. However, BEM still underestimates the performance of models.
  • LLM Evaluation: The authors also explore using LLMs to evaluate QA models via a zero-shot prompting method (InstructGPT-eval). The results are promising, showing good agreement with human evaluation, but the method is prone to misjudging hallucinated long-form answers generated by LLMs. GPT4-eval is tested as well and exhibits similar error patterns to InstructGPT-eval, with only marginal improvement.
  • Regex Matching: Regular expression matching, which is used to evaluate models on the CuratedTREC dataset, is more robust than exact match but still suffers from unnecessary strictness (see the sketch after this list).
  • CuratedTREC 2002 Analysis: The authors also perform experiments on the CuratedTREC 2002 dataset. The results indicate that regex matching, BEM, and InstructGPT-eval produce results that are mostly consistent with human judgments, although they still underestimate true model performance. Notably, only under human evaluation do LLMs surpass the best traditional statistical NLP systems of that era.
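To illustrate the regex-based protocol, the following is a minimal sketch (the gold pattern shown is hypothetical, not taken from the dataset): a candidate is accepted if any gold regular-expression pattern matches it, which tolerates surrounding elaboration but still rejects valid paraphrases the pattern's author did not anticipate.

```python
import re

def regex_match(candidate: str, gold_patterns: list[str]) -> bool:
    """Accept the candidate if any gold regex pattern matches it.
    Case-insensitive search is used here; exact matching conventions vary."""
    return any(re.search(p, candidate, flags=re.IGNORECASE) for p in gold_patterns)

# Hypothetical gold pattern for illustration.
patterns = [r"(William )?Shakespeare"]
print(regex_match("The play was written by William Shakespeare in 1603.", patterns))  # True
print(regex_match("The Bard of Avon wrote it.", patterns))                            # False
```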

The models used in the paper were divided into retriever-reader models (DPR, FiD, ANCE, Contriever, RocketQAv2, FiD-KD, GAR, and R2-D2), end-to-end models (EMDR2 and EviGen), and closed-book models (InstructGPT zero-shot and few-shot). The evaluation datasets included a subset of NQ-open (301 questions randomly sampled from the 3,610 test questions) and the CuratedTREC 2002 dataset.

The evaluation strategies consisted of:

  • Lexical Matching: Exact match (EM) and F1 score.
  • Supervised Evaluation via Semantic Similarity: Using BEM to classify whether candidate answers are semantically equivalent to the gold answers.
  • Zero-shot Evaluation via Prompting: Using InstructGPT and GPT-4 to evaluate answers by prompting the LLM to determine whether a candidate answer is correct given the question and the gold answers (a sketch of this setup follows this list).
  • Human Evaluation: Two human annotators independently judge the correctness of the generated answers, with a third annotator resolving disagreements.
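The zero-shot prompting strategy can be sketched roughly as follows. The prompt wording and the llm_complete helper below are illustrative assumptions, not the paper's exact template or API; the evaluator LLM is simply asked to judge the candidate answer given the question and the gold answers.

```python
from typing import Callable

# Illustrative prompt; the paper's exact template may differ.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Gold answers: {gold}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct given the question and the gold answers? "
    "Answer yes or no."
)

def llm_judge(question: str, gold_answers: list[str], candidate: str,
              llm_complete: Callable[[str], str]) -> bool:
    """Ask an instruction-following LLM whether the candidate answer is correct.
    `llm_complete` is a hypothetical placeholder for whatever completion API is used."""
    prompt = PROMPT_TEMPLATE.format(
        question=question, gold=" | ".join(gold_answers), candidate=candidate
    )
    verdict = llm_complete(prompt).strip().lower()
    return verdict.startswith("yes")
```

As the paper notes, such a judge agrees well with humans on short answers but tends to accept hallucinated long-form answers, which is why it cannot yet replace human evaluation of LLM outputs.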

The paper also provides a detailed linguistic analysis of the discrepancies between lexical matching and human judgment, categorizing the failure modes of lexical matching into semantic equivalence, symbolic equivalence, intrinsic ambiguity in questions, granularity discrepancies, list-style questions, and incorrect gold answers.

The paper concludes that while automated evaluation methods, such as BEM and LLM-based evaluation, can serve as a reasonable surrogate for lexical matching in some circumstances, they still fall short of the accuracy of human evaluation, particularly for long-form answers generated by LLMs. The authors emphasize the need for more robust evaluation techniques for open-domain QA, especially with the increasing prominence of LLMs.

Authors (4)
  1. Ehsan Kamalloo
  2. Nouha Dziri
  3. Charles L. A. Clarke
  4. Davood Rafiei