This paper investigates the efficacy of lexical matching as an evaluation metric for open-domain question answering (QA) systems, particularly in the context of LLMs. The authors argue that lexical matching, the standard evaluation method, fails to accurately assess model performance because it requires the predicted answer to match a gold answer at the string level. This is problematic because the set of gold answers is often incomplete, and LLMs frequently generate plausible yet non-identical answers. The authors conduct a manual evaluation of several open-domain QA models, including LLMs, on a subset of the NQ-open benchmark and compare the results against lexical matching, a semantic similarity model (BEM), and a zero-shot evaluation method using InstructGPT.
The paper's primary contributions and findings are as follows:
- Limitations of Lexical Matching: Lexical matching significantly underestimates the true performance of open-domain QA models. The authors observe a large performance gap between lexical matching and human evaluation, with the performance of InstructGPT (zero-shot) increasing by nearly +60% when evaluated by humans.
- Semantic Equivalence: The majority of lexical matching failures are due to semantic equivalence, where the model's answer is semantically similar to a correct answer but not lexically identical. This includes synonymous answers, elaborations, and tokenization mismatches.
- Human Evaluation: Human evaluation is essential for accurately assessing open-domain QA models, particularly LLMs, due to their ability to generate long-form, plausible but sometimes incorrect answers.
- Automated Evaluation Models: Semantic similarity models like BEM show some improvement over lexical matching, particularly in cases where answers are semantically equivalent but not lexically identical. However, BEM still underestimates the performance of models.
- LLM Evaluation: The authors explore using LLMs to evaluate QA models via a zero-shot prompting method (InstructGPT-eval). The approach shows good agreement with human evaluation, but it is prone to misjudging hallucinated long-form answers generated by LLMs. GPT4-eval is also tested and exhibits similar error patterns to InstructGPT-eval, with only marginal improvements.
- Regex Matching: Regular expression matching, used to evaluate models on the CuratedTREC dataset, is more robust than exact match but still suffers from unnecessary strictness (see the sketch after this list).
- CuratedTREC 2002 Analysis: The authors also run experiments on the CuratedTREC 2002 dataset. Regex matching, BEM, and InstructGPT-eval produce results that are largely consistent with human judgements, although they still underestimate true model performance. Moreover, only under human evaluation do the LLMs surpass the best traditional statistical NLP systems of that era.
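To make the regex-based protocol concrete, below is a minimal sketch of how a CuratedTREC-style judgement could be computed; the gold pattern and candidate strings are illustrative and not drawn from the dataset. The second example shows how a correct but rephrased answer can still fail the pattern, which is the "unnecessary strictness" the authors point to.

```python
import re

def regex_match(candidate: str, gold_patterns: list[str]) -> bool:
    """Return True if any gold regex pattern matches the candidate answer."""
    return any(re.search(pattern, candidate, flags=re.IGNORECASE)
               for pattern in gold_patterns)

# Illustrative example: the pattern tolerates some surface variation,
# but a paraphrased or indirect answer can still fail to match.
gold_patterns = [r"\bMount\s+Everest\b"]
print(regex_match("The tallest mountain is Mount Everest.", gold_patterns))  # True
print(regex_match("It is Everest, in the Himalayas.", gold_patterns))        # False
```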
The models evaluated in the paper are divided into retriever-reader models (DPR, FiD, ANCE, Contriever, RocketQAv2, FiD-KD, GAR, and R2-D2), end-to-end models (EMDR2 and EviGen), and closed-book models (InstructGPT zero-shot and few-shot). The evaluation data consist of a subset of NQ-open (301 questions randomly sampled from the 3,610 test questions) and the CuratedTREC 2002 dataset.
The evaluation strategies consisted of:
- Lexical Matching: Exact match (EM) and F1 score (both sketched after this list).
- Supervised Evaluation via Semantic Similarity: Using BEM to classify whether candidate answers are semantically equivalent to the gold answers (sketched below).
- Zero-shot Evaluation via Prompting: Prompting InstructGPT and GPT-4 to judge whether a candidate answer is correct given the question and the gold answers (sketched below).
- Human Evaluation: Two human annotators independently judge the correctness of the generated answers, with a third annotator resolving disagreements.
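For reference, here is a minimal sketch of SQuAD-style exact match and token-level F1. The normalization step (lowercasing, stripping punctuation and articles) follows the common convention and is assumed rather than quoted from the paper.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """EM: the normalized candidate must equal at least one normalized gold answer."""
    return any(normalize(candidate) == normalize(gold) for gold in gold_answers)

def f1(candidate: str, gold: str) -> float:
    """Token-level F1 between a candidate and a single gold answer."""
    cand_tokens = normalize(candidate).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(cand_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def max_f1(candidate: str, gold_answers: list[str]) -> float:
    """Score against the best-matching gold answer (gold_answers assumed non-empty)."""
    return max(f1(candidate, gold) for gold in gold_answers)
```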
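For the supervised route, BEM is a BERT-based classifier over (question, reference answer, candidate answer) triples. The sketch below only illustrates the idea of such a classifier: the checkpoint name, input packing, and label layout are placeholders, not BEM's actual interface or the authors' exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder identifier for a BEM-style answer-equivalence classifier;
# the real BEM model is distributed separately and has its own input format.
CHECKPOINT = "bem-style-answer-equivalence"  # hypothetical

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def bem_equivalent(question: str, gold: str, candidate: str, threshold: float = 0.5) -> bool:
    """Predict whether the candidate is semantically equivalent to the gold answer."""
    # Packing the triple into two text slots is an assumption for illustration.
    inputs = tokenizer(f"question: {question} reference: {gold}",
                       f"candidate: {candidate}",
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes a two-label head where index 1 means "equivalent".
    prob_equivalent = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_equivalent >= threshold
```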
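Finally, the zero-shot prompting evaluation asks an instruction-following LLM to judge a candidate against the question and gold answers. The prompt below paraphrases the idea rather than reproducing the paper's exact wording, and `call_llm` is a stand-in for whichever completion API is used.

```python
def build_eval_prompt(question: str, gold_answers: list[str], candidate: str) -> str:
    """Zero-shot prompt asking the evaluator LLM to judge answer correctness."""
    golds = "; ".join(gold_answers)
    return (
        f"Question: {question}\n"
        f"Gold answers: {golds}\n"
        f"Candidate answer: {candidate}\n"
        "Is the candidate answer correct given the question and the gold answers? "
        "Answer 'yes' or 'no'."
    )

def llm_judgement(question: str, gold_answers: list[str], candidate: str, call_llm) -> bool:
    """call_llm is a stand-in for a completion API (e.g., an InstructGPT-class model)."""
    response = call_llm(build_eval_prompt(question, gold_answers, candidate))
    return response.strip().lower().startswith("yes")
```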
The paper also provides a detailed linguistic analysis of the discrepancies between lexical matching and human judgment, categorizing the failure modes of lexical matching into semantic equivalence, symbolic equivalence, intrinsic ambiguity in questions, granularity discrepancies, list-style questions, and incorrect gold answers.
The paper concludes that while automated evaluation methods, such as BEM and LLM-based evaluation, can serve as a reasonable surrogate for lexical matching in some circumstances, they still fall short of the accuracy of human evaluation, particularly for long-form answers generated by LLMs. The authors emphasize the need for more robust evaluation techniques for open-domain QA, especially with the increasing prominence of LLMs.