An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments (2402.00309v1)

Published 1 Feb 2024 in cs.IR

Abstract: Current IR evaluation is based on relevance judgments, created either manually or automatically, with decisions outsourced to LLMs. We offer an alternative paradigm that never relies on relevance judgments in any form. Instead, a text is defined as relevant if it contains information that enables the answering of key questions. We use this idea to design the EXAM Answerability Metric to evaluate information retrieval/generation systems for their ability to provide topically relevant information. We envision the role of a human judge as editing and defining an exam question bank that will test for the presence of relevant information in text. We support this step by generating an initial set of exam questions. In the next phase, an LLM-based question answering system will automatically grade system responses by tracking which exam questions are answerable with which system responses. We propose two evaluation measures: the recall-oriented EXAM Cover metric and the precision-oriented EXAM Qrels metric, the latter of which can be implemented with trec_eval. This paradigm not only allows for the expansion of the exam question set post-hoc but also facilitates the ongoing evaluation of future information systems, whether they focus on retrieval, generation, or both.
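
The two measures can be sketched directly from the abstract. The Python below is a minimal illustration only, not the authors' implementation: the is_answerable predicate is a hypothetical stand-in for the LLM-based question answering system, and the cutoff k is an assumed parameter. EXAM Cover counts the fraction of exam questions answerable from a system's top-ranked passages (recall-oriented), while EXAM Qrels grades each passage by how many exam questions it answers, yielding labels that could be written out in TREC qrels format and scored with trec_eval (precision-oriented).

```python
# Illustrative sketch of the EXAM-style measures described in the abstract.
# Names (is_answerable, exam_cover, exam_qrels) are assumptions, not the
# authors' reference implementation.

from typing import Callable, Dict, List


def exam_cover(questions: List[str],
               ranked_passages: List[str],
               is_answerable: Callable[[str, str], bool],
               k: int = 20) -> float:
    """Recall-oriented EXAM Cover: fraction of exam questions answerable
    from at least one of the system's top-k passages."""
    top_k = ranked_passages[:k]
    covered = sum(
        1 for q in questions
        if any(is_answerable(q, passage) for passage in top_k)
    )
    return covered / len(questions) if questions else 0.0


def exam_qrels(questions: List[str],
               passages: Dict[str, str],
               is_answerable: Callable[[str, str], bool]) -> Dict[str, int]:
    """Precision-oriented EXAM Qrels: grade each passage by the number of
    exam questions it answers. The grades can be emitted as standard
    'topic Q0 passage_id grade' qrels lines for trec_eval."""
    return {
        pid: sum(1 for q in questions if is_answerable(q, text))
        for pid, text in passages.items()
    }
```

Under this reading, the qrels-style grades slot into the existing trec_eval toolchain: one qrels line per passage, with the answerability count as the relevance grade, so standard precision-oriented metrics can be computed without any manual relevance judgments.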
