A Comparison of Methods for Evaluating Generative IR (2404.04044v2)

Published 5 Apr 2024 in cs.IR

Abstract: Information retrieval systems increasingly incorporate generative components. For example, in a retrieval augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, an LLM might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human assessors, but increasingly LLMs are replacing human assessment, demonstrating capabilities similar or superior to crowdsourced labels. Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels. Along with methods based on binary and graded relevance, we explore methods based on explicit subtopics, pairwise preferences, and embeddings. We first validate these methods against human assessments on several TREC Deep Learning Track tasks; we then apply these methods to evaluate the output of several purely generative systems. For each method we consider both its ability to act autonomously, without the need for human labels or other input, and its ability to support human auditing. To trust these methods, we must be assured that their results align with human assessments. In order to do so, evaluation criteria must be transparent, so that outcomes can be audited by human assessors.

Evaluative Methods for Generative Information Retrieval Systems

Introduction

The increasing integration of generative components in information retrieval (IR) systems necessitates a reevaluation of traditional offline evaluation methods. Gen-IR systems, characterized by their ability to produce responses not confined to a pre-existing corpus, present unique challenges for evaluation. This paper investigates methods that extend traditional offline IR evaluation to the Gen-IR context, with an emphasis on using LLMs as assessors.

Methods Explored

The exploration covers five distinct methods, each assessed for its ability to operate autonomously and its capacity to support human auditing:

  • Binary Relevance: Prompts an LLM to judge each query/response pair as relevant or not, yielding labels that human assessors can audit directly (a minimal sketch appears after this list).
  • Graded Relevance: Extends binary relevance to multiple relevance grades, at the cost of calibrating human and LLM assessors to a shared interpretation of those grades.
  • Subtopic Relevance: Uses LLM-generated subtopics to decompose relevance judgments, adding detail to assessments while striking a strong balance between autonomy and auditability.
  • Pairwise Preferences: Compares two responses directly, better capturing fine-grained differences between them, but requires an exemplar response for comparison.
  • Embeddings: Scores a generated response by the cosine similarity between its embedding and that of an exemplar; the scores are not directly auditable but align well with human assessments in comparative settings (also illustrated in the sketch after this list).
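
To make the contrast between the auditable, label-based methods and the embedding-based method concrete, the sketch below pairs a minimal binary-relevance judgment with a cosine-similarity comparison against an exemplar. It is an illustrative sketch only: `call_llm` and `embed` are hypothetical stand-ins for whichever LLM and embedding model an evaluator chooses, and the prompt wording is not the paper's exact template.

```python
import numpy as np

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM assessor."""
    raise NotImplementedError("plug in a model client here")

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a sentence-embedding model."""
    raise NotImplementedError("plug in an embedding model here")

def binary_relevance(query: str, response: str) -> int:
    """LLM-as-assessor binary label; the prompt and raw answer remain auditable."""
    prompt = (
        f"Query: {query}\n"
        f"Response: {response}\n"
        "Does the response answer the query? Reply with only 'yes' or 'no'."
    )
    answer = call_llm(prompt).strip().lower()
    return 1 if answer.startswith("yes") else 0

def embedding_score(exemplar: str, response: str) -> float:
    """Cosine similarity between an exemplar answer and a generated response."""
    a, b = embed(exemplar), embed(response)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```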

Validation and Results

The validation employed TREC Deep Learning Track datasets, applying the methods above to assess their alignment with human judgments and their ability to distinguish between the outputs of different generative models. Key insights include:

  • Subtopic relevance and pairwise preferences showed particular promise in drawing nuanced distinctions between responses.
  • Pairwise preferences, while computationally demanding, were the most effective at recognizing differences in system quality, but hinge on the availability of exemplar responses (see the sketch after this list).
  • Subtopic relevance offered substantial detail, supporting a nuanced understanding of response relevance with little human input beyond auditing.
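
As a rough illustration of the pairwise setup, the sketch below asks an LLM to choose between a generated response and an exemplar answer to the same query. As before, `call_llm` is a hypothetical stand-in and the prompt wording is illustrative rather than the paper's exact template.

```python
def pairwise_preference(query: str, response_a: str, response_b: str) -> str:
    """Ask the LLM assessor which of two responses better answers the query.

    Returns 'A' or 'B'. Uses the hypothetical call_llm from the earlier sketch;
    a real evaluation would also swap the A/B order on a second call to
    control for position bias.
    """
    prompt = (
        f"Query: {query}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better answers the query? Reply with only 'A' or 'B'."
    )
    answer = call_llm(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"
```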

Implications and Future Directions

This work underscores the evolving need for Gen-IR evaluation methodologies that can effectively measure the novel outputs of generative systems. It highlights the potential of LLMs not only as tools in generating responses but also as critical components in the evaluation infrastructure of Gen-IR systems. The future of IR evaluation, as indicated by these findings, will likely rely more heavily on advanced models and autonomous methods, with human oversight ensuring alignment with user expectations and real-world relevance.

The exploration points to several directions for future research, including extending these evaluative methods to broader datasets and contexts, refining the balance between autonomous evaluations and human auditability, and adapting methodologies to the evolving capabilities of Gen-IR systems.

Conclusion

The transition towards generative models in information retrieval poses significant challenges and opportunities for the field of IR evaluation. This paper provides a foundational step towards understanding and developing evaluation methodologies suitable for Gen-IR. By leveraging the capabilities of LLMs within a structured evaluative framework, it opens avenues for more sophisticated, nuanced, and accurate assessments of generative information retrieval systems.

Authors
  1. Negar Arabzadeh
  2. Charles L. A. Clarke