
LLMs Can Patch Up Missing Relevance Judgments in Evaluation (2405.04727v1)

Published 8 May 2024 in cs.IR

Abstract: Unjudged documents or holes in information retrieval benchmarks are considered non-relevant in evaluation, yielding no gains in measuring effectiveness. However, these missing judgments may inadvertently introduce biases into the evaluation as their prevalence for a retrieval model is heavily contingent on the pooling process. Thus, filling holes becomes crucial in ensuring reliable and accurate evaluation. Collecting human judgment for all documents is cumbersome and impractical. In this paper, we aim at leveraging LLMs to automatically label unjudged documents. Our goal is to instruct an LLM using detailed instructions to assign fine-grained relevance judgments to holes. To this end, we systematically simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgment in TREC DL tracks. Our experiments reveal a strong correlation between our LLM-based method and ground-truth relevance judgments. Based on our simulation experiments conducted on three TREC DL datasets, in the extreme scenario of retaining only 10% of judgments, our method achieves a Kendall tau correlation of 0.87 and 0.92 on an average for Vicuña-7B and GPT-3.5 Turbo respectively.

Exploring the Use of LLMs for Filling "Holes" in Information Retrieval Benchmarks

Background on Information Retrieval Evaluation

Information Retrieval (IR) focuses on retrieving relevant information from large document collections. Benchmarks typically evaluate a retrieval model by running a set of queries against a text corpus and measuring how well the model surfaces relevant documents. Because corpora have grown far too large for assessors to judge every document, benchmarks inevitably contain unjudged documents, or "holes". Standard evaluation treats these holes as non-relevant, which can bias scores, particularly against models whose results were not part of the original judging pool; a small illustration of this default-to-non-relevant behavior follows.
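
To make the default concrete, here is a minimal, illustrative Python sketch of an nDCG-style gain computation in which an unjudged document contributes nothing. The grades, document IDs, and the `dcg` helper are made up for this example and are not taken from the paper:

```python
import math

# Illustrative only: grades and document IDs are made up.
qrels = {"d1": 3, "d2": 1}      # judged documents for one query
ranking = ["d1", "d7", "d2"]    # d7 was retrieved but never judged (a "hole")

def dcg(ranked_ids, judgments, k=10):
    """Discounted cumulative gain; unjudged documents default to grade 0."""
    return sum(
        (2 ** judgments.get(doc, 0) - 1) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
    )

print(round(dcg(ranking, qrels), 3))  # d7 adds no gain, even if it is actually relevant
```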

The Problem of Holes

The core problem is that traditional evaluation cannot feasibly collect judgments at the scale of modern collections. Recent strategies therefore turn to automated methods to plug these gaps and preserve fairness and accuracy in system evaluations. Such methods matter because holes can misrepresent the capabilities of newer or architecturally different retrieval models, skewing results towards familiar, well-pooled systems and potentially stifling innovation and progress in IR.

The Paper's Approach: Using LLMs to Fill Holes

The paper presents a methodology for using LLMs to automatically assign relevance judgments to unjudged documents in IR benchmarks. The approach relies on the instruction-following abilities of models such as Vicuña-7B and GPT-3.5 Turbo to interpret a passage and grade its relevance with respect to a specific query.

Core Methodology:

  • Scenario Simulation: Varying degrees of holes are simulated by randomly dropping relevant documents from the TREC DL relevance judgments, creating a controlled and stringent test environment.
  • Instructing LLMs: The LLMs are prompted with detailed guidance and examples for assigning fine-grained relevance levels (from completely irrelevant to perfectly relevant); a minimal sketch of this setup follows the list.
  • Validation: The LLM-generated judgments are compared against the original "ground-truth" judgments to measure agreement.
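
As a rough illustration of the first two steps, the following Python sketch simulates holes by dropping a fraction of the relevant judgments and then asks an LLM to grade each dropped query-document pair. The prompt wording and the `call_llm` helper are placeholders, not the paper's exact prompt or API:

```python
import random

def drop_judgments(qrels, keep_fraction, seed=42):
    """Keep only `keep_fraction` of the relevant judgments per query.

    `qrels` maps query_id -> {doc_id: grade}; dropped relevant documents
    become "holes" to be re-graded by the LLM.
    """
    rng = random.Random(seed)
    reduced, holes = {}, []
    for qid, docs in qrels.items():
        relevant = [d for d, grade in docs.items() if grade > 0]
        k = max(1, int(len(relevant) * keep_fraction)) if relevant else 0
        kept = set(rng.sample(relevant, k))
        reduced[qid] = {d: g for d, g in docs.items() if g == 0 or d in kept}
        holes.extend((qid, d) for d in relevant if d not in kept)
    return reduced, holes

def grade_hole(query_text, doc_text, call_llm):
    """Ask an LLM for a graded judgment on a 0-3 scale (TREC DL style).

    `call_llm` stands in for whatever client sends the prompt to
    Vicuña-7B, GPT-3.5 Turbo, etc., and returns its text reply.
    """
    prompt = (
        "Judge the relevance of the passage to the query on a scale from "
        "0 (completely irrelevant) to 3 (perfectly relevant). "
        "Answer with a single digit.\n"
        f"Query: {query_text}\nPassage: {doc_text}\nGrade:"
    )
    reply = call_llm(prompt)
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0  # default to non-relevant if unparsable
```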

Experimental Insights and Results

The experiments demonstrated not only the feasibility of this approach but also its robustness. In particular:

  • In the most extreme simulation, where only 10% of human judgments were retained, the method achieved average Kendall τ correlations of 0.87 with Vicuña-7B and 0.92 with GPT-3.5 Turbo, reflecting strong agreement with rankings derived from full human assessments.
  • These results indicate that both open-source and proprietary LLMs can extrapolate from limited judgments to produce reliable relevance labels; a sketch of how such a rank correlation is computed follows this list.
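
The Kendall τ here measures how closely the ranking of retrieval systems under LLM-patched judgments matches the ranking under the original judgments. A minimal sketch using SciPy, with made-up system names and effectiveness scores, looks like this:

```python
from scipy.stats import kendalltau

# Hypothetical scores (e.g., nDCG@10) per system under each qrels variant;
# the system names and numbers are invented for illustration.
scores_full = {"bm25": 0.48, "dense_a": 0.61, "dense_b": 0.57, "rerank": 0.66}
scores_patched = {"bm25": 0.47, "dense_a": 0.60, "dense_b": 0.58, "rerank": 0.65}

systems = sorted(scores_full)
tau, p_value = kendalltau(
    [scores_full[s] for s in systems],
    [scores_patched[s] for s in systems],
)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```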

Implications and Future Directions

Practical Implications:

This method could dramatically reduce the human labor and time required in generating IR benchmarks, leading to more frequent updates and potentially more accurate and fair evaluations of retrieval systems.

Theoretical Implications:

The success of LLMs in this arena supports the hypothesis that they can comprehend and process complex, abstract instructions in specialized domains, adapting their general linguistic capabilities to specific tasks.

Speculation on Future AI Developments:

  • Automated Evaluation Systems: Fully automated systems could become standard for initial evaluations in IR benchmarks.
  • Expansion Across Fields: This methodology has potential applications in other areas requiring content evaluation, like content moderation or recommendation systems.
  • Improving LLM Training: Future research might focus on refining LLM training methodologies to enhance their sensitivity to the nuances of relevance judgment without direct human input.

Conclusion

The use of LLMs to fill evaluation holes in IR benchmarks represents a significant step towards more scalable, accurate, and unbiased evaluation. By effectively harnessing the capabilities of LLMs, researchers can help ensure that reported advances reflect a model's genuine capacity to retrieve relevant information, not just its ability to exploit the quirks of a particular benchmark.

Authors (3)
  1. Shivani Upadhyay
  2. Ehsan Kamalloo
  3. Jimmy Lin