One-Shot Labeling for Automatic Relevance Estimation (2302.11266v2)

Published 22 Feb 2023 in cs.IR

Abstract: Dealing with unjudged documents ("holes") in relevance assessments is a perennial problem when evaluating search systems with offline experiments. Holes can reduce the apparent effectiveness of retrieval systems during evaluation and introduce biases in models trained with incomplete data. In this work, we explore whether LLMs can help us fill such holes to improve offline evaluations. We examine an extreme, albeit common, evaluation setting wherein only a single known relevant document per query is available for evaluation. We then explore various approaches for predicting the relevance of unjudged documents with respect to a query and the known relevant document, including nearest neighbor, supervised, and prompting techniques. We find that although the predictions of these One-Shot Labelers (1SL) frequently disagree with human assessments, the labels they produce yield a far more reliable ranking of systems than the single labels do alone. Specifically, the strongest approaches can consistently reach system ranking correlations of over 0.86 with the full rankings over a variety of measures. Meanwhile, the approach substantially increases the reliability of t-tests due to filling holes in relevance assessments, giving researchers more confidence in results they find to be significant. Alongside this work, we release an easy-to-use software package to enable the use of 1SL for evaluation of other ad-hoc collections or systems.

Overview of One-Shot Labeling for Automatic Relevance Estimation

The paper "One-Shot Labeling for Automatic Relevance Estimation" by MacAvaney and Soldaini addresses a significant challenge in the evaluation of information retrieval systems—dealing with unjudged documents, commonly referred to as "holes" in relevance assessments. This challenge arises especially in offline experiments where the costs associated with fully judged test collections are often infeasible. The authors investigate whether LLMs can effectively fill these gaps, focusing on an extreme evaluation setting where only a single known relevant document per query is available.

Problem Context

In traditional information retrieval experiments, test collections are built by judging selected documents for relevance with respect to a set of queries. These judgments are almost always incomplete because of the sheer volume of documents that would need to be assessed, and cost-saving strategies such as shallow pooling leave further gaps that bias evaluation results and limit the reuse of the collections for new systems. The authors propose an alternative: use machine learning models to predict the relevance of unjudged documents, aiming to improve both the reliability and the efficiency of offline evaluations.
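
To make the effect of holes concrete, here is a minimal, self-contained sketch (not from the paper; the document IDs, ranking, and labels are made up) showing how a precision measure changes when unjudged documents are treated as non-relevant versus filled with predicted labels:

```python
# Minimal sketch of how "holes" distort evaluation: unjudged documents are
# conventionally treated as non-relevant, so a system that retrieves
# relevant-but-unjudged documents looks worse than it is. Filling holes with
# predicted labels changes the computed score.

def precision_at_k(ranking, qrels, k=10, predicted=None):
    """P@k where `qrels` holds human labels and `predicted` optionally fills holes."""
    hits = 0
    for doc_id in ranking[:k]:
        if doc_id in qrels:
            hits += qrels[doc_id]              # human judgment (0 or 1)
        elif predicted is not None:
            hits += predicted.get(doc_id, 0)   # model-predicted label for a hole
        # else: the hole counts as non-relevant
    return hits / k

# Hypothetical example: only one judged relevant document ("d1") is known.
qrels = {"d1": 1}
ranking = ["d3", "d1", "d7", "d9", "d2", "d4", "d5", "d6", "d8", "d0"]
predicted = {"d3": 1, "d7": 1}                 # illustrative one-shot labeler output

print(precision_at_k(ranking, qrels))                        # 0.1 (holes = non-relevant)
print(precision_at_k(ranking, qrels, predicted=predicted))   # 0.3 (holes filled)
```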

Proposed Methods

The paper explores several "One-Shot Labeler" (1SL) methods for predicting relevance given a single known relevant document (a sketch of a prompting-based labeler follows the list):

  • MaxRep: Identifies the k-nearest neighbors of the known relevant document using lexical (BM25) and semantic (TCT-ColBERT) similarities and treats them as relevant, with the assigned gain decaying linearly by rank.
  • DuoT5: A sequence-to-sequence model originally designed to judge the relative relevance of two documents with respect to a query, adapted here to estimate the relevance of an unjudged document by comparing it against the known relevant one.
  • DuoPrompt: Prompts an instruction-tuned model such as Flan-T5 to estimate relevance directly, given the query, the known relevant document, and the candidate document.
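
As a concrete illustration of the prompting-based approach, the following sketch scores a candidate document with Flan-T5 via Hugging Face transformers. The prompt template and the yes/no scoring rule are illustrative assumptions, not necessarily the exact ones used by MacAvaney and Soldaini:

```python
# Hedged sketch of DuoPrompt-style one-shot relevance estimation with Flan-T5.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def one_shot_relevance(query: str, known_relevant: str, candidate: str) -> float:
    """Estimate the relevance of `candidate` given the query and one known relevant document."""
    # Illustrative prompt; the paper's exact template may differ.
    prompt = (
        f"Query: {query}\n"
        f"A relevant document: {known_relevant}\n"
        f"Is the following document also relevant to the query? Answer yes or no.\n"
        f"Document: {candidate}"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    # Score by comparing the model's likelihood of answering "yes" vs. "no".
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()
```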

Results

Empirical evaluations on the TREC Deep Learning Track datasets from 2019 to 2021 revealed that the proposed methods consistently achieved high correlations with the full system rankings. Notably, DuoPrompt demonstrated robust performance across recall-agnostic measures such as Precision and RBP, with system ranking correlations regularly surpassing 0.86. The methods were less reliable for recall-oriented measures, which require more exhaustive relevance estimation. Furthermore, filling the holes with one-shot labels yielded more reliable statistical significance tests, addressing biases that arise from incomplete data.
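
The evaluation methodology can be sketched as follows: score every system under both labeling regimes and compare the resulting rankings with a rank correlation such as Kendall's tau. The per-system scores below are made up for illustration; the paper reports correlations above 0.86 for its strongest one-shot labelers.

```python
# Compare system rankings under full human judgments vs. 1SL-filled labels.
from scipy.stats import kendalltau

# Hypothetical mean P@10 per system under each labeling regime.
full_judgments = {"sysA": 0.52, "sysB": 0.47, "sysC": 0.61, "sysD": 0.33}
one_shot_filled = {"sysA": 0.55, "sysB": 0.44, "sysC": 0.64, "sysD": 0.38}

systems = sorted(full_judgments)
tau, p_value = kendalltau(
    [full_judgments[s] for s in systems],
    [one_shot_filled[s] for s in systems],
)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```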

Implications and Future Directions

The introduction of one-shot labeling techniques presents an opportunity to reduce the dependency on expensive manual labeling by leveraging the predictive power of LLMs. The approach is likely to be most beneficial for precision-oriented evaluation measures, offering a practical alternative in settings where human-assessed relevance labels are scarce.

Future work could address several open questions identified in the paper: first, expanding the scope of these methods to document retrieval tasks, which pose additional challenges due to longer inputs; second, handling multi-grade relevance assessments and improving aggregation when multiple known relevant documents are available per query; and finally, extending the approach to recall-sensitive measures without introducing bias.

In conclusion, the paper by MacAvaney and Soldaini demonstrates the potential of one-shot labeling to significantly enhance the evaluation of information retrieval systems by effectively filling relevance assessment holes. This work represents a step toward managing evaluation costs efficiently while ensuring more reliable assessments of retrieval systems.

Authors (2)
  1. Sean MacAvaney (75 papers)
  2. Luca Soldaini (62 papers)
Citations (39)