Best-Case Retrieval Evaluation: Improving the Sensitivity of Reciprocal Rank with Lexicographic Precision (2306.07908v1)

Published 13 Jun 2023 in cs.IR

Abstract: Across a variety of ranking tasks, researchers use reciprocal rank to measure effectiveness for users interested in exactly one relevant item. Despite its widespread use, evidence suggests that reciprocal rank is brittle when discriminating between systems, and this brittleness is compounded in modern evaluation settings, where current high-precision systems may be difficult to distinguish. We address the lack of sensitivity of reciprocal rank by introducing the concept of best-case retrieval and connecting reciprocal rank to it: an evaluation method that assesses the quality of a ranking for the most satisfied possible user across possible recall requirements. This perspective allows us to generalize reciprocal rank and define a new preference-based evaluation we call lexicographic precision, or lexiprecision. By mathematical construction, lexiprecision is guaranteed to preserve differences detected by reciprocal rank, while empirically improving sensitivity and robustness across a broad set of retrieval and recommendation tasks.
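
The abstract describes lexiprecision only informally; the full paper gives the formal construction. As a rough, hedged illustration of the idea (not the authors' exact definition; the function names and the infinity-padding convention below are our own assumptions), lexiprecision can be read as a lexicographic comparison of the ranks at which two systems place the relevant items: the rank of the first relevant item decides first, exactly as in reciprocal rank, and the ranks of later relevant items break ties, which is where the extra sensitivity would come from.

```python
def relevant_positions(ranking, relevant):
    """Return the 1-based ranks at which relevant items appear, in order."""
    return [i for i, item in enumerate(ranking, start=1) if item in relevant]

def reciprocal_rank(ranking, relevant):
    """Standard reciprocal rank: inverse rank of the first relevant item."""
    positions = relevant_positions(ranking, relevant)
    return 1.0 / positions[0] if positions else 0.0

def lexiprecision_prefers(ranking_a, ranking_b, relevant):
    """Preference between two rankings under a best-case, lexicographic comparison.

    Returns -1 if ranking_a is preferred, 1 if ranking_b is preferred, 0 if tied.
    The first relevant item's rank decides first (agreeing with reciprocal rank);
    later relevant items break ties. This is a sketch of the idea, not the
    paper's exact formulation.
    """
    pos_a = relevant_positions(ranking_a, relevant)
    pos_b = relevant_positions(ranking_b, relevant)
    # Assumed convention: pad with infinity so a ranking that retrieves fewer
    # relevant items loses ties at the missing depths.
    n = max(len(pos_a), len(pos_b))
    inf = float("inf")
    pos_a += [inf] * (n - len(pos_a))
    pos_b += [inf] * (n - len(pos_b))
    if pos_a < pos_b:  # Python list comparison is already lexicographic
        return -1
    if pos_b < pos_a:
        return 1
    return 0
```

A small example of the sensitivity gain the abstract claims: two runs that place their first relevant item at the same rank are indistinguishable under reciprocal rank, but a lexicographic comparison can still separate them.

```python
relevant = {"d3", "d9"}
run_a = ["d1", "d3", "d9", "d4"]  # relevant items at ranks 2 and 3
run_b = ["d1", "d3", "d4", "d9"]  # relevant items at ranks 2 and 4

assert reciprocal_rank(run_a, relevant) == reciprocal_rank(run_b, relevant)  # both 0.5
assert lexiprecision_prefers(run_a, run_b, relevant) == -1  # run_a preferred
```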
