
The Surprising Effectiveness of Rankers Trained on Expanded Queries (2404.02587v2)

Published 3 Apr 2024 in cs.IR and cs.AI

Abstract: An important problem in text-ranking systems is handling the hard queries that form the tail end of the query distribution. The difficulty may arise from uncommon, underspecified, or incomplete queries. In this work, we improve ranking performance on hard queries without compromising performance on other queries. First, we use an LLM to enrich training queries using their relevant documents. Next, we fine-tune a specialized ranker only on the enriched hard queries instead of the original queries. We then combine the relevance scores from the specialized ranker and the base ranker, along with a query performance score estimated for each query. Our approach departs from existing methods, which usually employ a single ranker for all queries and are therefore biased towards the easy queries that form the majority of the query distribution. In extensive experiments on the DL-Hard dataset, we find that a principled query-performance-based scoring method using the base and specialized rankers offers a significant improvement of up to 25% on the passage ranking task and up to 48.4% on the document ranking task compared to the baseline of using the original queries, even outperforming the SOTA model.
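
The score combination described in the abstract could be sketched roughly as follows. The abstract does not give the actual formula, so the linear interpolation here is an assumption made for illustration only: a query performance score `qpp` in [0, 1] (higher meaning an easier query) weights the base ranker, and `1 - qpp` weights the specialized ranker. The function names `combine_scores` and `rank` and the example document ids are hypothetical.

```python
from typing import Dict, List


def combine_scores(
    base_scores: Dict[str, float],
    specialized_scores: Dict[str, float],
    qpp: float,
) -> Dict[str, float]:
    """Interpolate per-document scores from two rankers for one query.

    Assumption (not from the paper): qpp lies in [0, 1], and a higher
    value means the query is predicted to be easier, so the base
    ranker is trusted more.
    """
    if not 0.0 <= qpp <= 1.0:
        raise ValueError("qpp must lie in [0, 1]")
    combined: Dict[str, float] = {}
    # Documents missing from one ranker's list get a score of 0.0 there.
    for doc_id in base_scores.keys() | specialized_scores.keys():
        base = base_scores.get(doc_id, 0.0)
        specialized = specialized_scores.get(doc_id, 0.0)
        combined[doc_id] = qpp * base + (1.0 - qpp) * specialized
    return combined


def rank(scores: Dict[str, float]) -> List[str]:
    """Return document ids sorted by descending combined score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    base = {"d1": 0.9, "d2": 0.2}
    specialized = {"d1": 0.1, "d2": 0.8}
    # A predicted-easy query leans on the base ranker.
    print(rank(combine_scores(base, specialized, qpp=0.9)))
    # A predicted-hard query leans on the specialized ranker.
    print(rank(combine_scores(base, specialized, qpp=0.1)))
```

With this weighting, a query predicted to be hard is ranked mostly by the specialized ranker, while a query predicted to be easy keeps roughly its base ranking, which matches the stated goal of not degrading performance on other queries.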

