
PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods (2402.13350v2)

Published 20 Feb 2024 in cs.CL

Abstract: We present the Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including baseline models trained by us as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. To validate our approach, we train new text encoders for Polish and compare their results with previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.
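The third step of the recipe, fusing sparse (lexical) and dense retrieval scores through a lightweight rescoring model, can be illustrated with a minimal sketch. The snippet below shows one common fusion scheme: per-list min-max normalization followed by a weighted linear combination. The `Candidate` structure, the normalization choice, and the fixed `alpha` weight are illustrative assumptions, not the paper's exact rescorer, which the abstract describes only as a lightweight learned model.

```python
# Minimal sketch of sparse-dense hybrid rescoring (illustrative assumptions,
# not the paper's exact method): normalize both score lists, then fuse them
# with a convex combination. A small learned model could replace the fixed mix.

from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    sparse_score: float  # e.g. a BM25 score from a lexical index
    dense_score: float   # e.g. cosine similarity from a bi-encoder


def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so sparse and dense scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rescore(candidates: list[Candidate], alpha: float = 0.5) -> list[tuple[str, float]]:
    """Fuse normalized sparse and dense scores; `alpha` weights the dense signal."""
    sparse = min_max_normalize([c.sparse_score for c in candidates])
    dense = min_max_normalize([c.dense_score for c in candidates])
    fused = [
        (c.doc_id, alpha * d + (1 - alpha) * s)
        for c, s, d in zip(candidates, sparse, dense)
    ]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy candidate pool retrieved by both systems for one query.
    pool = [
        Candidate("doc1", sparse_score=12.3, dense_score=0.61),
        Candidate("doc2", sparse_score=8.7, dense_score=0.74),
        Candidate("doc3", sparse_score=15.1, dense_score=0.42),
    ]
    for doc_id, score in hybrid_rescore(pool, alpha=0.6):
        print(f"{doc_id}: {score:.3f}")
```

Normalizing before fusion matters because BM25 scores are unbounded while cosine similarities lie in [-1, 1]; without a common scale, one signal would dominate regardless of the weight.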

Authors (3)
  1. Sławomir Dadas (11 papers)
  2. Michał Perełkiewicz (7 papers)
  3. Rafał Poświata (9 papers)