Assessing generalization capability of text ranking models in Polish (2402.14318v1)
Abstract: Retrieval-augmented generation (RAG) is becoming an increasingly popular technique for integrating internal knowledge bases with LLMs. In a typical RAG pipeline, three models are used, responsible for the retrieval, reranking, and generation stages. In this article, we focus on the reranking problem for the Polish language, examining the performance of rerankers and comparing their results with available retrieval models. We conduct a comprehensive evaluation of existing models and those trained by us, using a benchmark of 41 diverse information retrieval tasks for the Polish language. The results of our experiments show that most models struggle with out-of-domain generalization. However, combining an effective optimization method with a large training dataset allows for building rerankers that are both compact in size and capable of generalization. The best of our models establishes a new state of the art for reranking in the Polish language, outperforming existing models with up to 30 times more parameters.
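The three-stage pipeline mentioned in the abstract (retrieval, reranking, generation) can be illustrated with a toy sketch. This is not the authors' method: the token-overlap retriever, the length-normalized scorer standing in for a neural reranker, the prompt-building `generate` placeholder, and the example corpus are all hypothetical choices made only to show how the stages hand results to each other.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def retrieve(query, corpus, k=2):
    """Stage 1: cheap retrieval, scoring by the number of distinct shared tokens."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(doc))), i) for i, doc in enumerate(corpus)]
    # Highest overlap first; ties are broken by document index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:k] if score > 0]


def rerank(query, corpus, candidates):
    """Stage 2: rescore only the retrieved candidates.

    A length-normalized overlap stands in for a neural reranker here.
    """
    q = Counter(tokenize(query))

    def score(i):
        tokens = tokenize(corpus[i])
        d = Counter(tokens)
        overlap = sum(min(q[t], d[t]) for t in q)
        return overlap / (len(tokens) or 1)

    return sorted(candidates, key=lambda i: (-score(i), i))


def generate(query, corpus, ranked, n=1):
    """Stage 3: placeholder for the LLM call; it only assembles a prompt."""
    context = "\n".join(corpus[i] for i in ranked[:n])
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical example data.
corpus = [
    "The capital of Poland has changed several times over the long history of the country",
    "Warsaw is the capital of Poland",
    "Paris is the capital of France",
]
query = "capital of Poland"

candidates = retrieve(query, corpus, k=2)    # [0, 1]
ranked = rerank(query, corpus, candidates)   # [1, 0]
prompt = generate(query, corpus, ranked)
```

In this sketch the retriever cannot separate the two Polish documents, while the second-stage scorer moves the shorter, more focused passage to the top before it reaches the generation step.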
Authors: Sławomir Dadas, Małgorzata Grębowiec