
HLTCOE at TREC 2023 NeuCLIR Track (2404.08118v1)

Published 11 Apr 2024 in cs.CL and cs.IR

Abstract: The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track. For PLAID we included a variety of models and training techniques: the English model released with ColBERT v2, translate-train (TT), Translate Distill (TD), and multilingual translate-train (MTT). TT trains a ColBERT model with English queries and passages automatically translated into the document language from the MS-MARCO v1 collection. This results in three cross-language models for the track, one per language. MTT creates a single model for all three document languages by combining the translations of MS-MARCO passages in all three languages into mixed-language batches, so the model learns to match queries to passages in all languages simultaneously. Distillation uses scores from the mT5 model over translated (non-English) document pairs to teach the student how to score query-document pairs. The team submitted runs to all NeuCLIR tasks: the CLIR and MLIR news tasks as well as the technical documents task.
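The mixed-language batching that distinguishes MTT from TT can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures; the function name, language codes, and pair layout are hypothetical and do not reflect the authors' actual training code:

```python
import random

# Hypothetical sketch of multilingual translate-train (MTT) batching:
# translations of MS-MARCO passages in all three document languages are
# pooled and shuffled into mixed-language batches, so a single ColBERT
# model learns to match English queries to passages in every language.
def mixed_language_batches(translations, batch_size=32, seed=0):
    """`translations` maps a language code (e.g. "zho", "fas", "rus") to a
    list of (english_query, translated_passage) training pairs; yields
    shuffled batches that mix pairs from all languages."""
    rng = random.Random(seed)
    pool = [pair for pairs in translations.values() for pair in pairs]
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]
```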

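The distillation step can likewise be sketched as a standard knowledge-distillation objective, with the ColBERT student trained to reproduce the score distribution the mT5 teacher assigns to each query's candidate passages. A minimal PyTorch sketch, assuming precomputed teacher scores and a KL-divergence loss (all names here are hypothetical):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the Translate Distill (TD) loss: the student's
# scores over a query's candidate passages are pushed toward the score
# distribution produced by the mT5 reranker (teacher) on translated pairs.
def translate_distill_loss(student_scores: torch.Tensor,
                           teacher_scores: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between student and teacher score distributions;
    both tensors have shape [batch, num_candidate_passages]."""
    student_logp = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_p = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")
```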
Authors (3)
  1. Eugene Yang
  2. Dawn Lawrie
  3. James Mayfield