
Enhancing Model Performance in Multilingual Information Retrieval with Comprehensive Data Engineering Techniques (2302.07010v1)

Published 14 Feb 2023 in cs.IR and cs.AI

Abstract: In this paper, we present our solution to the Multilingual Information Retrieval Across a Continuum of Languages (MIRACL) challenge of WSDM CUP 2023\footnote{https://project-miracl.github.io/}. Our solution focuses on enhancing the ranking stage, where we fine-tune pre-trained multilingual transformer-based models on the MIRACL dataset. Our model improvements are achieved mainly through diverse data engineering techniques, including the collection of additional relevant training data, data augmentation, and negative sampling. Our fine-tuned model effectively determines the semantic relevance between queries and documents, resulting in a significant improvement in the effectiveness of the multilingual information retrieval process. Finally, our team achieved strong results in this challenging competition, securing 2nd place in the Surprise-Languages track with a score of 0.835 and 3rd place in the Known-Languages track with an average nDCG@10 score of 0.716 across the 16 known languages on the final leaderboard.
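The leaderboard metric cited above, nDCG@10, rewards placing highly relevant documents near the top of the ranked list. A minimal sketch of the standard formula is below; this is an illustrative implementation, not the competition's official scoring code:

```python
import math

def ndcg_at_k(relevances, k=10):
    """Normalized Discounted Cumulative Gain at rank k.

    relevances: graded relevance of each retrieved document,
    in the order the ranker returned them.
    """
    def dcg(rels):
        # Each position i (0-based) is discounted by log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    # Ideal DCG: the same relevance labels sorted best-first.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0, and pushing a relevant document down the list lowers the score, which is why a stronger ranking stage directly improves this metric.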

Authors (9)
  1. Qi Zhang (785 papers)
  2. Zijian Yang (20 papers)
  3. Yilun Huang (14 papers)
  4. Ze Chen (38 papers)
  5. Zijian Cai (12 papers)
  6. Kangxu Wang (4 papers)
  7. Jiewen Zheng (8 papers)
  8. Jiarong He (9 papers)
  9. Jin Gao (38 papers)
Citations (1)
