Mapping Transformer Leveraged Embeddings for Cross-Lingual Document Representation (2401.06583v1)

Published 12 Jan 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Recommendation systems for documents have become essential tools for finding relevant content on the Web. However, these systems have limitations when recommending documents in languages other than the query language, which means they may overlook resources in non-native languages. This research focuses on representing documents across languages using Transformer Leveraged Document Representations (TLDRs) mapped into a cross-lingual domain. Four multilingual pre-trained transformer models (mBERT, mT5, XLM-RoBERTa, ERNIE-M) were evaluated using three mapping methods across 20 language pairs, covering all ordered pairs of five selected European Union languages. Metrics such as Mate Retrieval Rate and Reciprocal Rank were used to measure the effectiveness of mapped TLDRs compared to non-mapped ones. The results highlight the power of cross-lingual representations achieved through pre-trained transformers and mapping approaches, suggesting a promising direction for expanding retrieval beyond connections between two specific languages.
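The abstract does not name the three mapping methods, but a common baseline for aligning one embedding space to another is an orthogonal linear map (the Procrustes solution), and both reported metrics are simple to compute from paired embeddings. The sketch below is illustrative only, assuming NumPy, a set of paired "mate" document embeddings, and a 768-dimensional embedding size; all function and variable names are assumptions, not taken from the paper.

```python
# Minimal sketch (not the paper's implementation): fit an orthogonal map
# between source- and target-language document embeddings, then score
# Mate Retrieval Rate and Mean Reciprocal Rank.
import numpy as np

def fit_orthogonal_map(X_src, Y_tgt):
    """Solve min_W ||X_src W - Y_tgt||_F s.t. W^T W = I (orthogonal Procrustes).

    The optimum is W = U V^T, where U S V^T is the SVD of X_src^T Y_tgt.
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

def mate_retrieval_scores(X_mapped, Y_tgt):
    """Return (Mate Retrieval Rate, Mean Reciprocal Rank).

    Row i of X_mapped and row i of Y_tgt are assumed to be translations
    ("mates") of the same document.
    """
    # Cosine similarity between every mapped source doc and every target doc.
    Xn = X_mapped / np.linalg.norm(X_mapped, axis=1, keepdims=True)
    Yn = Y_tgt / np.linalg.norm(Y_tgt, axis=1, keepdims=True)
    sims = Xn @ Yn.T
    # Position of the true mate in each query's descending-similarity ranking.
    order = np.argsort(-sims, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(len(order))])
    retrieval_rate = float(np.mean(ranks == 0))       # mate retrieved at rank 1
    mrr = float(np.mean(1.0 / (ranks + 1)))           # mean of 1 / (rank)
    return retrieval_rate, mrr

# Toy usage with random vectors standing in for TLDRs:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))                        # "source-language" docs
Q = np.linalg.qr(rng.normal(size=(768, 768)))[0]       # hidden rotation
Y = X @ Q                                              # "target-language" mates
W = fit_orthogonal_map(X[:80], Y[:80])                 # fit on a training split
print(mate_retrieval_scores(X[80:] @ W, Y[80:]))       # evaluate on held-out mates
```

Here Mate Retrieval Rate is top-1 accuracy over the held-out mates and MRR averages the reciprocal rank of the true mate; in the paper's setting the embeddings would come from the multilingual transformers rather than random vectors, and the mapped TLDRs would be compared against their non-mapped counterparts.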
