Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings (2402.17016v1)

Published 26 Feb 2024 in cs.CL, cs.AI, and cs.IR

Abstract: We introduce a novel suite of state-of-the-art bilingual text embedding models that are designed to support English and another target language. These models are capable of processing lengthy text inputs with up to 8192 tokens, making them highly versatile for a range of natural language processing tasks such as text retrieval, clustering, and semantic textual similarity (STS) calculations. By focusing on bilingual models and introducing a unique multi-task learning objective, we have significantly improved the model performance on STS tasks, which outperforms the capabilities of existing multilingual models in both target language understanding and cross-lingual evaluation tasks. Moreover, our bilingual models are more efficient, requiring fewer parameters and less memory due to their smaller vocabulary needs. Furthermore, we have expanded the Massive Text Embedding Benchmark (MTEB) to include benchmarks for German and Spanish embedding models. This integration aims to stimulate further research and advancement in text embedding technologies for these languages.

Advancements in Bilingual Text Embeddings: A Dive into Multi-Task Contrastive Learning

Introduction

Text embedding models have played a pivotal role in advancing NLP applications and research. Because they capture and retrieve semantic meaning from large text corpora, they have become indispensable. While most models have been designed predominantly with English in mind, the need for effective multilingual models has surged as the digital world becomes increasingly global. Addressing the limitations of existing multilingual models, this paper introduces a suite of bilingual text embedding models dedicated to processing English alongside one target language. The models handle long inputs of up to 8192 tokens and are trained with a multi-task learning strategy that refines their performance on semantic textual similarity (STS) and retrieval tasks.

Constructing Bilingual Models

At the core of the paper, the development of bilingual models takes precedence over the more commonly used multilingual frameworks. This strategic decision arises from the observation that most industrial use cases do not require the broad language coverage that multilingual models offer. By focusing on a specific language pair, the bilingual models reduce unnecessary computational overhead, notably from large multilingual vocabularies, and improve performance on the targeted languages. To achieve this, the models are built on customized backbones that support inputs of up to 8192 tokens, and they are fine-tuned with a multi-task learning objective that distinguishes their approach from traditional models.
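
As an illustration of how such a long-context bilingual model might be used, the sketch below loads a checkpoint through `sentence-transformers` and encodes a mixed English/German batch. The model identifier and the need for `trust_remote_code` are assumptions about how the checkpoints are published, not details confirmed by the paper.

```python
# Minimal usage sketch (assumed model name and packaging, not the authors' official example).
from sentence_transformers import SentenceTransformer

# trust_remote_code is assumed to be required because the backbone ships custom modelling code.
model = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)
model.max_seq_length = 8192  # use the full 8192-token context window

documents = [
    "A lengthy English report that may span several thousand tokens ...",
    "Ein langer deutscher Bericht, der mehrere tausend Tokens umfassen kann ...",
]

# One shared embedding space for both languages enables cross-lingual retrieval and STS.
embeddings = model.encode(documents, normalize_embeddings=True)
print(embeddings.shape)  # (2, hidden_dim)
```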

Multi-Task Learning and Fine-Tuning

Multi-task learning has emerged as a robust method for improving model performance across several related tasks simultaneously. In this paper, a hard parameter-sharing strategy is adopted, alongside an unconventional approach to loss calculation: rather than aggregating the losses of all tasks at every step, a single task is sampled in each training iteration, focusing that update on the distinct challenges posed by the chosen task. This methodology yields marked improvements in the model's handling of STS and retrieval tasks.
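
The sketch below illustrates this per-iteration task sampling with a single shared (hard parameter-sharing) encoder. The toy encoder, the random data, the two loss definitions, and the uniform sampling are illustrative assumptions; the paper's actual losses, tasks, and sampling schedule may differ.

```python
# Illustrative sketch of per-iteration task sampling with hard parameter sharing.
# Encoder, data, and losses are toy stand-ins, not the authors' configuration.
import random
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Shared encoder standing in for the transformer backbone (hard parameter sharing).
encoder = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 16)
)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-3)

def info_nce_loss(q, d, temperature=0.05):
    """Contrastive (InfoNCE-style) loss with in-batch negatives for the retrieval task."""
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

def sts_loss(a, b, gold):
    """Regression-style loss aligning cosine similarity with gold STS scores."""
    return F.mse_loss(F.cosine_similarity(a, b), gold)

def sample_batch(task, batch_size=8, dim=32):
    """Stand-in for per-task data loaders; returns random feature batches."""
    if task == "retrieval":
        return {"queries": torch.randn(batch_size, dim),
                "documents": torch.randn(batch_size, dim)}
    return {"a": torch.randn(batch_size, dim), "b": torch.randn(batch_size, dim),
            "scores": torch.rand(batch_size)}

tasks = ["retrieval", "sts"]
for step in range(100):
    task = random.choice(tasks)      # one task is sampled per iteration
    batch = sample_batch(task)
    if task == "retrieval":
        loss = info_nce_loss(encoder(batch["queries"]), encoder(batch["documents"]))
    else:
        loss = sts_loss(encoder(batch["a"]), encoder(batch["b"]), batch["scores"])
    # Only the sampled task's loss drives this update; losses are never summed across tasks.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```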

Evaluation and Benchmarks

A comprehensive evaluation framework is established by extending the Massive Text Embedding Benchmark (MTEB) with German and Spanish tasks. Across these tasks, the bilingual models outperform their multilingual counterparts, especially in cross-lingual retrieval settings. This demonstrates both the models' effectiveness on bilingual text embedding tasks and the efficiency gained from their focused, two-language design.
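
The sketch below shows how such an evaluation could be run with the `mteb` Python package. The model identifier, the language code, and the output folder are assumptions, and the exact German and Spanish task lists (as well as the constructor arguments, which vary across `mteb` versions) should be taken from the benchmark itself rather than from this example.

```python
# Illustrative evaluation sketch (assumed model name; task selection delegated to mteb).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)
model.max_seq_length = 8192

# Restrict the benchmark to German-language tasks (retrieval, clustering, STS, ...).
evaluation = MTEB(task_langs=["de"])
results = evaluation.run(model, output_folder="results/bilingual-de")
```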

Implications and Future Directions

The research has notable implications for embedding models and the LLM-based systems that rely on them, especially in contexts where precise and efficient language understanding is crucial. By demonstrating that bilingual models can match or exceed multilingual alternatives, it charts a path toward more focused and more efficient NLP applications. The multi-task learning strategy likewise shows that several NLP tasks can be improved simultaneously, setting a precedent for future research in the domain. It opens avenues for exploring further language pairs and for refining multi-task objectives tailored to specific aspects of language understanding and processing.

Conclusion

The advancements in bilingual text embeddings as outlined in this paper represent a significant step forward in natural language processing capabilities. By addressing the constraints of multilingual models and pivoting towards a more targeted approach, these bilingual models offer a promising avenue for enhancing semantic understanding and retrieval tasks. Coupled with the innovative application of multi-task learning, the models set a new standard for bilingual text processing, fostering continued exploration and improvement in the field.

Authors (19)
  1. Isabelle Mohr
  2. Markus Krimmel
  3. Saba Sturua
  4. Mohammad Kalim Akram
  5. Andreas Koukounas
  6. Michael Günther
  7. Georgios Mastrapas
  8. Vinit Ravishankar
  9. Joan Fontanals Martínez
  10. Feng Wang
  11. Qi Liu
  12. Ziniu Yu
  13. Jie Fu
  14. Saahil Ognawala
  15. Susana Guzman
  16. Bo Wang
  17. Maximilian Werk
  18. Nan Wang
  19. Han Xiao