Improving Multi-lingual Alignment Through Soft Contrastive Learning (2405.16155v2)

Published 25 May 2024 in cs.CL

Abstract: Learning good multi-lingual sentence representations is critical for achieving high performance in cross-lingual downstream tasks. In this work, we propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model. Given translation sentence pairs, we train a multi-lingual model so that the similarity between cross-lingual embeddings follows the similarity of sentences measured by the mono-lingual teacher model. Our method can be viewed as contrastive learning with soft labels defined as the similarity between sentences. Our experimental results on five languages show that our contrastive loss with soft labels far outperforms the conventional contrastive loss with hard labels on various benchmarks for bitext mining and STS tasks. In addition, our method outperforms existing multi-lingual embeddings, including LaBSE, on the Tatoeba dataset. The code is available at https://github.com/YAI12xLinq-B/IMASCL
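
As a rough illustration of the approach described in the abstract, the sketch below implements contrastive learning with soft labels: a frozen mono-lingual teacher scores the pairwise similarity of the source-language sentences in a batch, and the multi-lingual student is trained so that its cross-lingual similarities between translation pairs match that distribution. The loss formulation, temperature values, and function names here are assumptions made for illustration and may differ from the paper's actual implementation (see the linked repository).

```python
# Hypothetical sketch of soft contrastive learning for multi-lingual alignment.
# Assumption: the soft target for each source sentence is the teacher's
# similarity distribution over the batch, and the student matches it with a
# soft cross-entropy (KL-style) objective instead of one-hot contrastive labels.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(student_src, student_tgt, teacher_src,
                          tau_student=0.05, tau_teacher=0.05):
    """student_src, student_tgt: (B, d) multi-lingual embeddings of a batch of
    translation pairs; teacher_src: (B, d_t) mono-lingual teacher embeddings of
    the source sentences (the teacher is frozen, so no gradient flows to it)."""
    # Cosine similarity matrices from L2-normalised embeddings.
    sim_student = F.normalize(student_src, dim=-1) @ F.normalize(student_tgt, dim=-1).T
    sim_teacher = F.normalize(teacher_src, dim=-1) @ F.normalize(teacher_src, dim=-1).T

    # Soft labels: row-wise distributions over the batch given by the teacher.
    log_p_student = F.log_softmax(sim_student / tau_student, dim=-1)
    p_teacher = F.softmax(sim_teacher.detach() / tau_teacher, dim=-1)

    # Cross-entropy with soft labels; replacing p_teacher with one-hot labels
    # would recover a conventional InfoNCE-style contrastive loss.
    return -(p_teacher * log_p_student).sum(dim=-1).mean()

if __name__ == "__main__":
    B, d = 8, 768
    loss = soft_contrastive_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
    print(loss.item())
```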

References (24)
  1. Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, pages 597–610.
  2. Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv preprint arXiv:1810.12836.
  3. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
  4. Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems, 32.
  5. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
  6. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  7. SONAR: Sentence-level multimodal and language-agnostic representations. arXiv e-prints, arXiv–2308.
  8. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.
  9. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  10. Jiyeon Ham and Eun-Sol Kim. 2021. Semantic alignment with calibrated similarity for multilingual sentence embedding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1781–1791.
  11. Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654.
  12. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.
  13. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191.
  14. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
  15. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
  16. Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813.
  17. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867.
  18. Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214–2218.
  19. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
  20. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368.
  21. Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.
  22. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.
  23. Contrastive data and learning for natural language processing. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts, pages 39–47.
  24. Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 60–67.
Authors (5)
  1. Minsu Park (21 papers)
  2. Seyeon Choi (1 paper)
  3. Chanyeol Choi (13 papers)
  4. Jun-Seong Kim (2 papers)
  5. Jy-yong Sohn (37 papers)
Citations (2)