
Language-agnostic BERT Sentence Embedding (2007.01852v2)

Published 3 Jul 2020 in cs.CL

Abstract: While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019), BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM) (Conneau and Lample, 2019), dual encoder translation ranking (Guo et al., 2018), and additive margin softmax (Yang et al., 2019a). We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by Artetxe and Schwenk (2019b), while still performing competitively on monolingual transfer learning benchmarks (Conneau and Kiela, 2018). Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.

Language-agnostic BERT Sentence Embedding

Feng et al.'s paper, Language-agnostic BERT Sentence Embedding (LaBSE), systematically explores and evaluates methods for building multilingual sentence embeddings by combining a pre-trained multilingual language model with dual-encoder translation ranking. The primary objective is to obtain robust cross-lingual sentence embeddings useful for tasks such as clustering, retrieval, and transfer learning across a wide range of languages.

Methodology

The research employs a dual-encoder architecture in which BERT-based encoder modules embed the source and target sentences of each translation pair independently; a minimal forward-pass sketch follows below. The investigation contrasts models initialized from large pre-trained language models against models trained from scratch on translation pairs. The key techniques evaluated in the paper are masked language modeling (MLM), translation language modeling (TLM), and additive margin softmax (AMS).
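
To make the architecture concrete, the Python sketch below (using the Hugging Face transformers library) shows a dual-encoder forward pass in which both sides of a translation pair are embedded and compared with cosine similarity. The multilingual BERT checkpoint, [CLS] pooling, and shared encoder weights are illustrative assumptions standing in for the paper's own MLM+TLM pre-trained encoder, not the authors' released code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative stand-in for the paper's custom multilingual encoder
# pre-trained with MLM + TLM.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch)
    cls = out.last_hidden_state[:, 0]                    # [CLS] token representation
    return torch.nn.functional.normalize(cls, dim=1)     # L2-normalize for cosine scoring

# Source and target sentences are encoded independently (here with the same
# encoder weights) and scored by cosine similarity.
src_emb = embed(["Hello world."])
tgt_emb = embed(["Hallo Welt."])
score = (src_emb * tgt_emb).sum(dim=1)
```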

1. Pre-training Multilingual Models:

The pre-training stage combines the MLM and TLM objectives, building on the approaches behind models such as mBERT, XLM, and XLM-R. This pre-training phase dramatically reduces the amount of parallel training data required for the subsequent dual-encoder fine-tuning, making efficient use of both monolingual and bilingual corpora; a sketch of how a TLM training example is constructed is shown below.
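
As a rough illustration of the TLM objective, the sketch below builds a training example by concatenating a translation pair and masking tokens in both halves, so that a masked word in one language can be predicted from its context in the other. The whitespace tokenization, flat 15% masking rate, and helper name are simplifying assumptions; real implementations operate on WordPiece tokens and use BERT's 80/10/10 mask/random/keep scheme.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=None):
    """Concatenate a translation pair and randomly mask tokens in both halves."""
    rng = random.Random(seed)
    tokens = [CLS] + src_tokens + [SEP] + tgt_tokens + [SEP]
    inputs, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)       # model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)      # position not scored
    return inputs, labels

# Masked English words can be recovered from the aligned German context and vice versa.
inp, lab = make_tlm_example("the cat sleeps".split(), "die Katze schläft".split(), seed=0)
```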

2. Dual Encoder with Additive Margin Softmax:

LaBSE models are trained with a dual-encoder translation ranking loss, supplemented by additive margin softmax to widen the separation between correct translations and non-translations. In addition, cross-accelerator negative sampling shares in-batch negatives across accelerators, effectively enlarging the pool of negatives per example; large effective batch sizes of this kind are essential for training robust dual-encoder models. A simplified sketch of the ranking loss is given below.
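
For concreteness, here is a simplified NumPy sketch of the bidirectional translation ranking loss with additive margin softmax, using in-batch negatives. The margin value, the omission of a similarity scale (temperature), and the purely in-batch negative pool (rather than negatives gathered across accelerators) are simplifications rather than the paper's exact configuration.

```python
import numpy as np

def ams_translation_ranking_loss(src_emb, tgt_emb, margin=0.3):
    """Bidirectional translation ranking loss with additive margin softmax.

    src_emb, tgt_emb: (batch, dim) L2-normalized embeddings; row i of each
    matrix is a translation pair, and every other row in the batch serves
    as a negative example.
    """
    batch = src_emb.shape[0]
    sim = src_emb @ tgt_emb.T                  # cosine similarity matrix
    # Additive margin: subtract m from the positive (diagonal) scores only,
    # forcing true translations to beat non-translations by at least m.
    sim = sim - margin * np.eye(batch)

    def ranking_ce(logits):
        # Softmax cross-entropy where the correct "class" for row i is column i.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetrize: rank targets given sources and sources given targets.
    return 0.5 * (ranking_ce(sim) + ranking_ce(sim.T))
```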

Experimental Evaluation

The performance of LaBSE is rigorously evaluated on several benchmark datasets:

1. Bi-text Retrieval:

Evaluations on the United Nations (UN) corpus, Tatoeba, and BUCC datasets illustrate the effectiveness of LaBSE. The LaBSE model significantly outperforms previous state-of-the-art methods, achieving 83.7% bi-text retrieval accuracy over 112 languages on the Tatoeba dataset compared to the 65.5% previously attained by the LASER model.
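
The retrieval setup can be reproduced in a few lines with the publicly released model. The sketch below uses the community sentence-transformers packaging of the checkpoint (sentence-transformers/LaBSE) rather than the original TF Hub module, and performs plain nearest-neighbor matching; production bi-text mining pipelines typically add score thresholds or margin-based scoring on top.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en = ["The cat sleeps on the sofa.", "How much does the ticket cost?"]
de = ["Was kostet die Fahrkarte?", "Die Katze schläft auf dem Sofa."]

# LaBSE is trained so that translations lie close together in cosine space.
en_emb = model.encode(en, normalize_embeddings=True)
de_emb = model.encode(de, normalize_embeddings=True)

sim = en_emb @ de_emb.T            # cosine similarities (embeddings are unit-norm)
matches = sim.argmax(axis=1)       # nearest German sentence for each English one
for i, j in enumerate(matches):
    print(f"{en[i]!r} -> {de[j]!r} (cos={sim[i, j]:.2f})")
```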

2. Transfer Learning on Downstream Tasks:

LaBSE demonstrates competitive performance on SentEval benchmarks, which focus on a variety of downstream classification tasks including sentiment analysis and paraphrase detection. Despite its extensive multilingual capabilities, LaBSE holds its ground against specialized monolingual sentence embedding models.

Key Findings and Contributions

  • Enhanced Cross-lingual Embeddings:

The integration of pre-trained multilingual models with dual-encoder fine-tuning establishes a new standard for bi-text mining, especially notable in large retrieval tasks.

  • Reduced Data Requirements:

Pre-training with MLM and TLM reduces the dependency on parallel data by 80%, facilitating the development of multilingual models with significantly less data.

  • Robust to Low-resource Languages:

LaBSE exhibits high performance even on languages with scarce monolingual or bilingual data, indicating substantial generalization capabilities across linguistically diverse settings.

  • Public Release and Accessibility:

The best-performing LaBSE model covering 109+ languages is made publicly available, enhancing accessibility for researchers and developers working on multilingual natural language processing.

Implications and Future Work

The contributions of LaBSE indicate several implications for both practical and theoretical aspects of natural language processing:

  • Practical Applications:

The robust bi-text retrieval performance suggests that LaBSE can significantly aid in building high-quality translation systems, cross-lingual information retrieval applications, and multilingual content understanding systems.

  • Theoretical Advancements:

This research emphasizes the importance of effective pre-training and negative sampling strategies, laying the groundwork for further exploration into optimizing cross-lingual and multilingual models.

Future developments might include:

  • Hard Negative Sampling Variants:

Exploring more sophisticated hard negative mining strategies could further enhance performance, particularly with models of increased capacity such as LaBSE Large.

  • Distillation and Student Models:

Combining LaBSE’s approach with emerging methods such as model distillation may yield even more efficient multilingual models without sacrificing accuracy.

In summary, LaBSE effectively addresses the complexities of generating language-agnostic sentence embeddings across a diverse range of languages, setting a new benchmark in the field of multilingual NLP. The publicly available model further democratizes access to advanced multilingual embeddings, catalyzing future research and practical innovations.

Authors (5)
  1. Fangxiaoyu Feng (5 papers)
  2. Yinfei Yang (73 papers)
  3. Daniel Cer (28 papers)
  4. Naveen Arivazhagan (15 papers)
  5. Wei Wang (1793 papers)
Citations (784)