DICT-MLM: Improved Multilingual Pre-Training using Bilingual Dictionaries (2010.12566v1)

Published 23 Oct 2020 in cs.CL

Abstract: Pre-trained multilingual language models such as mBERT have shown immense gains for several NLP tasks, especially in the zero-shot cross-lingual setting. Most, if not all, of these pre-trained models rely on the masked language modeling (MLM) objective as the key language learning objective. The principle behind these approaches is that predicting the masked words with the help of the surrounding text helps learn potent contextualized representations. Despite the strong representation learning capability enabled by MLM, we demonstrate an inherent limitation of MLM for multilingual representation learning. In particular, by requiring the model to predict the language-specific token, the MLM objective disincentivizes learning a language-agnostic representation -- which is a key goal of multilingual pre-training. Therefore, to encourage better cross-lingual representation learning we propose the DICT-MLM method. DICT-MLM works by incentivizing the model to be able to predict not just the original masked word, but potentially any of its cross-lingual synonyms as well. Our empirical analysis on multiple downstream tasks spanning 30+ languages demonstrates the efficacy of the proposed approach and its ability to learn better multilingual representations.
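The sketch below illustrates one way a dictionary-expanded masking objective of this kind could be set up, contrasting it with the standard one-hot MLM target. It is not the authors' implementation: the vocabulary, bilingual dictionary, logits, and the uniform soft target over translations are all hypothetical placeholders chosen for illustration.

```python
# A minimal sketch (not the paper's code) of a DICT-MLM-style objective:
# instead of a one-hot target for the masked token, any of its dictionary
# translations is also accepted as a target. All names below are toy examples.

import torch
import torch.nn.functional as F

# Toy vocabulary and toy bilingual dictionary (assumed, for illustration only).
vocab = {"dog": 0, "perro": 1, "chien": 2, "cat": 3, "gato": 4}
bilingual_dict = {"dog": ["perro", "chien"], "cat": ["gato"]}

def dict_mlm_target(masked_word: str, vocab_size: int) -> torch.Tensor:
    """Build a soft target spreading probability mass over the original word
    and its cross-lingual synonyms, rather than the one-hot MLM target."""
    targets = [masked_word] + bilingual_dict.get(masked_word, [])
    dist = torch.zeros(vocab_size)
    for w in targets:
        dist[vocab[w]] = 1.0
    return dist / dist.sum()  # uniform over the accepted translations

# Hypothetical model logits for one masked position.
logits = torch.randn(len(vocab))

# Standard MLM loss: only the original language-specific token is correct.
mlm_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([vocab["dog"]]))

# DICT-MLM-style loss: cross-entropy against the dictionary-expanded soft target.
soft_target = dict_mlm_target("dog", len(vocab))
dict_mlm_loss = -(soft_target * F.log_softmax(logits, dim=-1)).sum()

print(f"MLM loss: {mlm_loss.item():.3f}, DICT-MLM-style loss: {dict_mlm_loss.item():.3f}")
```

Under this kind of target, the model is no longer penalized for assigning probability to a synonym in another language, which is the behavior the abstract argues standard MLM discourages.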

Authors (4)
  1. Aditi Chaudhary (24 papers)
  2. Karthik Raman (26 papers)
  3. Krishna Srinivasan (14 papers)
  4. Jiecao Chen (23 papers)
Citations (23)