Soft Language Clustering for Multilingual Model Pre-training (2306.07610v1)

Published 13 Jun 2023 in cs.CL

Abstract: Multilingual pre-trained LLMs have demonstrated impressive (zero-shot) cross-lingual transfer abilities; however, their performance is hindered when the target language is typologically distant from the source languages or when pre-training data is limited in size. In this paper, we propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods. On the XTREME tasks, including text classification, sequence labeling, question answering, and sentence retrieval, both base- and large-size LLMs pre-trained with our proposed method exhibit consistent performance improvements. Furthermore, the method provides substantial advantages for low-resource languages in unsupervised sentence retrieval and for target languages that differ greatly from the source language in cross-lingual transfer.
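The core mechanism described in the abstract, retrieving a prompt for each instance so the encoder can condition on it, can be illustrated with a minimal sketch. Note that the pool size, prompt length, mean-pooled query, and softmax-weighted mixing below are illustrative assumptions; the paper's actual XLM-P retrieval module and its integration into the multilingual encoder may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualPromptRetriever(nn.Module):
    """Sketch of contextual prompt retrieval: each instance softly selects a
    prompt from a shared pool of learnable prompt vectors, which is then
    prepended to its token embeddings. Hypothetical, not the paper's exact design."""

    def __init__(self, hidden_size: int, pool_size: int = 32, prompt_length: int = 4):
        super().__init__()
        # Pool of learnable prompts: each entry is `prompt_length` vectors.
        self.prompt_pool = nn.Parameter(
            torch.randn(pool_size, prompt_length, hidden_size) * 0.02
        )
        # Keys used to score pool entries against an instance-level query.
        self.prompt_keys = nn.Parameter(torch.randn(pool_size, hidden_size) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden)
        # Mean-pool the token embeddings to get one query vector per instance.
        query = token_embeddings.mean(dim=1)                 # (batch, hidden)
        scores = query @ self.prompt_keys.t()                # (batch, pool)
        weights = F.softmax(scores, dim=-1)                  # (batch, pool)
        # Softly mix the pool entries into one retrieved prompt per instance.
        prompt = torch.einsum("bp,plh->blh", weights, self.prompt_pool)
        # Prepend the prompt so the encoder conditions on it.
        return torch.cat([prompt, token_embeddings], dim=1)  # (batch, L+seq, hidden)


if __name__ == "__main__":
    retriever = ContextualPromptRetriever(hidden_size=768)
    dummy = torch.randn(2, 16, 768)   # fake batch of token embeddings
    out = retriever(dummy)
    print(out.shape)                  # torch.Size([2, 20, 768])
```

In this sketch the retrieved prompt is simply concatenated in front of the token embeddings, so a single shared encoder can pick up instance-specific (and hence language-specific) guidance without allocating separate parameters per language, which matches the "lightweight" framing in the abstract.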

Authors (8)
  1. Jiali Zeng (24 papers)
  2. Yufan Jiang (17 papers)
  3. Yongjing Yin (19 papers)
  4. Yi Jing (9 papers)
  5. Fandong Meng (174 papers)
  6. Binghuai Lin (20 papers)
  7. Yunbo Cao (43 papers)
  8. Jie Zhou (687 papers)
Citations (4)