
Extrapolating Large Language Models to Non-English by Aligning Languages (2308.04948v2)

Published 9 Aug 2023 in cs.CL

Abstract: Existing LLMs show disparate capabilities across languages due to imbalance in their training data. Their performance on English tasks is often stronger than on tasks in other languages. In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages. We start by targeting individual languages, performing cross-lingual instruction-tuning (CoIT) on LLaMA, i.e., tuning it with translation task data and cross-lingual general task data to obtain cross-lingual models (x-LLaMAs), and formulate underlying scaling laws to investigate the advantages of using scalable translation data. Then we perform multilingual instruction-tuning (MuIT) with mixed resources to build multilingual m-LLaMA. We also illustrate how we leverage the scaling laws to optimize data allocation in a resource-constrained setting. Experimental results on the cross-lingual benchmarks XQUAD and MLQA show that x-LLaMAs surpass the English instruction-tuned counterpart (Alpaca) by an average of 27.83% across six non-English languages. Evaluation results on the translation dataset Flores-101 show that x-LLaMAs outperform previous LLaMA-based models by an average of 18.89%. Encouragingly, m-LLaMA achieves comparable performance to x-LLaMAs on individual languages and demonstrates the ability to follow multilingual instructions. Further analysis of response content and representation space reveals the alignment of the multilingual semantic space within the middle layers of m-LLaMA.
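
The abstract mentions fitting scaling laws to translation data and using them to allocate a fixed data budget across languages. The paper's exact functional form and constants are not reproduced here; the following is a minimal sketch assuming a saturating power law and a greedy marginal-gain allocation, with all numbers purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative observations for one target language:
# translation data size (thousands of parallel pairs) vs. downstream accuracy.
data_sizes = np.array([10, 50, 100, 200, 400], dtype=float)
accuracy = np.array([0.31, 0.42, 0.47, 0.51, 0.54])

def saturating_power_law(d, a, b, c):
    # Assumed form: performance approaches a ceiling `a` as data grows.
    return a - b * np.power(d, -c)

params, _ = curve_fit(saturating_power_law, data_sizes, accuracy,
                      p0=[0.6, 1.0, 0.5], maxfev=10000)

def allocate(budget, step, fits):
    # Greedily hand out `step`-sized chunks of the budget to whichever
    # language currently has the largest predicted marginal gain.
    alloc = {lang: 1.0 for lang in fits}  # small floor keeps the power law finite
    for _ in range(int(budget // step)):
        gains = {lang: f(alloc[lang] + step) - f(alloc[lang]) for lang, f in fits.items()}
        best = max(gains, key=gains.get)
        alloc[best] += step
    return alloc

# One fitted curve per language; a single hypothetical language "ar" shown here.
fits = {"ar": lambda d: saturating_power_law(d, *params)}
print(allocate(budget=400, step=10, fits=fits))
```

With one fitted curve per language, the same loop extends to the multilingual (m-LLaMA) setting, where a shared translation-data budget has to be split across several target languages.

The abstract also reports that the multilingual semantic space aligns in the middle layers of m-LLaMA. A simple way to probe such a claim is to compare mean-pooled hidden states of parallel sentences at a chosen layer; the checkpoint name below is a placeholder, not the paper's released model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "huggyllama/llama-7b"  # placeholder; substitute the model under study
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

def mid_layer_embedding(text, layer=16):
    # Mean-pool token states from one middle decoder layer.
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

en = mid_layer_embedding("The cat sat on the mat.")
es = mid_layer_embedding("El gato se sentó en la alfombra.")
print(torch.cosine_similarity(en, es, dim=0).item())
```

Higher cross-lingual similarity at middle layers than at the bottom or top layers would be consistent with the alignment the authors describe.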

Authors (9)
  1. Wenhao Zhu (32 papers)
  2. Yunzhe Lv (3 papers)
  3. Qingxiu Dong (39 papers)
  4. Fei Yuan (28 papers)
  5. Jingjing Xu (80 papers)
  6. Shujian Huang (106 papers)
  7. Lingpeng Kong (134 papers)
  8. Jiajun Chen (125 papers)
  9. Lei Li (1293 papers)
Citations (52)