BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages (2305.18098v3)

Published 29 May 2023 in cs.CL

Abstract: LLMs demonstrate promising translation performance among various natural languages. However, many LLMs, especially open-source ones such as BLOOM and LLaMA, are English-dominant and support only dozens of natural languages, leaving the potential of LLMs for language translation underexplored. In this work, we present BigTranslate, which adapts LLaMA, a model covering only 20 languages, and enhances it with multilingual translation capability for more than 100 languages. BigTranslate is built upon LLaMA-13B and is optimized in three steps. First, we continue training LLaMA with massive Chinese monolingual data. Second, we continue training the model with a large-scale parallel dataset that covers 102 natural languages. Third, we instruction-tune the foundation model with multilingual translation instructions, leading to our BigTranslate model. Preliminary experiments on multilingual translation show that BigTranslate performs comparably with ChatGPT and Google Translate in many languages and even outperforms ChatGPT in 8 language pairs. We release the BigTranslate model and hope it can advance research progress.

An Expert Assessment of BigTranslate: Enhancing LLMs with Multilingual Translation Capabilities

In the paper "BigTranslate: Augmenting LLMs with Multilingual Translation Capability Over 100 Languages," the authors tackle a critical limitation in the field of LLMs, which have traditionally exhibited strong proficiency primarily in English and a limited set of other languages. By introducing BigTranslate, the authors aim to bridge the gap in multilingual translation capabilities by adapting the LLaMA model to support over 100 languages.

The paper begins by contextualizing the existing landscape of LLMs, highlighting their potential in translation tasks but also pointing out their limitations in terms of language support. The research then presents BigTranslate, a model built upon LLaMA-13B, which originally supports only 20 languages. The methodology involves a three-step optimization process to endow the model with broad multilingual capability.

The optimization process includes:

  1. Continued Training with Chinese Monolingual Data: This step focuses on strengthening the model's capabilities in Chinese, a language that typically exhibits low cross-lingual similarity with others. By doing so, the model is better positioned to serve as a bridge for Chinese-centered multilingual translation.
  2. Training on a Large-Scale Parallel Dataset: The authors incorporate a parallel dataset covering 102 languages to augment the model with multilingual functionality. This involves a novel incremental curriculum learning approach that improves the balance between high-resource and low-resource languages (a sketch of one such sampling scheme appears after this list).
  3. Instruction Tuning: The final step involves refining the multilingual model with structured translation instructions, enhancing its ability to generate translations in diverse contexts.
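
To make steps 2 and 3 more concrete, the sketch below illustrates one plausible way to implement resource-balanced sampling over language pairs and a simple translation-instruction format. It is only an illustration under assumed details: the `temperature` parameter, the prompt wording, and the corpus sizes are hypothetical, and the authors' exact incremental curriculum schedule and instruction templates are not reproduced here.

```python
import random

# Hypothetical temperature-based sampling over language pairs, in the spirit of
# the incremental curriculum learning described in the paper; the authors'
# exact schedule is not reproduced here.
def sampling_weights(pair_sizes, temperature=0.3):
    """Flatten the corpus-size distribution so low-resource pairs are seen more often."""
    total = sum(pair_sizes.values())
    raw = {pair: (n / total) ** temperature for pair, n in pair_sizes.items()}
    norm = sum(raw.values())
    return {pair: w / norm for pair, w in raw.items()}

def sample_pair(pair_sizes, temperature=0.3):
    """Draw one language pair for the next training batch."""
    weights = sampling_weights(pair_sizes, temperature)
    pairs, probs = zip(*weights.items())
    return random.choices(pairs, weights=probs, k=1)[0]

# Hypothetical instruction format for the translation instruction-tuning stage.
def build_instruction(src_lang, tgt_lang, src_text):
    return (f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
            f"{src_lang}: {src_text}\n"
            f"{tgt_lang}:")

# Illustrative corpus sizes (sentence pairs); these are not figures from the paper.
pair_sizes = {"en-zh": 5_000_000, "zh-sw": 40_000, "zh-km": 15_000}
print(sampling_weights(pair_sizes))
print(sample_pair(pair_sizes))
print(build_instruction("English", "Chinese", "The weather is nice today."))
```

With a lower temperature the weights move toward uniform, so rare directions such as zh-km are sampled far more often than their raw share of the data would suggest, which is the kind of rebalancing a curriculum over high- and low-resource languages aims for.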

Evaluation results show BigTranslate performing comparably to established systems like ChatGPT and Google Translate in many language pairs. Notably, in 8 language pairs, BigTranslate even surpasses ChatGPT. A significant element of their evaluation was employing GPT-4 to supplement BLEU scores, given recognized limitations in BLEU's correlation with human judgment.
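
Since BLEU remains the primary automatic metric in the comparison, a minimal sketch of scoring model outputs with the sacrebleu library is shown below. The placeholder sentences are hypothetical, and the GPT-4-based judgments that the authors use to supplement BLEU are not reproduced here.

```python
# Minimal BLEU-scoring sketch with sacrebleu; the inputs are placeholders,
# not data from the paper, and the GPT-4 side of the evaluation is omitted.
import sacrebleu

hypotheses = ["BigTranslate output for sentence one.",
              "BigTranslate output for sentence two."]
references = [["Reference translation for sentence one.",
               "Reference translation for sentence two."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```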

Implications and Speculation on Future Developments

The authors' work enhances our understanding of how integrating comprehensive multilingual datasets and strategic instruction tuning can bridge performance gaps in LLMs across a broad spectrum of languages. Practically, this expands the reach of LLM applications in global markets, enabling access for a larger share of the world's population whose native languages were previously poorly served by such technology.

From a theoretical perspective, the incremental data sampling strategy, akin to curriculum learning, introduces a structured methodology that can be extended to other LLM advancements beyond translation, inviting further research into balancing learning between resource-rich and resource-poor languages.

In future developments, there could be a pivot towards enhancing BigTranslate's ability in low-resource languages without relying heavily on data augmentation. Additionally, exploring the transfer of other LLM capabilities—such as semantic understanding and question-answering—into these newly supported languages could further improve the model's broad applicability.

Overall, the paper effectively contributes to advancing LLM capabilities, and future research could focus on refining translation quality and extending the model to other sophisticated NLP tasks, thereby continuing to democratize AI access and usability across linguistic demographics.

Authors (4)
  1. Wen Yang (185 papers)
  2. Chong Li (112 papers)
  3. Jiajun Zhang (176 papers)
  4. Chengqing Zong (65 papers)
Citations (37)