Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages (2406.12739v1)

Published 18 Jun 2024 in cs.CL

Abstract: LLMs have become a go-to solution not just for text generation, but also for natural language understanding (NLU) tasks. Acquiring extensive knowledge through language modeling on web-scale corpora, they excel on English NLU, yet struggle to extend their NLU capabilities to underrepresented languages. In contrast, machine translation models (MT) produce excellent multilingual representations, resulting in strong translation performance even for low-resource languages. MT encoders, however, lack the knowledge necessary for comprehensive NLU that LLMs obtain through language modeling training on immense corpora. In this work, we get the best of both worlds by integrating MT encoders directly into LLM backbones via sample-efficient self-distillation. The resulting MT-LLMs preserve the inherent multilingual representational alignment from the MT encoder, allowing lower-resource languages to tap into the rich knowledge embedded in English-centric LLMs. Merging the MT encoder and LLM in a single model, we mitigate the propagation of translation errors and inference overhead of MT decoding inherent to discrete translation-based cross-lingual transfer (e.g., translate-test). Evaluation spanning three prominent NLU tasks and 127 predominantly low-resource languages renders MT-LLMs highly effective in cross-lingual transfer. MT-LLMs substantially and consistently outperform translate-test based on the same MT model, showing that we truly unlock multilingual language understanding for LLMs.

Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages

This paper presents a novel methodology for enhancing the cross-lingual capabilities of LLMs by integrating machine translation (MT) encoders with LLM backbones through self-distillation. The resulting hybrid model, termed MT-LLM, aims to leverage the strengths of both LLMs and MT encoders to perform natural language understanding (NLU) tasks across a diverse set of languages, including many low-resource languages.

Introduction

LLMs like GPT-3 and Llama 3 have demonstrated impressive performance on a variety of NLU tasks, particularly in English. However, their effectiveness diminishes significantly for languages that are typologically distant from English or poorly represented in their training data. Conversely, state-of-the-art MT models like NLLB and MADLAD-400 provide strong multilingual representations but lack the extensive world knowledge embedded in LLMs. To bridge this gap, the authors propose integrating MT encoders directly into LLM backbones through a self-distillation process, thereby enhancing the cross-lingual transfer capabilities of LLMs.

Methodology

The integration is achieved in two primary stages:

  1. Self-Supervised General Adaptation: This initial stage focuses on aligning the representation spaces of the MT encoder and the LLM. The process uses a sequence-level alignment objective where new trainable parameters (a projection matrix and LoRA adapters) are optimized to map MT encoder outputs to the LLM's input embedding space. This enables the LLM to understand multilingual representations generated by the MT encoder.
  2. Task-Specific Distillation: In this stage, the model undergoes task-specific fine-tuning. The LLM is first fine-tuned on labeled task data; this task-specific knowledge is then transferred to the MT-LLM hybrid by aligning the output representations of the task-tuned LLM and the MT-LLM. Both stages are sketched in the code below.

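As a rough, self-contained illustration of these two stages, the following PyTorch sketch projects MT-encoder states into a toy LLM's embedding space and applies a sequence-level alignment loss (Stage 1) plus a soft-label distillation loss against a task-tuned teacher (Stage 2). All names, dimensions, and the concrete loss choices (MSE for alignment, temperature-scaled KL for task distillation) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions; the paper pairs an NLLB encoder with an LLM backbone.
MT_DIM, LLM_DIM, VOCAB = 1024, 2048, 32000


class ToyLLM(nn.Module):
    """Stand-in for the (frozen) LLM backbone: token embeddings plus a few
    transformer layers operating on continuous input embeddings."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, LLM_DIM)
        layer = nn.TransformerEncoderLayer(LLM_DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        return self.blocks(inputs_embeds)                   # (batch, seq, LLM_DIM)


class MTtoLLMProjection(nn.Module):
    """Trainable projection mapping MT-encoder states into the LLM's input
    embedding space (the paper additionally trains LoRA adapters inside the
    LLM; omitted here for brevity)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(MT_DIM, LLM_DIM)

    def forward(self, mt_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(mt_hidden)


def stage1_alignment_loss(llm, projection, mt_hidden, english_ids):
    """Stage 1 (self-supervised general adaptation), sketched as sequence-level
    self-distillation: the LLM reading projected multilingual MT-encoder states
    should match its own representation of the parallel English text."""
    with torch.no_grad():                                   # frozen teacher path
        teacher = llm(llm.embed(english_ids)).mean(dim=1)   # sequence-level pooling
    student = llm(projection(mt_hidden)).mean(dim=1)        # trainable student path
    return F.mse_loss(student, teacher)


def stage2_task_distillation_loss(teacher_logits, student_logits, T=2.0):
    """Stage 2 (task-specific distillation): align the MT-LLM's task outputs
    (e.g., NLI labels) with a task-fine-tuned LLM teacher via softened KL."""
    p = F.softmax(teacher_logits / T, dim=-1)
    q = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(q, p, reduction="batchmean") * T * T


# Illustrative usage with random tensors standing in for a real batch.
llm, projection = ToyLLM(), MTtoLLMProjection()
mt_hidden = torch.randn(4, 16, MT_DIM)                      # MT-encoder outputs
english_ids = torch.randint(0, VOCAB, (4, 16))              # parallel English tokens
loss = stage1_alignment_loss(llm, projection, mt_hidden, english_ids)
```

In practice, only the projection (and LoRA parameters) would receive gradients in Stage 1, which is what keeps the adaptation sample-efficient.
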
Experimental Setup and Results

Tasks and Languages

The authors evaluated the proposed MT-LLM across three NLU tasks:

  1. Natural Language Inference (NLI): Evaluated on XNLI, AmericasNLI, and Kardeş-NLU datasets.
  2. Sentiment Classification: Evaluated on the NusaX dataset, covering 10 local languages of Indonesia.
  3. Multiple-Choice Machine Reading Comprehension (MRC): Evaluated on the Belebele benchmark, which includes 122 languages.

Cross-Lingual Transfer Setups

The paper employed two standard cross-lingual transfer setups:

  1. Zero-Shot Cross-Lingual Transfer (ZS-XLT): The model is fine-tuned on English training data and evaluated directly on target language instances.
  2. Translate-Test: Target language instances are translated into English before being processed by the LLM (both setups are contrasted in the sketch below).
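
To make the contrast concrete, here is a purely schematic sketch of the two inference paths; every object and method name (mt_model.translate, mt_encoder.encode, predict_from_embeds, and so on) is a hypothetical placeholder, not an interface from the paper or any specific library.

```python
def translate_test_predict(mt_model, llm_classifier, target_text: str) -> int:
    """Translate-test: autoregressively decode an English translation, then
    classify it. Translation errors propagate into the classifier, and the
    MT decoding pass dominates inference cost."""
    english_text = mt_model.translate(target_text)    # discrete decoding step
    return llm_classifier.predict(english_text)


def mt_llm_predict(mt_encoder, projection, llm_classifier, target_text: str) -> int:
    """MT-LLM: project MT-encoder states straight into the LLM. No translation
    is ever decoded, so there is no decoding overhead and no discrete
    translation step to introduce errors."""
    mt_states = mt_encoder.encode(target_text)        # single encoder forward pass
    return llm_classifier.predict_from_embeds(projection(mt_states))
```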

Numerical Results

The MT-LLM significantly outperformed both standard LLMs and the standalone NLLB encoder on cross-lingual NLU tasks. Notably, the MT-LLM achieved an average accuracy of 81.4% on XNLI and 82.1% on Kardeş-NLU, substantially improving over both standard LLMs and MT models. The results also show that the MT-LLM approach surpasses the translate-test setup, achieving better performance while reducing inference overhead by eliminating the need for MT decoding.

Discussion

The paper sheds light on the computational efficiency of the proposed self-distillation method, which requires only a few thousand training steps to achieve significant alignment between the MT and LLM backbones. This efficiency is crucial given the extensive computational resources typically required for training such models.

Implications and Future Work

The integration of MT encoders into LLMs through self-distillation holds considerable promise for improving multilingual capabilities in NLU tasks. By extending LLMs' access to the rich multilingual representations of MT encoders, this approach mitigates the constraints posed by typological differences and low-resource language representations.

Future research could explore the inclusion of token-level alignment objectives to further enhance the alignment and generalization capabilities of MT-LLMs. Additionally, extending this approach to support even more languages through post-hoc adaptation of both LLM and MT encoders may yield further gains in cross-lingual NLU performance.

Conclusion

This paper introduces a novel and effective method to enhance the cross-lingual NLU capabilities of LLMs by integrating MT encoders through self-distillation. The resulting MT-LLMs demonstrate superior cross-lingual performance, validating the efficacy of the proposed approach and paving the way for more inclusive and efficient multilingual LLMs.

Authors (4)
  1. Fabian David Schmidt (11 papers)
  2. Philipp Borchert (7 papers)
  3. Ivan Vulić (130 papers)
  4. Goran Glavaš (82 papers)