Understanding and Mitigating Language Confusion in LLMs
The paper "Understanding and Mitigating Language Confusion in LLMs" by Kelly Marchisio, Wei-Yin Ko, Alexandre Bérard, Théo Dehaze, and Sebastian Ruder analyzes a common but under-measured limitation of LLMs: their failure to consistently generate text in the language the user requests. The authors introduce the Language Confusion Benchmark (LCB) to measure this behavior across 15 typologically diverse languages, test a range of LLMs with both monolingual and cross-lingual prompts, and propose strategies to mitigate the problem.
Key Contributions
The paper makes several notable contributions:
- Identification of Language Confusion: The authors pinpoint a significant issue termed "language confusion," where LLMs generate outputs in unintended languages. They categorize this confusion into word-level and line-level errors.
- Language Confusion Benchmark (LCB): The authors build the LCB to assess language confusion across multiple languages, combining prompts drawn from existing datasets with newly written ones. The benchmark's typologically diverse language coverage supports a comprehensive evaluation.
- Systematic Evaluation: The paper systematically evaluates popular LLMs such as Llama, Command R, Mistral, and OpenAI models. The evaluation highlights the susceptibility of different models to language confusion under various conditions.
- Mitigation Strategies: The paper proposes several methods to alleviate language confusion, including few-shot prompting, multilingual supervised fine-tuning (SFT), and preference tuning.
Methodology and Findings
Evaluation Metrics
Two primary evaluation metrics are introduced: Line-level Pass Rate (LPR) and Word-level Pass Rate (WPR). LPR is the percentage of responses in which every line is detected as being in the intended language. WPR is the percentage of responses free of stray words in the wrong language; it is computed for languages written in non-Latin scripts, where such intrusions are easiest to detect.
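A line-level pass check of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: `detect_script` is a crude script-based stand-in for a real language identifier (the paper relies on proper language-ID models), and the function names `line_pass` and `lpr` are hypothetical.

```python
import unicodedata

def detect_script(line):
    """Crude placeholder for a real line-level language identifier.

    Classifies a line as 'ar' if most of its letters are Arabic-script,
    else 'en'; returns None for lines with no letters.
    """
    letters = [c for c in line if c.isalpha()]
    if not letters:
        return None
    arabic = sum('ARABIC' in unicodedata.name(c, '') for c in letters)
    return 'ar' if arabic / len(letters) > 0.5 else 'en'

def line_pass(response, target_lang, detect=detect_script):
    # A response passes only if every non-empty line is detected
    # as the target language (empty/symbol-only lines are ignored).
    lines = [l for l in response.splitlines() if l.strip()]
    return all(detect(l) in (target_lang, None) for l in lines)

def lpr(responses, target_lang):
    # Line-level Pass Rate: percentage of responses whose lines all pass.
    passed = sum(line_pass(r, target_lang) for r in responses)
    return 100.0 * passed / len(responses)
```

For example, with the target language Arabic, a fully Arabic response passes while a response containing an English line fails, and `lpr` reports the fraction of passing responses as a percentage.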
Results
The evaluation reveals that Llama Instruct and Mistral models exhibit significant language confusion, while models like Command R and GPT-3.5 Turbo fare better, particularly in monolingual settings. However, even the strongest models struggle with cross-lingual generation, highlighting a gap in current LLM capabilities.
| Model | Monolingual LPR (%) | Cross-lingual LPR (%) |
|---|---|---|
| Llama 2 70B-I | 48.3 | 38.4 |
| Llama 3 70B-I | 46.0 | 30.3 |
| Mixtral 8x7B | 73.0 | 69.0 |
| Mistral Large | 69.9 | 58.2 |
| Command R | 98.6 | 68.1 |
| Command R+ | 99.2 | 91.2 |
| GPT-3.5 Turbo | 99.1 | 89.8 |
| GPT-4 Turbo | 99.3 | 90.3 |
Mitigation Techniques
The authors propose several strategies to mitigate language confusion:
- Reducing Temperature and Nucleus Size: Lowering the sampling temperature and shrinking the nucleus (the top-p threshold) makes low-probability tokens from the wrong language less likely to be sampled.
- Few-shot Prompting: Providing demonstrations in the correct language markedly reduces language confusion, with the largest gains on non-instruction-tuned (base) LLMs.
- Multilingual Instruction Tuning: Even limited multilingual tuning can drastically reduce language confusion. However, preference tuning results in varied outcomes.
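The first mitigation above can be illustrated with a self-contained sketch of temperature plus nucleus (top-p) sampling over a toy vocabulary. The token names and logit values are invented for illustration; the point is that a smaller nucleus drops low-probability tokens, including stray wrong-language ones, from the candidate set entirely.

```python
import math
import random

def nucleus_sample(logits, temperature=0.3, top_p=0.7, rng=random.Random(0)):
    """Sample one token from {token: logit} using temperature + top-p."""
    tokens = list(logits)
    # Temperature-scaled softmax (lower temperature sharpens the distribution).
    scaled = [logits[t] / temperature for t in tokens]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(zip(tokens, (e / total for e in exps)),
                    key=lambda kv: kv[1], reverse=True)
    # Keep the smallest prefix of tokens whose cumulative mass reaches top_p:
    # everything outside this "nucleus" can never be sampled.
    nucleus, mass = [], 0.0
    for tok, p in ranked:
        nucleus.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    toks, probs = zip(*nucleus)
    return rng.choices(toks, weights=probs)[0]
```

With a dominant correct-language token ("the", logit 5.0) and a low-probability wrong-language token ("le", logit 1.0), a low temperature and small top-p exclude "le" from the nucleus, so it is never generated.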
Implications and Future Directions
Practical Implications
- Enhanced Utility for Multilingual Users: Addressing language confusion can significantly enhance the user experience for non-English speakers and make AI applications more inclusive.
- Better Model Training Protocols: The findings emphasize the need for more balanced and multilingual training datasets to reduce English-centric bias in LLMs.
Theoretical Implications
The paper offers insights into the behavior of LLMs under multilingual contexts, particularly how pre-training and fine-tuning affect language generation. This understanding could guide future research on model architectures and training regimes.
Future Developments
- Conversational Contexts: Extending the analysis to multi-turn conversations will be crucial, as language switching behaviors may differ in such settings.
- Cross-lingual Contextual Inputs: Investigating LLMs’ performance on tasks requiring contextual understanding across multiple languages will offer deeper insights.
- Fine-grained Control: Developing techniques for finer control over language generation, at both the sentence and the word level, remains an open problem.
- Language Variety Inclusion: Evaluating models on language varieties and dialects will further ensure that AI applications are truly global.
In conclusion, the paper presents a rigorous analysis of language confusion in LLMs and offers practical solutions to mitigate this issue. The proposed LCB provides a foundation for future evaluations and improvements in multilingual LLMs. This research not only enhances our understanding of LLM behavior but also paves the way for more inclusive and robust AI systems.