How do Large Language Models Handle Multilingualism? (2402.18815v3)

Published 29 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated impressive capabilities across diverse languages. This study explores how LLMs handle multilingualism. Based on observed language ratio shifts among layers and the relationships between network structures and certain capabilities, we hypothesize the LLM's multilingual workflow ($\texttt{MWork}$): LLMs initially understand the query, converting multilingual inputs into English for task-solving. In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures, respectively. In the final layers, LLMs generate responses aligned with the original language of the query. To verify $\texttt{MWork}$, we introduce Parallel Language-specific Neuron Detection ($\texttt{PLND}$) to identify activated neurons for inputs in different languages without any labeled data. Using $\texttt{PLND}$, we validate $\texttt{MWork}$ through extensive experiments involving the deactivation of language-specific neurons across various layers and structures. Moreover, $\texttt{MWork}$ allows fine-tuning of language-specific neurons with a small dataset, enhancing multilingual abilities in a specific language without compromising others. This approach results in an average improvement of $3.6\%$ for high-resource languages and $2.3\%$ for low-resource languages across all tasks with just $400$ documents.

LLMs handle multilingualism through a structured process that leverages different layers of their architecture for understanding, processing, and generating text in multiple languages. As detailed in "How do Large Language Models Handle Multilingualism?" (Zhao et al., 29 Feb 2024), this process is described by a multilingual workflow (MWork) and validated using Parallel Language-specific Neuron Detection (PLND).

Multilingual Workflow (MWork) in LLMs

The MWork framework posits that LLMs manage multilingual inputs via three distinct stages, each localized to specific layers within the model (see the probing sketch after this list):

  • Understanding (Initial Layers): The initial layers are responsible for converting multilingual inputs into a unified representation, effectively translating them into English for subsequent processing. This translation facilitates a common ground for task-solving, irrespective of the input language.
  • Task-Solving (Intermediate Layers): The intermediate layers primarily operate in English, utilizing self-attention and feed-forward networks. Self-attention mechanisms are employed for reasoning, while feed-forward networks integrate multilingual knowledge to enrich the factual content. This stage is pivotal for the LLM's ability to "think" and derive solutions.
  • Response Generation (Final Layers): The final layers generate responses in the original language of the query. This involves translating the English-centric thought process back into the user's language, ensuring coherent and contextually relevant outputs.
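
This layer-wise picture is motivated by the observed shift in the language of intermediate representations across layers. A minimal logit-lens-style probe, sketched below, makes this visible: each layer's hidden state is projected through the final norm and LM head, and the language of the top predicted token tends to drift toward English in the middle layers and back to the query language near the end. This is an illustration, not the paper's exact measurement; the checkpoint name is an assumption and any Llama-style model would do.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"  # assumes a GPU + accelerate
)
model.eval()

prompt = "Quelle est la capitale de la France ?"   # non-English query
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project each layer's last-token hidden state through the final norm and the LM head,
# then inspect the most likely next token at every depth.
for layer_idx, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[0, -1]))
    top_token = tok.decode([int(logits.argmax())])
    print(f"layer {layer_idx:2d}: {top_token!r}")
```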

Parallel Language-specific Neuron Detection (PLND)

To empirically validate the MWork framework, the paper employs PLND, a novel method for identifying and quantifying the significance of individual neurons in relation to the input language, without relying on explicit task labels. PLND involves feeding a free text corpus of a specific language into the model and isolating the neurons that consistently activate.

The PLND method is defined for both the feed-forward and self-attention layers. For the feed-forward layer in Llama2, the importance of a neuron is quantified as the difference in the layer's output when the corresponding neuron of W_up is activated versus deactivated, computed efficiently in parallel using a diagonal mask matrix. For the self-attention layer, importance is measured analogously as the change in the attention weights when a specific neuron in W_Q or W_K is deactivated, again enabling parallel computation.
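
As a concrete illustration of the feed-forward case, the sketch below scores every intermediate neuron for a single hidden state under the Llama-2 FFN parameterization (gate, up, and down projections). It relies on the observation that deactivating intermediate neuron i removes the contribution h[i] * W_down[:, i] from the output, so all neurons can be scored in one vectorized step; the exact PLND formulation and the self-attention variant may differ in detail.

```python
import torch
import torch.nn.functional as F

def ffn_neuron_importance(x: torch.Tensor,
                          W_gate: torch.Tensor,
                          W_up: torch.Tensor,
                          W_down: torch.Tensor) -> torch.Tensor:
    """Score each intermediate FFN neuron for one hidden state x.

    Llama-2-style FFN: FFN(x) = W_down @ (silu(W_gate @ x) * (W_up @ x)).
    Zeroing intermediate neuron i removes the column contribution h[i] * W_down[:, i],
    so the output change for every neuron is computed in parallel, without a loop.
    Shapes: W_gate, W_up: (d_inter, d_hidden); W_down: (d_hidden, d_inter); x: (d_hidden,).
    """
    h = F.silu(W_gate @ x) * (W_up @ x)      # (d_inter,) intermediate activations
    contributions = W_down * h               # (d_hidden, d_inter); column i is neuron i's share
    return contributions.norm(dim=0)         # (d_inter,) importance per neuron

# Toy usage with random weights (hypothetical sizes)
d_hidden, d_inter = 8, 32
x = torch.randn(d_hidden)
imp = ffn_neuron_importance(x,
                            torch.randn(d_inter, d_hidden),
                            torch.randn(d_inter, d_hidden),
                            torch.randn(d_hidden, d_inter))
print(imp.topk(5).indices)  # most relevant neurons for this input
```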

Empirical Validation Through Neuron Deactivation

The paper provides empirical evidence by deactivating language-specific neurons in different layers and observing the impact on performance (see the deactivation sketch after this list):

  • Understanding Layer: Deactivating language-specific neurons in the understanding layer significantly impairs performance in non-English languages while maintaining English performance. This observation supports the hypothesis that these layers are crucial for processing and translating non-English inputs.
  • Task-Solving Layer: Deactivating language-specific neurons in the task-solving layers reduces performance across all languages, including English. This result corroborates the idea that the task-solving process heavily depends on English. Disabling the self-attention structure impairs the ability to solve tasks across all languages, whereas deactivating language-specific neurons within the feed-forward structure predominantly affects non-English languages.
  • Generation Layer: Deactivating language-specific neurons in the generation layer affects the model's ability to generate outputs in non-English languages, as expected.
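
A minimal sketch of such a deactivation experiment, assuming a Hugging Face Llama-2 checkpoint and a hypothetical `neuron_ids` mapping produced by PLND, is to zero the selected intermediate activations with forward hooks on the chosen layers' up-projections and then evaluate the ablated model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# neuron_ids[l] would come from PLND; random indices here are purely a placeholder.
layers_to_ablate = range(0, 8)   # e.g. the "understanding" layers
neuron_ids = {l: torch.randint(0, model.config.intermediate_size, (500,))
              for l in layers_to_ablate}

def make_hook(idx):
    def hook(module, inputs, output):
        # Zeroing the up-projection output also zeroes the gated product h[i].
        output[..., idx.to(output.device)] = 0.0
        return output
    return hook

handles = [model.model.layers[l].mlp.up_proj.register_forward_hook(make_hook(neuron_ids[l]))
           for l in layers_to_ablate]

# Evaluate the ablated model on a multilingual prompt, then remove the hooks.
prompt = "¿Cuál es la capital de Francia?"
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()
```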

Furthermore, the paper reveals that languages from the same family tend to exhibit a higher degree of overlap in their language-specific neurons. English neurons show limited overlap with other languages, underscoring the predominant role of English-specific neurons within the model.
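
The overlap behind this observation can be quantified directly once per-language neuron sets are available. The sketch below uses hypothetical neuron-id sets and Jaccard similarity as the overlap measure; the paper's exact statistic may differ.

```python
# Hypothetical per-language neuron sets (global neuron ids) as PLND would produce.
neurons = {
    "en": {0, 1, 2, 3, 4},
    "fr": {3, 4, 5, 6, 7},
    "es": {4, 5, 6, 7, 8},
    "zh": {20, 21, 22, 23},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

for a in neurons:
    for b in neurons:
        if a < b:
            print(f"{a}-{b}: overlap = {jaccard(neurons[a], neurons[b]):.2f}")
```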

Fine-tuning Language-Specific Neurons

The paper demonstrates that fine-tuning language-specific neurons with a small number of contextual examples can enhance the multilingual capabilities of LLMs. This targeted fine-tuning results in performance improvements, particularly in multilingual understanding and generation: an average improvement of 3.6% for high-resource languages and 2.3% for low-resource languages across all tasks with just 400 documents.
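
One way to realize such targeted fine-tuning, sketched below under the assumption of a loaded Llama-2 model and the hypothetical `neuron_ids` mapping from the earlier sketches, is to freeze all parameters and mask gradients so that only the rows of the selected layers' up- and gate-projections corresponding to language-specific neurons are updated:

```python
import torch

# Assume `model` and `neuron_ids` exist as in the deactivation sketch above.
# (In practice, load the model in fp32/bf16 for stable training.)
for p in model.parameters():
    p.requires_grad_(False)

def keep_only_rows(grad, rows):
    # Zero the gradient everywhere except the rows for the selected neurons.
    mask = torch.zeros_like(grad)
    mask[rows] = 1.0
    return grad * mask

trainable = []
for l, idx in neuron_ids.items():
    mlp = model.model.layers[l].mlp
    for proj in (mlp.up_proj, mlp.gate_proj):   # rows map 1:1 to intermediate neurons
        proj.weight.requires_grad_(True)
        proj.weight.register_hook(lambda g, rows=idx: keep_only_rows(g, rows))
        trainable.append(proj.weight)

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
# ...standard causal-LM training loop over ~400 target-language documents goes here...
```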

In summary, LLMs process multilingual inputs by converting them into a unified representation (often English) in the initial layers, leveraging English for task-solving in the intermediate layers, and generating responses in the original language in the final layers. Techniques like PLND enable the identification and manipulation of language-specific neurons, offering insights into the multilingual capabilities of LLMs and enabling targeted fine-tuning for enhanced performance.

Authors (5)
  1. Yiran Zhao
  2. Wenxuan Zhang
  3. Guizhen Chen
  4. Kenji Kawaguchi
  5. Lidong Bing