
Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications (2402.16367v3)

Published 26 Feb 2024 in cs.CL

Abstract: Recently, LLMs have achieved tremendous breakthroughs in the field of NLP, yet their internal neuron activity when processing different languages remains poorly understood. We designed a method to convert dense LLMs into fine-grained MoE architectures, and then visually studied the multilingual activation patterns of LLMs through expert activation frequency heatmaps. Through comprehensive experiments across different model families, model sizes, and variants, we analyzed the similarities and differences in the internal neuron activation patterns of LLMs when processing different languages. Specifically, we investigated the distribution of high-frequency activated experts, multilingual shared experts, whether multilingual activation patterns are related to language families, and the impact of instruction tuning on activation patterns. We further explored leveraging the discovered differences in expert activation frequencies to guide sparse activation and pruning. Experimental results demonstrated that our method significantly outperformed random expert pruning and even exceeded the performance of unpruned models in some languages. Additionally, we found that configuring different pruning rates for different layers based on differences in activation levels could achieve better results. Our findings reveal the multilingual processing mechanisms within LLMs and utilize these insights to offer new perspectives for applications such as sparse activation and model pruning.
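
The workflow sketched in the abstract (record how often each fine-grained expert fires per language, visualize the counts as a heatmap, then prune experts by activation frequency with layer-specific rates) can be illustrated with a short example. This is not the authors' implementation: the helper names, array shapes, and per-layer keep ratios below are illustrative assumptions, and only the general idea of frequency-guided, per-layer expert pruning follows the paper.

```python
# Minimal sketch, assuming per-token expert routing decisions have already been
# collected from a fine-grained MoE model on a monolingual corpus.
import numpy as np

def activation_heatmap(routing, num_experts):
    """routing: list over layers; each entry is a 1-D array of the expert index
    chosen for every token. Returns (num_layers, num_experts) activation frequencies."""
    heatmap = np.zeros((len(routing), num_experts))
    for layer, chosen in enumerate(routing):
        counts = np.bincount(chosen, minlength=num_experts)
        heatmap[layer] = counts / max(len(chosen), 1)  # normalize counts to frequencies
    return heatmap

def frequency_guided_masks(heatmap, keep_ratios):
    """Keep the top-k most frequently activated experts per layer; k is set by a
    layer-specific keep ratio (an assumption standing in for per-layer pruning rates)."""
    num_layers, num_experts = heatmap.shape
    masks = np.zeros_like(heatmap, dtype=bool)
    for layer in range(num_layers):
        k = max(1, int(round(keep_ratios[layer] * num_experts)))
        top = np.argsort(heatmap[layer])[::-1][:k]  # indices of the most active experts
        masks[layer, top] = True
    return masks

# Toy usage: 4 layers, 16 experts, random routing traces standing in for a real corpus.
rng = np.random.default_rng(0)
routing = [rng.integers(0, 16, size=10_000) for _ in range(4)]
heat = activation_heatmap(routing, num_experts=16)
masks = frequency_guided_masks(heat, keep_ratios=[0.75, 0.5, 0.5, 0.25])
print(masks.sum(axis=1))  # number of experts retained per layer
```

In practice the masks would be applied per language (or shared across languages via the multilingual shared experts the paper identifies), and the keep ratios would be tuned from the observed per-layer activation level differences rather than fixed by hand.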

Authors (6)
  1. Weize Liu (5 papers)
  2. Yinlong Xu (18 papers)
  3. Hongxia Xu (24 papers)
  4. Jintai Chen (57 papers)
  5. Xuming Hu (120 papers)
  6. Jian Wu (314 papers)