Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models (2406.16135v1)
Abstract: Large language models (LLMs) are typically multilingual because they are pretrained on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be truly crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz) contexts. Simple inference-time mitigation methods offer only limited improvement. We therefore propose fine-tuning LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is publicly available at https://github.com/google-research/crosslingual-knowledge-barriers.
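The proposed mitigation, fine-tuning on mixed-language data, can be illustrated with a minimal sketch. It assumes that "mixed-language" means documents whose sentences are randomly translated into other languages before fine-tuning; the `translate_sentence` stub and all function names here are hypothetical placeholders, not the paper's released implementation (see the repository linked above for that).

```python
import random

# Hypothetical stand-in for a machine-translation call (API or local MT model).
# It only tags the sentence so the sketch stays runnable; the paper's actual
# data pipeline lives in the linked repository.
def translate_sentence(sentence: str, target_lang: str) -> str:
    return f"[{target_lang}] {sentence}"

def mix_languages(sentences, target_langs, mix_prob=0.5, seed=0):
    """Translate a random subset of sentences into other languages,
    producing one mixed-language document for the fine-tuning corpus."""
    rng = random.Random(seed)
    mixed = []
    for sent in sentences:
        if rng.random() < mix_prob:
            mixed.append(translate_sentence(sent, rng.choice(target_langs)))
        else:
            mixed.append(sent)  # keep the sentence in its original language
    return " ".join(mixed)

# Example: turn an English WikiText-style passage into a mixed-language passage.
passage = [
    "The castle stood on a hill above the village.",
    "Its library held thousands of old books.",
]
print(mix_languages(passage, target_langs=["fr", "de"]))
```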
- Lynn Chua (16 papers)
- Badih Ghazi (78 papers)
- Yangsibo Huang (40 papers)
- Pritish Kamath (48 papers)
- Ravi Kumar (146 papers)
- Pasin Manurangsi (127 papers)
- Amer Sinha (11 papers)
- Chulin Xie (27 papers)
- Chiyuan Zhang (57 papers)