This paper investigates whether multilingual LLMs (mLLMs) possess culturally diverse reasoning abilities, using proverbs and sayings as a proxy for cultural common ground (Liu et al., 2023). The authors argue that since language and culture are intertwined, mLLMs serving diverse communities should be able to reason and communicate in culturally relevant ways. They pose three main research questions: (1) Do mLLMs embed knowledge of cultural common ground, and does it affect reasoning? (2) Can mLLMs reason in contexts requiring understanding of cultural common ground? (3) Can mLLMs reason cross-culturally (using translations), and are there "culture gaps"?
To address these questions, the researchers created the MAPS (Multicultural Proverbs and Sayings) dataset. This dataset covers six languages (English, German, Russian, Bengali, Mandarin Chinese, Indonesian) and includes:
- A collection of proverbs and sayings for each language.
- Short conversational contexts where each proverb is used naturally.
- A binary-choice inference task for each proverb-context pair, asking "What does the person mean by {proverb}?", with one correct and one incorrect interpretation provided.
- Labels indicating whether the proverb's usage in the context is figurative or literal.
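To make the structure concrete, a single MAPS-style instance could be represented as in the sketch below; the field names, the proverb, and the two interpretations are illustrative assumptions, not the released schema.

```python
# Hypothetical MAPS-style instance; field names and values are illustrative,
# not the official dataset schema.
example_instance = {
    "language": "en",
    "proverb": "The early bird catches the worm.",
    "context": (
        "A: I signed up for the 6 a.m. session before all the slots were gone. "
        "B: Well, the early bird catches the worm."
    ),
    "question": "What does the person mean by 'The early bird catches the worm.'?",
    "options": {
        "A": "Acting early gives you an advantage.",       # correct interpretation
        "B": "Birds that wake up early eat more worms.",   # incorrect interpretation
    },
    "answer": "A",
    "usage": "figurative",  # figurative vs. literal label
}
```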
The dataset was created using a model-in-the-loop approach (GPT-3.5) to generate initial conversational contexts, which were then heavily revised or rewritten by native speakers to ensure naturalness and correct proverb usage. The final MAPS dataset contains 2313 instances. Analysis using multilingual sentence embeddings (LaBSE) showed that proverbs cluster distinctly by language/culture, highlighting their cultural diversity.
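The clustering analysis can be sketched roughly as follows, assuming the sentence-transformers LaBSE checkpoint and scikit-learn; the proverbs and the choice of PCA for the 2D projection are illustrative, not the paper's exact pipeline.

```python
# Sketch: embed proverbs with LaBSE and inspect whether they separate by language.
# Requires `pip install sentence-transformers scikit-learn`; proverbs are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

proverbs = {
    "en": ["Actions speak louder than words.", "The early bird catches the worm."],
    "de": ["Übung macht den Meister.", "Aller Anfang ist schwer."],
    "zh": ["水滴石穿。", "熟能生巧。"],
}

model = SentenceTransformer("sentence-transformers/LaBSE")
texts = [p for ps in proverbs.values() for p in ps]
langs = [lang for lang, ps in proverbs.items() for _ in ps]

embeddings = model.encode(texts, normalize_embeddings=True)

# Project to 2D; plotting these points coloured by language would reveal
# whether proverbs form language/culture-specific clusters.
points = PCA(n_components=2).fit_transform(embeddings)
for lang, (x, y) in zip(langs, points):
    print(f"{lang}: ({x:.3f}, {y:.3f})")
```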
A suite of experiments was conducted on various open-source mLLMs (XLM-R, mT0, BLOOMZ, XGLM) and LLaMA-2 models of different sizes using zero-shot prompting with English templates. The experiments focused on:
- Memorization: Assessing whether models could complete a proverb when its last word was masked or removed, used as an indicator of whether the model had likely seen the proverb during pretraining (a sketch of such a probe follows this list).
- Reasoning: Evaluating performance on the MAPS inference task (selecting the correct interpretation, A or B, given the context) by comparing the output logits for 'A' and 'B' (see the second sketch after this list).
- Figurative vs. Literal Reasoning: Comparing reasoning performance on proverbs labeled as figurative versus literal.
- Negative Question Reasoning: Testing robustness by changing the prompt question to "What does the person not mean by the proverb?", requiring the model to select the incorrect interpretation.
- Cross-Cultural Gap: Evaluating performance on Chinese proverbs translated into English, via both machine translation (MT) and human-adapted translation (HT), to separate language gaps from gaps in cultural understanding.
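A minimal sketch of the memorization probe mentioned above, assuming a Hugging Face causal LM; the model name, prompt wording, and matching rule are stand-ins rather than the paper's exact protocol.

```python
# Sketch: drop the last word of a proverb and check whether the model restores it.
# "gpt2" is a stand-in; the paper probes multilingual LLMs such as mT0 or BLOOMZ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def completes_proverb(proverb: str) -> bool:
    """Return True if the model's continuation contains the removed last word."""
    words = proverb.rstrip(".").split()
    prefix, last_word = " ".join(words[:-1]), words[-1]
    prompt = f"Complete the saying: {prefix}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    continuation = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return last_word.lower() in continuation.lower()

print(completes_proverb("The early bird catches the worm."))
```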
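The binary-choice reasoning evaluation can be sketched similarly: score the next-token logits of the two option letters and pick the larger one. The prompt template below is an illustrative reconstruction, not the paper's exact wording; passing `negative=True` swaps in the negative question used in the robustness test.

```python
# Sketch: answer the A/B inference task by comparing next-token logits for "A" vs "B".
# "gpt2" is a stand-in model; the template is an illustrative reconstruction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def choose_option(context: str, proverb: str, option_a: str, option_b: str,
                  negative: bool = False) -> str:
    """Pick 'A' or 'B' by comparing the logits assigned to the two option letters."""
    question = (f"What does the person not mean by '{proverb}'?" if negative
                else f"What does the person mean by '{proverb}'?")
    prompt = f"{context}\n{question}\nA. {option_a}\nB. {option_b}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    id_a = tokenizer(" A", add_special_tokens=False)["input_ids"][0]
    id_b = tokenizer(" B", add_special_tokens=False)["input_ids"][0]
    return "A" if logits[id_a] > logits[id_b] else "B"

print(choose_option(
    context="A: I signed up for the 6 a.m. session. B: The early bird catches the worm.",
    proverb="The early bird catches the worm.",
    option_a="Acting early gives you an advantage.",
    option_b="Birds that wake up early eat more worms.",
))
```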
Key findings include:
- Memorization Varies: mLLMs memorize proverbs to different extents, scaling with model size but showing significant bias towards English and Chinese, with lower performance on Indonesian, Bengali, and Russian.
- Memorization ≠ Reasoning: Higher memorization rates did not consistently correlate with better performance on the contextual reasoning task. Model architecture had less impact than scale.
- Figurative Proverbs are Harder: Models generally struggled more with figurative proverbs compared to literal ones across most languages, although the opposite was observed for Chinese.
- Negative Questions Cause Failure: Performance drastically dropped across almost all models when asked to identify the incorrect meaning. Larger models often performed worse on this task than smaller ones (inverse scaling), and performance gaps between languages widened. This suggests a strong bias towards selecting the "correct" or expected answer, potentially due to training data patterns.
- Culture Gaps Exist: When testing on translated Chinese proverbs (Zh -> En), even the best models (mT0-XXL, LLaMA-2 13B) showed a performance drop compared to their performance on original English data. Human-adapted translations (HT) improved over machine translations (MT), indicating a language gap. However, a gap remained between HT performance and the model's performance on the original target language (English), defined as the "culture gap", suggesting difficulties in reasoning about cultural concepts even when language barriers are reduced.
The paper concludes that while mLLMs show some ability to recognize proverbs, their capacity for culturally nuanced reasoning in context is limited. Memorization doesn't guarantee understanding, figurative language is challenging, handling negation/negative questions is poor, and significant culture gaps persist even with translation. The work highlights the need for improving the cultural diversity and robustness of mLLMs. The MAPS dataset is released to facilitate further research.