NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models (2404.12464v7)

Published 18 Apr 2024 in cs.CL
Abstract: To be effectively and safely deployed to global user populations, LLMs must adapt outputs to user values and culture, not just know about them. We introduce NormAd, an evaluation framework to assess LLMs' cultural adaptability, specifically measuring their ability to judge social acceptability across different levels of cultural norm specificity, from abstract values to explicit social norms. As an instantiation of our framework, we create NormAd-Eti, a benchmark of 2.6k situational descriptions representing social-etiquette related cultural norms from 75 countries. Through comprehensive experiments on NormAd-Eti, we find that LLMs struggle to accurately judge social acceptability across these varying degrees of cultural contexts and show stronger adaptability to English-centric cultures over those from the Global South. Even in the simplest setting where the relevant social norms are provided, our best models' performance (<82%) lags behind humans (>95%). In settings with abstract values and country information, model performance drops substantially (<60%), while human accuracy remains high (>90%). Furthermore, we find that models are better at recognizing socially acceptable versus unacceptable situations. Our findings showcase the current pitfalls in socio-cultural reasoning of LLMs which hinder their adaptability for global audiences.

Evaluating the Cultural Adaptability of LLMs with the NormAd Dataset

Introduction to NormAd Dataset

In this paper, the authors introduce NormAd, a dataset designed to rigorously assess the cultural adaptability of LLMs. It contains 2.6k stories that operationalize cultural norms from 75 countries. Each story is accompanied by question-answer pairs that measure a model's ability to judge normative social acceptability under cultural contexts of varying specificity.
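To make the dataset's structure concrete, below is a minimal sketch of what a single NormAd-Eti record might contain; the field names and example values are hypothetical illustrations, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class NormAdExample:
    """One hypothetical NormAd-Eti item: a short story plus the cultural
    context available at three levels of specificity."""
    country: str        # one of the 75 covered countries
    rule_of_thumb: str  # explicit social norm (most specific context)
    value: str          # abstract cultural value the norm reflects
    story: str          # situational description to be judged
    label: str          # gold social-acceptability judgment, e.g. "acceptable" / "unacceptable"

# Illustrative record (invented for this sketch)
example = NormAdExample(
    country="Japan",
    rule_of_thumb="It is polite to present a gift with both hands.",
    value="Guests should show respect and consideration for their hosts.",
    story="While visiting a friend's family, Kenji handed over a wrapped gift with both hands.",
    label="acceptable",
)
```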

Key Findings

The authors present several key findings:

  1. Model Performance in Different Contexts: LLMs exhibit difficulties across all contextual granularities, particularly with non-English-centric cultural norms. Notably, even top-performing models such as Mistral-7b-Instruct reach an accuracy of at most 81.8%, considerably below the human accuracy of 95.6%.
  2. Accuracy Across Cultural Norms: LLMs show marked deficiencies in adapting outputs suitable for culturally diverse contexts. The struggle is pronounced in scenarios involving norm violations and culturally distinct practices like gift-giving across different cultures.
  3. Bias Identification: The models are biased towards confirming the acceptability of stories that adhere to cultural norms rather than identifying violations, pointing to an inherent agreement bias in current LLM setups (a sketch of how this gap can be measured follows this list).
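A minimal sketch of how such an agreement bias could be quantified, assuming a list of (gold, predicted) label pairs; the function and label strings are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def per_label_accuracy(pairs):
    """Accuracy computed separately for each gold label.

    A large gap between the 'acceptable' and 'unacceptable' rows reflects the
    agreement bias discussed above: the model more readily confirms
    norm-adhering stories than it flags norm-violating ones.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        total[gold] += 1
        correct[gold] += int(gold == pred)
    return {label: correct[label] / total[label] for label in total}

# Toy usage with invented predictions
preds = [("acceptable", "acceptable"), ("unacceptable", "acceptable"),
         ("acceptable", "acceptable"), ("unacceptable", "unacceptable")]
print(per_label_accuracy(preds))  # {'acceptable': 1.0, 'unacceptable': 0.5}
```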

Dataset Construction and Validation

Narrative Generation: Drawing on the Cultural Atlas, the researchers generated narrative stories depicting everyday scenarios, each grounded in a specific Rule of Thumb (RoT), the broader cultural Value it reflects, and Country-specific information.
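A hedged sketch of how such norm-grounded story generation could be prompted; the prompt template below is an assumption for illustration and does not reproduce the paper's actual prompts or choice of generator model.

```python
def build_story_prompt(country: str, rule_of_thumb: str, violate: bool) -> str:
    """Construct a generation prompt for one NormAd-style story (illustrative template)."""
    behaviour = "subtly violates" if violate else "follows"
    return (
        f"Write a short, everyday story set in {country} in which the main "
        f"character {behaviour} the following etiquette norm, without naming "
        f"the norm explicitly:\n"
        f"Norm: {rule_of_thumb}\n"
        f"Keep the story to 3-4 sentences and end it at the moment the norm "
        f"is (or is not) observed."
    )

prompt = build_story_prompt(
    country="Japan",
    rule_of_thumb="It is polite to present a gift with both hands.",
    violate=True,
)
# The resulting prompt would then be sent to a story-generation LLM of choice.
```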

Validation Methods: The dataset underwent automated and manual validation to ensure the relevance and cultural accuracy of the narratives, including checks that each RoT is relevant to its story and that each Value entails its RoT.
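One plausible way to automate the Value-to-RoT entailment check is with an off-the-shelf NLI model. The sketch below uses the public `roberta-large-mnli` checkpoint via Hugging Face `transformers` as an assumed stand-in; it is not the authors' validation code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "roberta-large-mnli"  # assumed public NLI checkpoint, not the paper's choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Return the probability that `premise` entails `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    # roberta-large-mnli label order: contradiction, neutral, entailment
    return probs[2].item()

# Check that an abstract Value plausibly entails its Rule of Thumb
score = entailment_prob(
    "Guests should show respect and consideration for their hosts.",
    "It is polite to present a gift with both hands.",
)
print(f"entailment probability: {score:.2f}")
```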

Experimental Results

In detailed experiments using the NormAd dataset, the results indicate:

  • Contextualization Challenges: Models show lower accuracy under the broader Value and Country-only contexts than under the detailed RoT context, which states the relevant norm directly (a sketch of this evaluation setup follows this list).
  • Parameter Effect: There is a slight improvement in performance with increased model parameters; however, this is not linear and shows diminishing returns at higher scales.
  • Cultural Performance Discrepancy: There is a noticeable performance disparity across cultures, where models tend to perform better on narratives based on Western norms compared to those from non-Western countries such as those in the African-Islamic cultural zones.
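A minimal sketch of the evaluation setup at the three context granularities, assuming a hypothetical `ask_model(context, story)` helper that returns the model's acceptability judgment and dataset records shaped like the earlier sketch; this illustrates the comparison, not the paper's code.

```python
def build_context(example: dict, level: str) -> str:
    """Assemble the cultural context shown to the model at one granularity."""
    if level == "rot":      # explicit norm: all needed information given
        return f"Norm: {example['rule_of_thumb']}"
    if level == "value":    # abstract value only
        return f"Value: {example['value']}"
    if level == "country":  # country name only
        return f"Country: {example['country']}"
    raise ValueError(f"unknown context level: {level}")

def accuracy_by_context(dataset: list, ask_model) -> dict:
    """Accuracy of `ask_model(context, story)` at each context granularity."""
    results = {}
    for level in ("rot", "value", "country"):
        correct = sum(
            ask_model(build_context(ex, level), ex["story"]) == ex["label"]
            for ex in dataset
        )
        results[level] = correct / len(dataset)
    return results
```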

Theoretical and Practical Implications

Theoretical Implications: The findings challenge the robustness and the claimed universality of LLMs, underscoring the need for models that can genuinely understand and adapt to the cultural complexities of global user bases in an equitable manner.

Practical Implications: Practically, the results advocate for a reconsideration of how cultural adaptability is integrated and evaluated in LLMs, suggesting that merely increasing model size or relying on current training methods may not adequately address the biases and performance issues observed.

Future Research Directions

The authors propose a focus on enhancing cultural reasoning capabilities within LLMs by improving contextual understanding and adaptability during both training and inference. Future research could explore more dynamic and contextually aware training methodologies and perhaps multilingual and multicultural integration to better reflect global diversity.

Conclusion

Overall, this paper provides a critical look at the current limitations of LLMs in handling cultural diversity through the lens of the new, comprehensive NormAd dataset. It sets a benchmark for future research aimed at creating more culturally competent and globally equitable AI systems.

Authors (5)
  1. Abhinav Rao
  2. Akhila Yerukola
  3. Vishwa Shah
  4. Katharina Reinecke
  5. Maarten Sap