CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge (2404.06664v1)
Abstract: Frontier LLMs are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. Yet LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations rely primarily on expensive, restricted human annotations or on potentially outdated internet resources, and thus struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To combine the creativity and expert cultural knowledge of human annotators with the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance help annotators create, in a gamified manner, cultural questions that modern LLMs fail at. Importantly, higher levels of AI assistance (e.g., LLM-generated revision hints) empower users to create more difficult questions and increase their perceived creativity, shedding light on the promise of heavier AI assistance in modern evaluation-dataset creation. Through a series of one-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset built from users' red-teaming attempts, on which different families of modern LLMs achieve accuracies ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.
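The abstract describes two mechanics concretely enough to sketch: a gamified annotator loop in which a target LLM attempts each drafted question and an assistant LLM proposes a revision hint whenever the target succeeds, and the final accuracy measurement over the collected questions. Below is a minimal Python sketch of that workflow under assumed interfaces; the `Question` schema, the `LLM` callable type, and the prompt wording are illustrative assumptions, not the paper's released code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed interface: a model maps a prompt string to a reply string.
LLM = Callable[[str], str]

@dataclass
class Question:
    text: str             # annotator's red-teamed cultural question
    options: list[str]    # multiple-choice candidates
    answer: int           # index of the annotator-verified correct option

def format_prompt(q: Question) -> str:
    opts = "\n".join(f"{i}) {o}" for i, o in enumerate(q.options))
    return f"{q.text}\nAnswer with the option number only.\n{opts}"

def target_answers_correctly(target: LLM, q: Question) -> bool:
    # Simplistic reply parsing, sufficient for a sketch.
    return target(format_prompt(q)).strip().startswith(str(q.answer))

def revision_hint(assistant: LLM, q: Question) -> str:
    """One assumed mode of AI assistance: ask an assistant LLM how to
    make a question the target model already solved harder."""
    return assistant(
        "The target model answered this cultural question correctly:\n"
        f"{format_prompt(q)}\n"
        "Suggest one revision that makes it harder without making it "
        "ambiguous.")

def red_team_round(target: LLM, assistant: LLM, q: Question) -> Optional[str]:
    """One gamified round: the annotator 'wins' (returns None) if the
    target fails; otherwise a revision hint is returned."""
    if not target_answers_correctly(target, q):
        return None                       # question stumps the model: keep it
    return revision_hint(assistant, q)    # otherwise help the annotator revise

def accuracy(target: LLM, bench: list[Question]) -> float:
    """Benchmark score; the paper reports 37.7%-72.2% across model families."""
    return sum(target_answers_correctly(target, q) for q in bench) / len(bench)

# Example: score a stub model that always replies "0".
bench = [Question("Which gesture is considered rude in culture X?",
                  ["Gesture A", "Gesture B"], 1)]
print(accuracy(lambda prompt: "0", bench))  # -> 0.0: this question "wins"
```

The round structure mirrors the gamified framing in the abstract: a question scores when the target model fails, and otherwise the hint nudges the annotator toward a harder revision.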
Authors: Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi