
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

Published 9 Sep 2023 in cs.CL and cs.AI (arXiv:2309.04766v5)

Abstract: We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate (1) Most models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained "balanced multilingual" capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios.

Summary

  • The paper presents SeaEval as a comprehensive benchmark evaluating multilingual foundation models' cross-lingual consistency and cultural reasoning.
  • It introduces novel evaluation protocols such as instruction sensitivity and label shuffling to detect performance biases across languages.
  • The findings highlight current models' limitations in consistent multilingual performance, urging enhancements in training data and evaluation strategies.

Summary of SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

The paper "SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning" introduces the SeaEval benchmark, designed specifically to evaluate the capabilities of multilingual foundation models (MFMs). The authors present a comprehensive evaluation framework covering multiple aspects of multilingual and multicultural understanding, and highlight the significant challenges of effective cross-lingual knowledge transfer and cultural comprehension.

Key Components of SeaEval

  1. Multicultural and Multilingual Understanding: The benchmark incorporates a range of datasets aimed at evaluating models' ability to understand and engage with cultural contexts. This includes newly constructed datasets focusing on cultural knowledge from regions such as the United States, Singapore, China, and the Philippines. The inclusion of Singlish translation tasks further enhances the cultural dimension, highlighting models' need to adapt to linguistic diversity.
  2. Cross-Lingual Consistency: SeaEval emphasizes the often-overlooked issue of consistent performance on semantically equivalent queries posed in different languages. The authors show empirically that many MFMs answer such parallel queries inconsistently, contradicting the expectation that a model with well-generalized semantic representations should give the same response regardless of the query language (a minimal sketch of this check follows the list).
  3. Complex Reasoning and NLP Tasks: The benchmark covers traditional NLP tasks and complex reasoning scenarios, incorporating datasets tailored for assessing intricate reasoning processes in different languages. This serves as a rigorous testbed for evaluating both the linguistic understanding and problem-solving capabilities of MFMs.
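
To make the cross-lingual consistency notion concrete, here is a minimal Python sketch that measures how often a model answers index-aligned parallel queries identically across languages. The data format and function name are illustrative assumptions, not the paper's implementation.

```python
def cross_lingual_consistency(answers_by_language: dict[str, list[str]]) -> float:
    """Fraction of parallel questions answered identically in every language.

    `answers_by_language` maps a language code to the model's answers for
    the same questions, index-aligned across languages (assumed format).
    """
    languages = list(answers_by_language)
    num_questions = len(answers_by_language[languages[0]])
    identical = sum(
        len({answers_by_language[lang][i] for lang in languages}) == 1
        for i in range(num_questions)
    )
    return identical / num_questions


# Example: one answer flips between the English and Chinese versions.
answers = {
    "en": ["A", "C", "B", "D"],
    "zh": ["A", "C", "B", "B"],
}
print(cross_lingual_consistency(answers))  # 0.75
```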

Evaluation Protocols and Metrics

To comprehensively assess MFMs, SeaEval introduces novel evaluation protocols:

  • Instruction Sensitivity: This probes the robustness of models to varied instruction phrasings, exposing performance swings caused purely by how a prompt is worded.
  • Exposure Bias in Label Arrangements: By shuffling the order of answer options, the benchmark reveals positional and majority-label biases that shift scores independently of question content (a combined sketch of these two probes follows this list).
  • Cross-Lingual Consistency Metric (AC3): The authors propose a score that rewards models for answering both correctly and uniformly across languages, thereby encouraging cross-lingual alignment (see the AC3 sketch below).
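
As a concrete illustration of the first two probes, the following sketch generates perturbed variants of a single multiple-choice item. The paraphrase list, function name, and item format are illustrative assumptions, not SeaEval's actual protocol.

```python
import itertools

# Hypothetical paraphrases of the same task instruction; a robust model
# should score about the same under each phrasing.
INSTRUCTION_PARAPHRASES = [
    "Choose the correct option.",
    "Select the best answer from the choices below.",
    "Which of the following is correct?",
]

def perturbed_variants(question: str, options: list[str], answer: str):
    """Yield every (instruction paraphrase, option ordering) variant of a
    multiple-choice item, with the gold label re-mapped after shuffling.
    Assumes at most four options."""
    labels = "ABCD"[: len(options)]
    for order in itertools.permutations(options):
        body = "\n".join(
            f"({label}) {option}" for label, option in zip(labels, order)
        )
        gold = labels[order.index(answer)]
        for instruction in INSTRUCTION_PARAPHRASES:
            yield f"{instruction}\n{question}\n{body}", gold
```

Accuracy that varies noticeably across paraphrases signals instruction sensitivity; accuracy that varies across option orderings signals positional or majority-label bias.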
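The summary does not spell out the AC3 formula. A plausible sketch, assuming AC3 behaves like a harmonic mean of accuracy and cross-lingual consistency (so that neither consistently wrong answers nor accuracy in a single language scores well), is:

```python
def ac3(accuracy: float, consistency: float) -> float:
    """Assumed harmonic-mean combination of accuracy and cross-lingual
    consistency; the paper's exact AC3 definition may differ."""
    if accuracy + consistency == 0.0:
        return 0.0
    return 2 * accuracy * consistency / (accuracy + consistency)

# A model that is 80% accurate but only 60% consistent across languages
# is pulled toward the weaker of the two numbers:
print(ac3(0.80, 0.60))  # ~0.686
```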

Key Findings

The experimental results from SeaEval highlight several critical insights:

  • Instruction Sensitivity: Models like LLaMA-2 and ChatGPT show varying degrees of sensitivity to instruction phrasing, affecting evaluation outcomes significantly.
  • Exposure Bias: Many models still display biases linked to label order, emphasizing the need for more sophisticated evaluation strategies.
  • Inconsistent Multilingual Performance: Despite advancements, MFMs often fail to maintain consistent performance across multiple languages, particularly for low-resource languages, underscoring ongoing challenges in multilingual context generalization.
  • Cultural Comprehension: While models like GPT-4 achieve superior results across cultural reasoning tasks, there remains a gap in effectively embedding and aligning diverse cultural nuances across models, suggesting a need for more targeted training data and methodologies.

Implications and Future Directions

The findings from SeaEval reveal the limitations of current MFMs in achieving balanced multilingual proficiency and robust cultural understanding. Practically, this calls for increased efforts in training methodologies, linguistic diversity in pre-training data, and enhanced cross-lingual alignment strategies. Theoretically, it suggests avenues for research into more generalized semantic representations that can seamlessly transition across languages and cultural contexts.

Overall, the SeaEval benchmark provides a rigorous framework for evaluating multilingual foundation models, sparking discussion and innovation in the quest for more holistic and culturally aware AI systems. As the field continues to evolve, the insights from this work will be invaluable in steering both practical implementations and foundational research in the development of future multilingual AI systems.
