
ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning (2304.05613v1)

Published 12 Apr 2023 in cs.CL and cs.AI

Abstract: Over the last few years, LLMs have emerged as among the most important breakthroughs in NLP, fundamentally transforming research and development in the field. ChatGPT is one of the most exciting LLM systems developed recently, showcasing impressive language-generation skills and attracting broad public attention. Beyond its various applications in English, the model can also process and generate text in multiple languages owing to its multilingual training data. Given the broad adoption of ChatGPT for English across problems and areas, a natural question is whether ChatGPT can also be applied effectively to other languages, or whether more language-specific technologies need to be developed. Answering this question requires a thorough evaluation of ChatGPT over multiple tasks with diverse languages and large datasets (i.e., beyond reported anecdotes), which is still missing or limited in current research. Our work aims to fill this gap in the evaluation of ChatGPT and similar LLMs to provide more comprehensive information for multilingual NLP applications. While this is an ongoing effort that will include additional experiments in the future, the current paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources. We also focus on the zero-shot learning setting for ChatGPT to improve reproducibility and to better simulate the interactions of general users. Compared to previous models, our extensive experimental results demonstrate worse performance of ChatGPT across different NLP tasks and languages, calling for further research to develop better models and a deeper understanding of multilingual learning.

Analysis of Multilingual Capabilities in ChatGPT

The paper "ChatGPT Beyond English: Towards a Comprehensive Evaluation of LLMs in Multilingual Learning" offers a fine-grained evaluation of ChatGPT's performance across a wide range of NLP tasks in multiple languages. It investigates whether multilingual LLMs such as ChatGPT perform competently across diverse languages or whether developing models tailored to specific languages remains the more viable path.

The authors experiment with a total of 37 languages, spanning high-resource languages like English and Spanish, medium-resource languages like Turkish, and low- and extremely low-resource languages such as Bengali and Kyrgyz. The research covers seven central NLP tasks: Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Relation Extraction, Natural Language Inference (NLI), Question Answering (QA), Common Sense Reasoning (CSR), and Summarization. Crucially, the paper evaluates ChatGPT in a zero-shot setting, assessing its proficiency without task-specific fine-tuning or in-context examples.
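The exact prompts and API configuration are not reproduced in this summary, so the following Python sketch only illustrates what such a zero-shot evaluation loop could look like; the prompt wording, the `gpt-3.5-turbo` model identifier, the decoding settings, and the toy NLI example are all assumptions for illustration, not the authors' configuration.

```python
# Minimal sketch of a zero-shot evaluation loop in the spirit of the paper's
# setup. Requires the `openai` package (>= 1.0) and an OPENAI_API_KEY in the
# environment; all prompt wording below is an assumption, not the paper's.
from openai import OpenAI

client = OpenAI()

# Hypothetical NLI examples: (premise, hypothesis, gold label).
examples = [
    ("A man is playing a guitar.", "A person is making music.", "entailment"),
]

def zero_shot_nli(premise: str, hypothesis: str) -> str:
    """Ask the model for an NLI label with no in-context examples."""
    prompt = (
        "Decide the relationship between the two sentences.\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Answer with one word: entailment, neutral, or contradiction."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for the ChatGPT endpoint
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # deterministic decoding aids reproducibility
    )
    return response.choices[0].message.content.strip(" .\n").lower()

correct = sum(zero_shot_nli(p, h) == gold for p, h, gold in examples)
print(f"accuracy: {correct / len(examples):.2f}")
```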

The findings illustrate that ChatGPT generally underperforms in comparison to state-of-the-art models that are trained in a supervised fashion, particularly for semantically complex tasks such as NER, NLI, and QA. For instance, in NER and NLI, significant performance gaps were noted, indicating that current LLMs may not be well-suited as universal solvers across languages without substantial fine-tuning and domain adaptation.

However, for simpler, syntactic-level tasks like POS Tagging, ChatGPT demonstrated competitive performance, indicating strong lower-level grammatical skills. Interestingly, English task descriptions yielded better performance across various languages, suggesting either an inherent bias in ChatGPT towards English or that English prompts can serve as effective universal instructions, given English's lingua franca status in the training data.
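To make the contrast concrete, the sketch below renders the same POS-tagging request under both prompting conditions described above; the Vietnamese wording is an illustrative translation of the English instruction, not a prompt taken from the paper.

```python
# Hedged illustration of the two prompting conditions: an English task
# description versus one written in the target language (Vietnamese here).
PROMPTS = {
    "english": (
        "Tag each word in the following Vietnamese sentence with its "
        "part of speech:\n{sentence}"
    ),
    "target_language": (
        # Illustrative translation of the English instruction above.
        "Gán nhãn từ loại cho từng từ trong câu tiếng Việt sau:\n{sentence}"
    ),
}

def build_prompt(condition: str, sentence: str) -> str:
    """Render the task description for the chosen prompting condition."""
    return PROMPTS[condition].format(sentence=sentence)

print(build_prompt("english", "Tôi thích đọc sách."))
print(build_prompt("target_language", "Tôi thích đọc sách."))
```

Per the paper's finding, the first variant tends to score higher even though the input sentence itself stays in the target language.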

Furthermore, ChatGPT's zero-shot performance varied with the resource availability of the languages, and its capabilities in low-resource settings warrant further exploration. Certain observations, such as the model's stronger responses to English prompts than to task descriptions in the target language, point to promising directions for improving the multilingual robustness of LLMs.

The paper suggests that while LLMs can handle a broad swath of languages and tasks, they do so with varying degrees of success and a significant reliance on cross-lingual knowledge anchored in English. The ongoing necessity for task-specific models is evident, especially for high-stakes and domain-specific applications where precision and context sensitivity outweigh the generalist ability to transfer learned skills across languages.

The implications of this research are manifold: it reflects current limitations and biases in pre-trained LLMs, underscores the cost-versus-performance trade-off in deploying such models, and serves as a blueprint for future research aimed at reconciling broad multilingual competence with state-of-the-art task performance. Continued efforts must prioritize the inclusion of underrepresented languages to democratize the benefits of LLMs, which will involve refining pre-training data and methods to better capture the full spectrum of linguistic diversity.

Further studies should explore newly announced LLMs such as GPT-4, evaluate additional criteria such as model robustness and bias in multilingual settings, and expand empirical analyses to deepen our theoretical understanding of LLM capabilities and limitations.

This paper represents a meticulous effort to bridge existing gaps in LLM evaluations, providing a foundational step toward equitable and practical multilingual NLP solutions.

Authors (7)
  1. Viet Dac Lai (25 papers)
  2. Nghia Trung Ngo (8 papers)
  3. Amir Pouran Ben Veyseh (20 papers)
  4. Hieu Man (4 papers)
  5. Franck Dernoncourt (161 papers)
  6. Trung Bui (79 papers)
  7. Thien Huu Nguyen (61 papers)
Citations (226)