
Multilingual Performance Biases of Large Language Models in Education (2504.17720v1)

Published 24 Apr 2025 in cs.CL and cs.AI

Abstract: LLMs are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.

Summary

Evaluation of Multilingual Performance Biases in Educational LLMs

The paper "Multilingual Performance Biases of LLMs in Education" provides a comprehensive empirical investigation into the application of LLMs for educational tasks across multiple languages. It addresses the increasingly pertinent issue of how LLMs perform in non-English settings, which is crucial given the global diversity of educational environments. The authors analyze the efficacy of six prominent LLMs—GPT-4o, Gemini 2.0 Flash, Claude 3.7 Sonnet, Llama 3.1 405B, Mistral Large 2407, and Command-A—over four educational tasks: misconception identification, feedback selection, interactive tutoring, and translation grading, across six languages including Hindi, Arabic, Farsi, Telugu, Ukrainian, and Czech, in addition to English.

Methodology and Tasks

The methodology benchmarks LLM performance on four tasks chosen for their educational relevance. Each task is designed to have a substantial language component while permitting language-invariant evaluation, so it directly probes the multilingual capabilities of the models (a minimal evaluation sketch follows the list):

  1. Misconception Identification: Evaluates the accuracy of LLMs in diagnosing student misconceptions from incorrect answers to math questions.
  2. Feedback Selection: Assesses the ability of LLMs to select the most appropriate feedback from a set of candidate responses.
  3. Interactive Tutoring: Tests the ability of LLMs to conduct a multi-turn dialogue that guides a simulated student model toward the correct solution.
  4. Translation Grading: Measures the proficiency of LLMs at evaluating language-learning exercises by comparing machine-translated sentences with deliberately perturbed versions.
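
The paper's evaluation harness is not reproduced here, but the task design above maps naturally onto a simple per-language scoring loop. The sketch below is a minimal illustration under assumed names: query_model stands in for whatever LLM API is used, and the item schema is hypothetical rather than the paper's actual data format.

```python
from collections import defaultdict

# Hypothetical item schema (not the paper's actual data format): each item pairs a
# language-invariant gold label with parallel renderings of the same prompt in every
# evaluation language, so per-language accuracies are directly comparable.
ITEMS = [
    {
        "gold": "confuses_area_with_perimeter",
        "prompts": {
            "en": "A student answered 18 cm for the area of a 4 cm x 5 cm rectangle ...",
            "hi": "...",  # Hindi rendering of the same item
            "cs": "...",  # Czech rendering, and so on
        },
    },
]

def query_model(model: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call (client code omitted)."""
    raise NotImplementedError

def evaluate(model: str, items: list[dict], languages: list[str]) -> dict[str, float]:
    """Per-language accuracy on the misconception-identification task."""
    correct = defaultdict(int)
    for item in items:
        for lang in languages:
            answer = query_model(model, item["prompts"][lang])
            # Language-invariant scoring: check the predicted misconception label
            # against the gold label, regardless of prompt language.
            if item["gold"] in answer:
                correct[lang] += 1
    return {lang: correct[lang] / len(items) for lang in languages}
```

The same loop carries over to feedback selection and translation grading; only the scoring rule changes.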

Results

The results indicate a persistent advantage for English, with models generally performing better in English than in the other languages. Nevertheless, the disparities vary considerably across tasks and languages. Notably, GPT-4o and Gemini exhibit the strongest and most consistent performance across the evaluated languages, pointing to differences in model strengths that could influence deployment decisions based on regional language needs.
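
The "performance drop from English" can be made concrete as a per-language delta against the English score. The helper below is a generic illustration; the values in the usage comment are placeholders, not figures reported in the paper.

```python
def english_gap(scores: dict[str, float]) -> dict[str, float]:
    """Accuracy drop relative to English per language (positive = worse than English)."""
    baseline = scores["en"]
    return {lang: round(baseline - acc, 3) for lang, acc in scores.items() if lang != "en"}

# Illustrative placeholder values only (not results from the paper):
#   english_gap({"en": 0.80, "hi": 0.72, "te": 0.65})  ->  {"hi": 0.08, "te": 0.15}
```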

Interestingly, the feedback selection task shows a marked performance drop for all models, signaling a clear area for improvement. The interactive tutoring task further uncovers inconsistencies in model responses, potentially driven by the difficulty of managing multi-turn dialogue across diverse linguistic settings.

Implications and Speculations

The findings emphasize the need for an in-depth evaluation of LLMs across various languages, particularly for educational applications where linguistic accuracy and cultural relevance are paramount. The observed biases suggest a critical opportunity for optimizing LLMs for multilingual settings, perhaps through more balanced and diverse training datasets or through post-training adaptations for specific languages.

Additionally, while the translation grading task underscores the capacity of LLMs to support language learning, it also highlights the challenge of preserving cultural context and semantic nuance in educational materials. Future AI development could focus on enhancing LLM adaptability so that pedagogical integrity is maintained across linguistic borders.

Finally, the paper points to a practical implication: keeping prompts in English while presenting content in the target language offers a streamlined path to internationalizing LLM-based educational applications, provided the models perform comparably in those multilingual setups.
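
As a concrete illustration of that strategy, a prompt can keep its task instructions in English while leaving the student-facing content in the target language. The template below is an assumption for illustration, not a prompt taken from the paper.

```python
def build_feedback_prompt(student_answer: str, target_language: str) -> str:
    """English instructions wrapping target-language student work (illustrative wording)."""
    return (
        "You are a math tutor. The student's answer below is written in "
        f"{target_language}. Identify the misconception and reply with brief, "
        f"encouraging feedback written in {target_language}.\n\n"
        f"Student answer:\n{student_answer}"
    )

# Usage (hypothetical): build_feedback_prompt(hindi_answer_text, "Hindi")
```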

Conclusion

This research highlights the nuanced performance of LLMs in multilingual educational settings and suggests targeted improvements to enhance their utility. By addressing performance biases, developers could harness the full potential of LLMs in diverse educational landscapes, ultimately contributing to equitable educational solutions globally. The paper serves as a foundational reference for future work aimed at addressing linguistic biases and enhancing the global applicability of LLMs in education.
