
Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability (2311.09696v2)

Published 16 Nov 2023 in cs.CL

Abstract: ChatGPT has recently emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. To uncover which languages ChatGPT "knows", we investigate its language identification (LID) abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 24 language families spoken in five continents. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource. We then study ChatGPT's (both GPT-3.5 and GPT-4) ability to (i) identify language names and language codes, (ii) under zero- and few-shot conditions, (iii) with and without provision of a label set. When compared to smaller finetuned LID tools, we find that ChatGPT lags behind. For example, it has poor performance on African languages. We conclude that current LLMs would benefit from further development before they can sufficiently serve diverse communities.
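
The three experimental axes named in the abstract (name vs. code, zero- vs. few-shot, with vs. without a label set) can be pictured as options on a single prompt builder. The following is only an illustrative sketch: the function `build_lid_prompt`, its parameters, and the prompt wording are assumptions made for this example and are not taken from the paper.

```python
from typing import Optional, Sequence, Tuple


def build_lid_prompt(
    text: str,
    target: str = "name",  # hypothetical switch: "name" or "code"
    examples: Sequence[Tuple[str, str]] = (),  # empty means zero-shot
    label_set: Optional[Sequence[str]] = None,  # None means no label set
) -> str:
    """Assemble a language identification prompt (wording is assumed)."""
    if target not in ("name", "code"):
        raise ValueError("target must be 'name' or 'code'")
    what = "language name" if target == "name" else "language code"
    parts = [f"Identify the {what} of the text below."]
    # Optionally restrict the answer to a closed set of labels.
    if label_set is not None:
        parts.append("Choose one of: " + ", ".join(label_set))
    # Few-shot demonstrations, if any, precede the query.
    for sample, label in examples:
        parts.append(f"Text: {sample}\nLanguage: {label}")
    parts.append(f"Text: {text}\nLanguage:")
    return "\n\n".join(parts)


zero_shot = build_lid_prompt("Bonjour le monde")
few_shot = build_lid_prompt(
    "Habari ya asubuhi",
    target="code",
    examples=[("Bonjour", "fra")],
    label_set=["fra", "swh"],
)
```

Passing no examples and no label set gives the most open-ended condition, while supplying both gives the most constrained one.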

