
Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition (2404.00565v1)

Published 31 Mar 2024 in cs.CL

Abstract: Wikipedia articles (content pages) are commonly used as corpora in NLP research, especially for low-resource languages other than English. However, a few studies have examined the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), and documented issues in the Egyptian Arabic edition: its articles were created automatically at massive scale using template-based translation from English to Arabic without human involvement, overwhelming the edition with articles that not only have low-quality content but also do not represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate the template-translation problem in the Egyptian Arabic Wikipedia by identifying the template-translated articles and their characteristics through exploratory analysis and by building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and use the resulting insights to build multivariate machine learning classifiers that leverage articles' metadata to detect template-translated articles automatically. We then publicly deploy and host the best-performing classifier, XGBoost, as an online application called EGYPTIAN WIKIPEDIA SCANNER, and release the extracted, filtered, and labeled datasets so that the research community can benefit from both the datasets and the web-based detection system.

