An Open Multilingual System for Scoring Readability of Wikipedia (2406.01835v1)

Published 3 Jun 2024 in cs.CL and cs.AI

Abstract: With over 60M articles, Wikipedia has become the largest platform for open and freely accessible knowledge. While it has more than 15B monthly visits, its content is believed to be inaccessible to many readers due to the lack of readability of its text. However, previous investigations of the readability of Wikipedia have been restricted to English only, and there are currently no systems supporting the automatic readability assessment of the 300+ languages in Wikipedia. To bridge this gap, we develop a multilingual model to score the readability of Wikipedia articles. To train and evaluate this model, we create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children encyclopedias. We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages and improving upon previous benchmarks. These results demonstrate the applicability of the model at scale for languages in which there is no ground-truth data available for model fine-tuning. Furthermore, we provide the first overview on the state of readability in Wikipedia beyond English.

Summary

  • The paper introduces a novel multilingual readability scoring model using a neural pairwise ranking approach and achieves high accuracy.
  • It uses a dataset spanning 14 languages, including 11 new ones, and enhances text extraction by parsing HTML instead of wikitext.
  • The model exhibits strong zero-shot cross-lingual transfer with ranking accuracies above 0.80 and near-perfect performance on simplewiki-en.

An Open Multilingual System for Scoring Readability of Wikipedia

Overview

The paper "An Open Multilingual System for Scoring Readability of Wikipedia," authored by Mykola Trokhymovych, Indira Sen, and Martin Gerlach, tackles the crucial problem of automatic readability assessment (ARA) across multilingual Wikipedia articles. While traditional efforts have largely focused on English, there is a growing need to address the accessibility of Wikipedia content in over 300 languages. The authors propose a novel multilingual model to score the readability of Wikipedia articles, presenting strong empirical results and providing a significant contribution to the field.

Data and Model

The authors created a novel multilingual dataset spanning 14 languages by matching Wikipedia articles with simplified counterparts from Simple English Wikipedia and online children’s encyclopedias. The dataset improves on previous multilingual resources by adding new sources and languages and by providing cleaner text through better preprocessing: it introduces 11 new languages, including Basque, Catalan, and Sicilian, and extracts text by parsing the rendered HTML rather than the wikitext, as sketched below.
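
As an illustration of the HTML-based extraction, the following minimal sketch pulls paragraph text from a rendered article page (for example, HTML served by the Wikimedia APIs). The tag selection and cleanup rules are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: extract prose paragraphs from rendered article HTML.
# Which tags to drop is an assumption for illustration only.
from bs4 import BeautifulSoup

def extract_paragraphs(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Remove non-prose elements (tables, reference markers, styling, scripts).
    for tag in soup(["table", "sup", "style", "script"]):
        tag.decompose()
    # Keep only non-empty paragraph texts.
    return [p.get_text(" ", strip=True)
            for p in soup.find_all("p")
            if p.get_text(strip=True)]
```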

The core of the system is a multilingual readability scoring model trained with a neural pairwise ranking model (NPRM) architecture and a margin ranking loss. The approach builds on pretrained multilingual LLMs such as XLM-RoBERTa, alongside a sentence-by-sentence ranking model (SRank) based on BERT. The model is fine-tuned primarily on English data yet transfers zero-shot to other languages, which the authors validate by applying it to datasets for which no language-specific fine-tuning data is available.
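
The sketch below illustrates this pairwise training setup, assuming a scalar-scoring head on top of XLM-RoBERTa and a standard margin ranking loss. The pooling strategy, head design, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a pairwise readability ranker trained with a margin ranking loss.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ReadabilityScorer(nn.Module):
    def __init__(self, model_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # first-token pooling (assumption)
        return self.head(cls).squeeze(-1)   # one readability score per text

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = ReadabilityScorer()
loss_fn = nn.MarginRankingLoss(margin=1.0)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(simpler_texts, harder_texts):
    """One update on a batch of matched (simplified, original) article pairs."""
    simple_batch = tokenizer(simpler_texts, padding=True, truncation=True,
                             return_tensors="pt")
    hard_batch = tokenizer(harder_texts, padding=True, truncation=True,
                           return_tensors="pt")
    s_simple = model(**simple_batch)
    s_hard = model(**hard_batch)
    # target = 1 tells the loss the first argument should score higher,
    # so the model learns to give simplified text the larger readability score.
    target = torch.ones_like(s_simple)
    loss = loss_fn(s_simple, s_hard, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```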

Experimental Evaluation

The model yields strong results, achieving a ranking accuracy above 0.80 across all tested languages. On the primary dataset, simplewiki-en, it attains a near-perfect ranking accuracy of 0.976, and performance on other datasets such as vikidia-en is similarly high, demonstrating the model's robustness.
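
For reference, the ranking accuracy used in the evaluation can be read as the fraction of matched (original, simplified) pairs for which the model scores the simplified version as more readable. The small helper below illustrates this; score_fn is a hypothetical scoring function such as the trained model above.

```python
# Illustrative pairwise ranking accuracy over matched article pairs.
def ranking_accuracy(pairs, score_fn) -> float:
    """pairs: iterable of (original_text, simplified_text) tuples."""
    correct = 0
    total = 0
    for original, simplified in pairs:
        # A pair counts as correct if the simplified text gets the higher score.
        if score_fn(simplified) > score_fn(original):
            correct += 1
        total += 1
    return correct / total if total else 0.0
```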

When validated on previously established benchmarks such as vikidia-en, vikidia-fr, and OneStopEnglish (OSE), the model again outperformed existing baselines, including the original NPRM approach. For instance, the ranking accuracy reported for vikidia-en was 0.984, markedly higher than that of previous leading models.

Analysis and Application

The authors further examined their model's interpretability by correlating its scores with well-known readability formulas such as the Flesch Reading Ease and the Flesch-Kincaid grade level. The strong correlations observed across several languages help validate the model's scores against established readability measures.
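
For reference, these two traditional formulas are straightforward to compute. The sketch below uses the standard coefficients; the tokenization and syllable heuristic are simplified, English-oriented assumptions for illustration only.

```python
# Flesch Reading Ease and Flesch-Kincaid Grade Level with standard coefficients.
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; real tools use dictionaries or better rules."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability_formulas(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    fre = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)
    fkgl = 0.39 * (n_words / sentences) + 11.8 * (n_syllables / n_words) - 15.59
    return {"flesch_reading_ease": fre, "flesch_kincaid_grade": fkgl}
```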

As a large-scale application, the authors used the model to produce an overview of the state of readability across 24 Wikipedia language editions. This analysis indicates that textual complexity is often higher than desirable for the average reader, echoing findings from previous studies on English Wikipedia.

Implications and Future Directions

The implications of this research are manifold:

  • Enhancing Readability Research: The introduction of an openly available, high-quality dataset and an efficient model facilitates more extensive research in multilingual readability. Researchers can now better investigate and improve the accessibility of content across different languages.
  • Educational and Sociocultural Impact: By highlighting articles with low readability, this research aids editors in simplifying complex texts, thus broadening the reach of Wikipedia to less literate or younger readers.
  • Text Simplification: The results and the associated public API endpoint provide a practical tool for automated text simplification efforts, assisting in editing workflows and developing auxiliary educational tools.

Conclusion

The research by Trokhymovych et al. stands out for its extensive dataset, innovative model architecture, and consistent performance across multiple languages. By addressing the multilingual aspect of readability assessment, this paper makes a significant contribution to the field, providing critical tools and data for future research and applied efforts in making Wikipedia content universally accessible. Future developments could focus on expanding the model's capabilities to even more languages and integrating fine-grained, sentence-level readability assessments to further enhance the model's applicability.
