- The paper proposes a supervised machine learning framework to classify Arabic text readability into four levels (Easy, Medium, Difficult, Very Difficult) for language learners.
- A large corpus of 39,792 Arabic documents was manually categorized, preprocessed (tokenizing, filtering, normalizing, stemming), and used with statistical features (Count, TF-IDF) for classification.
- Experimental evaluation showed high accuracy, with SVM and Multinomial Naive Bayes performing best using n-gram features (unigrams and bigrams combined) for predicting text difficulty levels.
The paper "Efficient Measuring of Readability to Improve Documents Accessibility for Arabic Language Learners" presents an approach to predicting text complexity levels tailored specifically to Arabic language learners. The approach uses supervised machine learning to build a classifier that discriminates between levels of text difficulty. The goal is to simplify the selection of suitable reading materials for learners, thereby improving educational outcomes.
Data Collection and Preprocessing
A corpus of 39,792 Arabic documents was collected from various online sources and manually categorized into four complexity levels: Easy, Medium, Difficult, and Very Difficult. The labels were assigned according to target-audience criteria: primary-school readers, high-school readers, college-level readers, and language specialists, respectively.
The preprocessing phase involved several steps to purify the dataset and improve classifier performance:
- Tokenizing: Splitting text into word tokens for processing.
- Filtering: Eliminating punctuation, non-Arabic characters, and other irrelevant data.
- Normalizing: Standardizing certain Arabic letters.
- Stop-word removal: Excluding common function words that carry little classification signal.
- Stemming: Applying light stemming to reduce words to base forms by removing affixes, without altering root meaning significantly.
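The pipeline above can be sketched as follows. The normalization rules, the tiny stop-word subset, and the affix lists for light stemming are illustrative assumptions, not the authors' exact implementation:

```python
import re

# Tiny illustrative stop-word subset, stored in already-normalized form.
ARABIC_STOPWORDS = {"من", "في", "عن", "ان", "هذا", "هذه", "ما", "لا"}
# Example affix lists for light stemming (assumed, not the paper's lists).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")

def normalize(token: str) -> str:
    """Standardize common Arabic letter variants."""
    token = re.sub("[إأآا]", "ا", token)  # unify alef forms
    token = re.sub("ى", "ي", token)       # alef maqsura -> ya
    return token

def light_stem(token: str) -> str:
    """Strip at most one common prefix and one suffix (light stemming)."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 2:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 2:
            token = token[:-len(s)]
            break
    return token

def preprocess(text: str) -> list[str]:
    # Tokenize and filter: keep runs of Arabic letters only.
    tokens = re.findall(r"[\u0621-\u064A]+", text)
    tokens = [normalize(t) for t in tokens]                    # normalize
    tokens = [t for t in tokens if t not in ARABIC_STOPWORDS]  # stop words
    return [light_stem(t) for t in tokens]                     # light stemming
```

Light stemming, unlike root extraction, removes only surface affixes, which keeps distinct derived words separate while still collapsing inflectional variants.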
Methodology
The authors employed a supervised classification framework, deploying various machine learning algorithms:
- Multinomial Naïve Bayes
- Bernoulli Naïve Bayes
- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest
The model utilized statistical representations of text data, employing both Count and TF-IDF vectors, with unigrams and bigrams as features. The task was formulated as a multi-class classification problem, aiming to maximize the conditional probability of associating a document with its corresponding readability level.
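This setup maps directly onto scikit-learn, as sketched below: Count or TF-IDF features over combined unigrams and bigrams feed one of the listed classifiers. Hyperparameters here are illustrative defaults, not the paper's tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_model(vectorizer: str = "tfidf", classifier: str = "svm"):
    """Assemble a vectorizer + classifier pipeline for readability levels."""
    # ngram_range=(1, 2) extracts unigrams and bigrams together.
    if vectorizer == "tfidf":
        vec = TfidfVectorizer(ngram_range=(1, 2))
    else:
        vec = CountVectorizer(ngram_range=(1, 2))
    clf = {
        "mnb": MultinomialNB(),
        "bnb": BernoulliNB(),
        "logreg": LogisticRegression(max_iter=1000),
        "svm": LinearSVC(),
        "rf": RandomForestClassifier(),
    }[classifier]
    return make_pipeline(vec, clf)
```

Each pipeline is fit on labeled documents and predicts one of the four readability levels; scikit-learn handles the multi-class formulation internally (one-vs-rest for the linear models).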
Experimental Evaluation
The experiments were conducted using 80% of the data for training and the remaining 20% for testing. The key findings indicated that:
- Multinomial Naïve Bayes achieved the highest accuracy with Count Vector representations, peaking at 86.47%.
- Support Vector Machine excelled with TF-IDF Vectors, with a top accuracy of 87.14%.
- Using a combination of unigrams and bigrams consistently provided better classification accuracy across both vector types compared to unigrams or bigrams alone.
Discussion and Conclusions
The experimental results underscore the effectiveness of combined unigram-bigram features for readability classification. They also highlight the impact of data representation: TF-IDF vectors outperformed Count vectors for some classifiers (notably SVM), while Count vectors suited Multinomial Naïve Bayes best. The authors propose further improvements: expanding the corpus, incorporating deeper linguistic features, and refining the granularity of the difficulty levels.
This work contributes substantially to the relatively underexplored domain of Arabic text readability and proposes a machine learning-based framework that could be pivotal for educators and learners alike. The approach not only aids in selecting appropriate educational content but also holds potential benefits for understanding complexity in other text-related applications such as e-learning platforms and content generation.