BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages (2310.11584v1)
Abstract: Current research on automatic readability assessment (ARA) has focused on improving the performance of models in high-resource languages such as English. In this work, we introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower-resource languages in the Philippines. We compiled a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada -- languages belonging to the Central Philippine family tree subgroup -- to train ARA models using surface-level, syllable-pattern, and n-gram overlap features. We also propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data. Our study yields encouraging results that support previous work showcasing the efficacy of cross-lingual models in low-resource settings, as well as similarities in highly informative linguistic features for mutually intelligible languages.
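To make the n-gram overlap features named in the abstract more concrete, here is a minimal Python sketch. It assumes the feature is the fraction of a text's distinct character n-grams that also appear in a language's list of most frequent n-grams. That definition, the function names, and the toy strings are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams inside each whitespace-separated token."""
    counts = Counter()
    for token in text.lower().split():
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def top_ngrams(corpus: list[str], n: int, k: int) -> set[str]:
    """Return the k most frequent character n-grams across a corpus."""
    total = Counter()
    for doc in corpus:
        total.update(char_ngrams(doc, n))
    return {gram for gram, _ in total.most_common(k)}


def ngram_overlap(text: str, reference: set[str], n: int) -> float:
    """Share of the text's distinct n-grams found in the reference set.

    Hypothetical definition used only for illustration.
    """
    grams = set(char_ngrams(text, n))
    if not grams:
        return 0.0
    return len(grams & reference) / len(grams)


# Toy strings (not taken from BasahaCorpus) standing in for real documents.
reference_bigrams = top_ngrams(["banana"], n=2, k=2)
score = ngram_overlap("bandana", reference_bigrams, n=2)
```

Under this assumption, one overlap score per language in the family tree could be appended to a document's feature vector, so that a model trained on pooled data from related languages can pick up on shared orthographic patterns.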