
Cebuano Readability Corpus Overview

Updated 14 September 2025
  • Cebuano Readability Corpus is a curated dataset of annotated children’s narratives segmented into educational grade levels to enable automatic readability assessment.
  • It employs a blend of traditional surface, syllable pattern, cross-lingual n-gram, and neural features to capture text complexity in a low-resource language context.
  • Empirical findings show that integrating cross-lingual and hierarchical modeling significantly enhances readability prediction accuracy for Central Philippine languages.

The Cebuano Readability Corpus refers collectively to annotated datasets, feature extraction protocols, and associated modeling pipelines designed for automatic readability assessment (ARA) in Cebuano, the Philippines' second most-spoken native language. Recent research has systematically defined the methodology, linguistic features, and modeling paradigms necessary to support both monolingual and cross-lingual readability analysis in Cebuano and related Central Philippine languages. These studies demonstrate robust performance using a combination of surface-based, orthographic, and neural features, while also highlighting the utility of cross-lingual features in low-resource NLP settings.

1. Corpus Composition and Annotation

The Cebuano Readability Corpus comprises curated literary pieces—primarily children’s narratives and short stories—annotated into discrete grade levels (L1, L2, L3). Initial versions include 277 annotated documents sourced from resources such as Let’s Read Asia and Bloom Library (Reyes et al., 2022, Imperial et al., 2023). These texts were selected to cover a diversity of genres, with per-document statistics (mean word count, sentence count, vocabulary size) supporting fine-grained readability analysis.

Annotations reflect educational grade bands and are aligned with the practice in related languages, ensuring that readability models can target educational and pedagogical requirements. The same corpus supports other Central Philippine language projects leveraging Cebuano as a “parent” language in the Bisayan subgroup (Imperial et al., 2023).

2. Feature Extraction Methodologies

Feature extraction protocols for the Cebuano Readability Corpus are grounded in linguistic typology and orthography:

a. Traditional Surface-Based Features (TRAD)

  • Number of unique words
  • Total number of words
  • Average word length
  • Average number of syllables per word
  • Total number of sentences
  • Average sentence length
  • Number of polysyllable words

These frequency-based features capture lexical diversity, word and sentence complexity, and are directly inherited from established formulas in Filipino readability research (Reyes et al., 2022).
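The TRAD feature set above can be sketched as a single extraction function. This is a minimal illustration, not the authors' exact extractor: syllable counting uses a naive vowel-run heuristic, and "polysyllable" is taken to mean three or more syllables, following common readability-formula conventions.

```python
import re

def trad_features(text):
    """Compute traditional surface-based (TRAD) readability features.

    Sketch only: syllables are counted as maximal vowel runs, which is an
    approximation of Cebuano syllabification.
    """
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    def syllables(word):
        # Naive count: each maximal run of vowels is one syllable.
        return max(1, len(re.findall(r"[aeiou]+", word)))

    syll_counts = [syllables(w) for w in words]
    return {
        "unique_words": len(set(words)),
        "total_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_syllables_per_word": sum(syll_counts) / len(words),
        "total_sentences": len(sentences),
        "avg_sentence_length": len(words) / len(sentences),
        "polysyllable_words": sum(1 for c in syll_counts if c >= 3),
    }
```

Each value maps one-to-one onto the seven bullets above, so the output dictionary can be fed directly into a feature matrix.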

b. Syllable Pattern Features (SYLL)

Seven core syllable templates based on Cebuano's documented orthography are extracted, with further patterns added in later extensions:

  • v (single vowel)
  • cv (consonant-vowel)
  • cc (consonant cluster)
  • vc (vowel-consonant)
  • cvc (consonant-vowel-consonant)
  • ccv (consonant-consonant-vowel)
  • ccvc (consonant-consonant-vowel-consonant)
  • [additional patterns: vcc, cvcc as in BasahaCorpus (Imperial et al., 2023)]

Density for each pattern is normalized by the total number of words:

$$d_{\mathrm{pattern}} = \frac{\mathrm{count}(\mathrm{pattern})}{\mathrm{total\ words}}$$

Syllable-based features are specifically attuned to Cebuano’s orthographic constraints, targeting character-level complexity highly relevant for low-resource Philippine languages.

c. Cross-lingual N-gram Overlap Features (CROSSNGO)

A novel cross-lingual feature, CROSSNGO, captures mutual intelligibility between closely related languages by quantifying bigram and trigram overlap:

$$\mathrm{CrossNGO}_{L,n}(d) = \frac{\mathrm{count}(m(L)_n \cap m(d)_n)}{\mathrm{count}(m(d)_n)}$$

where $m(\cdot)_n$ is the set of unique n-grams from language $L$ or document $d$.

CROSSNGO establishes a connection between Cebuano and other Central Philippine languages (e.g., Tagalog, Hiligaynon) by exploiting shared high-frequency n-grams, bypassing the scarcity of deep language-specific NLP tools (Imperial et al., 2023, Imperial et al., 2023).
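The overlap ratio above reduces to a small set computation. This sketch uses character n-grams and treats a list of texts as a stand-in for the reference corpus of language L; the published feature is computed per n (bigrams and trigrams) for each related language.

```python
def cross_ngo(language_texts, document, n=2):
    """CROSSNGO sketch: fraction of the document's unique character n-grams
    that also occur in a reference language's n-gram inventory."""
    def ngrams(text, n):
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    lang_ngrams = set()
    for t in language_texts:
        lang_ngrams |= ngrams(t, n)

    doc_ngrams = ngrams(document, n)
    if not doc_ngrams:
        return 0.0
    return len(lang_ngrams & doc_ngrams) / len(doc_ngrams)
```

A document written entirely in the reference language scores 1.0; scores between closely related languages stay high because of shared orthography and cognates, which is exactly the mutual-intelligibility signal the feature exploits.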

d. Neural Embeddings

Multilingual BERT (mBERT) embeddings are generated for Cebuano texts by mean pooling across the model’s 12 layers, yielding 768-dimensional dense vectors. mBERT is pre-trained on large multi-language Wikipedia dumps—which includes Cebuano—providing contextual representations that encode semantic and syntactic properties (Reyes et al., 2022, Imperial et al., 2023).
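The pooling step itself is straightforward to isolate. The sketch below mean-pools a nested list of per-layer token vectors (as would come from mBERT with `output_hidden_states=True`, 12 layers of 768-dim token embeddings) into one document vector; it shows only the pooling arithmetic, not model inference.

```python
def mean_pool_layers(hidden_states):
    """Mean-pool token embeddings across transformer layers.

    hidden_states: list of layers, each a list of token vectors
    (e.g. 12 layers x T tokens x 768 dims); returns one pooled vector.
    Plain-Python sketch of the pooling step only.
    """
    dim = len(hidden_states[0][0])
    pooled = [0.0] * dim
    count = 0
    for layer in hidden_states:
        for token_vec in layer:
            for i, v in enumerate(token_vec):
                pooled[i] += v
            count += 1
    return [x / count for x in pooled]
```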

3. Modeling Paradigms and Empirical Results

The principal modeling framework for Cebuano readability is a Random Forest (RF) classifier. Key hyperparameters:

  • Number of estimators: 100
  • Maximum depth: 20
  • Maximum features: automatically tuned via grid search
  • Cross-validation: stratified k = 5 folds
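The reported setup can be reproduced as a configuration sketch, assuming scikit-learn. The synthetic `X`/`y` data below is a placeholder for the extracted feature matrix and grade labels; note that scikit-learn expresses the "automatic" max-features search as an explicit grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder for the extracted feature matrix and L1/L2/L3 labels.
X, y = make_classification(n_samples=120, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
grid = GridSearchCV(
    rf,
    param_grid={"max_features": ["sqrt", "log2", None]},  # tuned via grid search
    cv=StratifiedKFold(n_splits=5),                       # stratified 5-fold CV
    scoring="accuracy",
)
grid.fit(X, y)
print(round(grid.best_score_, 3))
```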

Multiple comparative experiments evaluated feature sets and their combinations:

| Features | Accuracy (Cebuano only) | Accuracy (Cross-lingual) |
|---|---|---|
| TRAD + SYLL | ~87% | — |
| mBERT alone | 71.015% | — |
| TRAD + SYLL + CROSSNGO | — | 78.27% |
| TRAD + SYLL + CROSSNGO + mBERT | — | 79.710% |

Empirical findings indicate that traditional features (TRAD + SYLL) are highly effective for Cebuano, outperforming neural-only models. The addition of CROSSNGO further improves accuracy, particularly in joint setups with related languages (Reyes et al., 2022, Imperial et al., 2023).

4. Cross-lingual and Hierarchical Modeling

Cross-lingual strategies leverage mutual intelligibility and family tree relations within Central Philippine languages. Several configurations, as explored in BasahaCorpus (Imperial et al., 2023), demonstrate the impact of incorporating Cebuano data:

  • Monolingual (L): Model trained solely on Cebuano data.
  • L + Parent Lang (L+P): Target language data augmented with Cebuano as "parent" for Bisayan subgroup languages.
  • L + National Lang (L+N): Combination of target language with Tagalog.
  • L + P + N: Target, parent (Cebuano), and Tagalog data pooled.
  • All Languages (*L): Pooled data from Cebuano, Tagalog, Bikol, and other Bisayan languages.

Results show that increased n-gram overlap between Cebuano and related languages yields measurable gains in readability prediction accuracy, affirming that surface-level and syllable-pattern features are highly transferable and informative within this language family.
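The five configurations above amount to different ways of pooling per-language training data. A minimal sketch, assuming corpora keyed by language name and (text, grade) pairs; the `parent`/`national` defaults reflect the Cebuano/Tagalog roles described in the text.

```python
def build_training_set(corpora, target, config, parent="Cebuano", national="Tagalog"):
    """Assemble pooled training data for the BasahaCorpus-style setups.

    corpora: dict mapping language name -> list of (text, grade) pairs.
    config: one of "L", "L+P", "L+N", "L+P+N", "*L".
    """
    pools = {
        "L": [target],                         # monolingual
        "L+P": [target, parent],               # + parent language
        "L+N": [target, national],             # + national language
        "L+P+N": [target, parent, national],   # + both
        "*L": list(corpora),                   # all available languages
    }
    data = []
    for lang in pools[config]:
        data.extend(corpora.get(lang, []))
    return data
```

The pooled list feeds the same Random Forest pipeline as the monolingual case, so comparing configurations requires no other changes to the modeling setup.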

A plausible implication is that assembling larger joint corpora with balanced representation from Cebuano and related languages in the Central Philippine family will continue to yield improvements in readability assessment performance, particularly for lower-resourced target languages.

5. Open-Source Availability and Benchmarking

Both the Cebuano Readability Corpus and associated codebases have been released via public repositories such as https://github.com/imperialite/cebuano-readability (Reyes et al., 2022). This open-source contribution allows for:

  • Reproducibility and validation of baseline models
  • Extension of feature sets (including custom neural or cross-lingual metrics)
  • Further cross-lingual experimentation among Philippine languages

This resource serves as a benchmark for evaluating future models, ensuring that subsequent improvements in automatic readability assessment methodologies can be measured against standardized data and protocols.

6. Significance and Applicability

The Cebuano Readability Corpus operationalizes a comprehensive set of linguistically informed features and modeling best practices that can be replicated for other low-resource languages with similar orthographic and typological characteristics. It demonstrates:

  • Robust ARA modeling is possible without deep, language-specific NLP resources, by systematically leveraging orthographic, surface, and mutual intelligibility-based features.
  • Readability features and modeling workflows are highly portable across closely related languages within the Central Philippine family, supporting sustainable NLP resource development (Reyes et al., 2022, Imperial et al., 2023).

The synergy of traditional surface-based metrics, explicit orthographic analysis, neural representations, and cross-lingual transfer—validated on a growing suite of annotated corpora—establishes the Cebuano Readability Corpus as a foundational resource for linguistic analysis, educational NLP applications, and broader cross-language ARA research in low-resource environments.
