Cebuano & Tagalog Cross-Lingual Evaluation
- Cross-lingual evaluation is the systematic assessment of NLP models across Cebuano and Tagalog, focusing on language transfer, corpus alignment, and translation accuracy.
- Innovative corpus construction and semantic evaluation techniques, including subword translation and n-gram overlap (CROSSNGO), enhance model robustness and performance.
- Advanced benchmarks in document retrieval, NER, and meta-pretraining highlight both the progress and challenges in culturally nuanced, low-resource NLP.
Cross-lingual evaluation with Cebuano and Tagalog is the systematic assessment of NLP models across two major Central Philippine languages, with the objective of measuring language transfer, robustness, and resource utility in low-resource settings. Research in this area covers a spectrum from neural machine translation and lexical semantic similarity to the evaluation of LLMs for syntactic, cultural, and factual competences. Evaluation protocols, architectures, benchmark tasks, and linguistic feature selection are calibrated to the typological profile of Philippine languages and the realities of available data.
1. Corpus Construction and Neural Machine Translation
The development of parallel corpora for Cebuano and Tagalog is foundational for supervised cross-lingual evaluation. Corpus building approaches have involved manual alignment and automated web mining:
- Bible-based Alignment: Utilizing structured biblical texts (e.g., Genesis) in XML, with sentence-level alignments refined through removal of XML tags, normalization, and the de-duplication of repetitive passages. Inconsistencies, especially in noun (proper names) and verb translation, are rectified using a "copyable" approach (enforcing consistent noun mapping) and subword unit translation (for verb normalization) (Adlaon et al., 2021).
- Web-based Mining: Comparable Cebuano and Tagalog Wikipedia articles are collected and aligned by exploiting recurrent topic segments—multi-word phrases characteristic of specific categories (e.g., regions, provinces, cities). These lexical anchors facilitate automatic extraction of parallel sentences.
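The anchor-based pairing idea above can be sketched as follows. This is a minimal illustration under assumed inputs: the anchor phrases, the greedy overlap heuristic, and the example sentences are hypothetical, not the exact mining pipeline of the cited work.

```python
# Sketch: pair sentences from comparable Cebuano/Tagalog articles that
# share recurrent topic-segment anchors (multi-word phrases typical of
# categories such as regions, provinces, cities).

def shared_anchors(sentence: str, anchors: set[str]) -> set[str]:
    """Return the anchor phrases occurring in a sentence (case-insensitive)."""
    lowered = sentence.lower()
    return {a for a in anchors if a in lowered}

def pair_sentences(ceb_sents, tgl_sents, anchors):
    """Greedily pair sentences that share at least one lexical anchor."""
    pairs = []
    for ceb in ceb_sents:
        ceb_hits = shared_anchors(ceb, anchors)
        if not ceb_hits:
            continue
        # Pick the Tagalog sentence with the largest anchor overlap.
        best = max(tgl_sents,
                   key=lambda t: len(ceb_hits & shared_anchors(t, anchors)),
                   default=None)
        if best is not None and ceb_hits & shared_anchors(best, anchors):
            pairs.append((ceb, best))
    return pairs

anchors = {"lalawigan", "probinsya", "siyudad"}  # hypothetical topic segments
ceb = ["Ang Cebu usa ka probinsya sa Pilipinas."]
tgl = ["Ang Cebu ay isang probinsya sa Pilipinas.", "Ito ay isang isla."]
print(pair_sentences(ceb, tgl, anchors))
```

A production pipeline would add length-ratio filtering and translation-probability scoring before accepting a candidate pair.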
For model training, Recurrent Neural Networks (unidirectional RNNs) implemented with the OpenNMT framework and TensorFlow are standard (Adlaon et al., 2019, Adlaon et al., 2021). Training is monitored via logarithmic loss (cross-entropy), and translation quality is evaluated with BLEU scores:
- Bible corpus (uncorrected): BLEU = 20.01
- Bible corpus (with corrections): BLEU = 22.87
- Wikipedia corpus: BLEU = 27.36
These results show that preprocessing with subword and copyable translations significantly mitigates inconsistent translation phenomena, improves word alignment accuracy, and boosts BLEU by 2–4 points (Adlaon et al., 2019, Adlaon et al., 2021).
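For reference, the BLEU figures above follow the standard definition (geometric mean of modified n-gram precisions times a brevity penalty). The sketch below implements that definition in plain Python, assuming one reference per hypothesis; the cited experiments would have used a toolkit scorer (e.g., OpenNMT's) rather than this code.

```python
# Minimal corpus BLEU: uniform 4-gram weights, one reference per hypothesis.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """hypotheses/references: parallel lists of token lists."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        match, total = 0, 0
        for hyp, ref in zip(hypotheses, references):
            hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
            match += sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
            total += sum(hyp_ng.values())
        if match == 0:
            return 0.0
        log_prec += math.log(match / total) / max_n
    hyp_len = sum(len(h) for h in hypotheses)
    ref_len = sum(len(r) for r in references)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

hyp = ["ang balay kay dako".split()]
ref = ["ang balay kay dako".split()]
print(round(corpus_bleu(hyp, ref), 2))  # identical output → 100.0
```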
2. Lexical Semantic and Readability Evaluation Across Languages
Lexical semantic similarity resources (e.g., Multi-SimLex) are adapted for cross-lingual evaluation by ensuring that semantically aligned pairs are included in both Cebuano and Tagalog datasets. The process involves translation, annotation by native speakers using a 0–6 ordinal similarity scale, and rigorous quality control through multi-round adjudication and rank correlation metrics (e.g., average pairwise Spearman's ρ):

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

and

$$\bar{\rho} = \frac{2}{K(K-1)} \sum_{1 \le i < j \le K} \rho(R_i, R_j)$$

where $d_i$ is the rank difference for pair $i$, $n$ is the number of rated pairs, and $R_1, \dots, R_K$ are the $K$ annotators' rating lists.
Cross-lingual extensions are constructed by intersecting aligned pairs and aggregating their scores, resulting in robust evaluation sets that can reveal true model transfer capabilities beyond resource artifacts (Vulić et al., 2020).
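The average pairwise agreement statistic can be computed directly from the annotators' score lists. The sketch below uses the rank-difference form of Spearman's ρ without tie correction (tied scores would need fractional ranks); the score lists are toy data.

```python
# Average pairwise Spearman's rho over K annotators' scores on the same
# word pairs -- the agreement statistic used in Multi-SimLex-style QC.

def rank(xs):
    """1-based ranks, no tie correction (sufficient for untied toy data)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(a, b):
    n = len(a)
    ra, rb = rank(a), rank(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

def average_pairwise_rho(annotations):
    """annotations: list of per-annotator score lists over the same pairs."""
    k = len(annotations)
    rhos = [spearman(annotations[i], annotations[j])
            for i in range(k) for j in range(i + 1, k)]
    return sum(rhos) / len(rhos)

scores = [[0, 2, 4, 6], [1, 2, 3, 5], [0, 3, 4, 6]]
print(average_pairwise_rho(scores))  # perfectly concordant rankings → 1.0
```

In practice a library implementation with tie handling (e.g., SciPy's `spearmanr`) would be preferred.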
Readability assessment leverages:
- Traditional surface features: word, sentence, and syllable counts
- Orthography-based syllable pattern features (e.g., v, cv, cvc, ccvc patterns)
- Neural embeddings (multilingual BERT)
The optimized Random Forest model with traditional and syllable-based features attains ∼87.3% accuracy on Cebuano, mirroring Tagalog results (Reyes et al., 2022). The introduction of novel cross-lingual n-gram overlap features (CROSSNGO) exploits the mutual intelligibility of Cebuano, Tagalog, and related languages. These features, defined as

$$\mathrm{CROSSNGO}_{L,d} = \frac{|NG_L \cap NG_d|}{|NG_d|}$$

(where $NG_L$ is the n-gram list for language $L$ and $NG_d$ those in document $d$), further enhance cross-lingual readability model performance (Imperial et al., 2023).
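A CROSSNGO-style feature is a set-overlap ratio and is straightforward to compute. In this sketch the n-gram granularity (character trigrams), the normalization by document n-grams, and the tiny reference "corpus" are all assumptions for illustration.

```python
# Sketch: fraction of a document's character n-grams that also appear
# in a reference n-gram list for language L (a CROSSNGO-style feature).

def char_ngrams(text: str, n: int) -> set[str]:
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def crossngo(doc: str, lang_ngrams: set[str], n: int = 3) -> float:
    doc_ngrams = char_ngrams(doc, n)
    if not doc_ngrams:
        return 0.0
    return len(doc_ngrams & lang_ngrams) / len(doc_ngrams)

# Hypothetical Tagalog trigram list built from a tiny sample corpus.
tgl_list = char_ngrams("ang bata ay mabait", 3)
print(crossngo("ang bata", tgl_list))  # every trigram occurs in the sample
```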
3. Document Retrieval, Question Answering, and Multitask Benchmarks
For cross-lingual document retrieval, approaches utilize deep bilingual representations in a unified embedding space using Procrustes-aligned fastText vectors, supporting direct retrieval and relevance ranking for languages such as Tagalog and, by extension, Cebuano (Zhang et al., 2019). The relevance score for a query-document pair is a sum over all language and translation pairs:

$$S(q, d) = \sum_{i} \sum_{j} f\big(q^{(i)}, d^{(j)}\big)$$

where each per-pair relevance $f\big(q^{(i)}, d^{(j)}\big)$ leverages term interaction models.
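The aggregation over language and translation views can be sketched as follows. The max-cosine term interaction used for the per-pair relevance is a common simple choice but an assumption here, as are the toy 2-d vectors standing in for aligned fastText embeddings.

```python
# Sketch: sum a term-interaction relevance f over every
# (query-language view, document-translation view) pair.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def relevance(query_views, doc_views):
    """query_views/doc_views: per-language lists of term-embedding lists."""
    score = 0.0
    for q_terms in query_views:        # query in each language
        for d_terms in doc_views:      # document in each translation
            for q in q_terms:
                # Simple term-interaction model: best match per query term.
                score += max((cosine(q, d) for d in d_terms), default=0.0)
    return score

print(relevance([[(1.0, 0.0)]], [[(1.0, 0.0), (0.0, 1.0)]]))  # → 1.0
```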
Open-retrieval QA evaluation tasks (e.g., MIA 2022 Shared Task) have shown that strong baseline systems achieve F1 ≈ 20.75 on Tagalog using entity-aware representations and explicit augmentation of the retrieval database with Tagalog passages. Performance on Tagalog lags significantly unless these strategies are adopted, highlighting persistent retrieval bottlenecks in under-represented languages (Asai et al., 2022).
Recent multitask benchmarks (e.g., FilBench, Batayan) systematically evaluate LLMs in Tagalog, Cebuano, and code-switched Taglish across categories including cultural knowledge, classical NLP, reading comprehension, and translation generation. Weighted aggregation of category-level task scores, for instance

$$\mathrm{Score} = \frac{\sum_{c} w_c\, s_c}{\sum_{c} w_c}$$

(with $s_c$ the score and $w_c$ the weight of category $c$), yields the overall performance figure. Leading LLMs (GPT-4o) reach only ∼72% overall, with generative and translation tasks remaining particularly challenging (average generation scores below 25%), underscoring deficiencies in resource-specific fine-tuning and the importance of cultural context-aware evaluation (Miranda et al., 5 Aug 2025, Montalan et al., 19 Feb 2025).
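A weighted aggregation of category scores is a one-liner; the sketch below uses equal weights and invented category scores purely for illustration of the computation.

```python
# Sketch: aggregate category-level scores into one benchmark number
# via a weighted mean. Category names, scores, and weights are toy values.

def overall_score(scores: dict, weights: dict) -> float:
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in scores) / total

scores = {"cultural": 0.80, "nlp": 0.70, "reading": 0.60, "generation": 0.20}
weights = {"cultural": 1, "nlp": 1, "reading": 1, "generation": 1}
print(overall_score(scores, weights))  # equal weights → arithmetic mean
```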
4. Named Entity Recognition and Meta-Pretraining for Rapid Transfer
NER models trained on annotated Cebuano (CebuaNER), Tagalog, or Hiligaynon corpora test cross-lingual zero-shot transfer to related Philippine languages. Conditional Random Fields (CRF) and BiLSTM architectures provide strong baselines, with optimized CRF models in Cebuano attaining F1≈0.90 on person entities, and moderate zero-shot generalization (F1 ≈ 0.44–0.46 for Cebuano and Tagalog) when models are trained exclusively on Hiligaynon data and evaluated without adaptation (Pilar et al., 2023, Teves et al., 12 Oct 2025). Knowledge transfer is strongest for person entities marked by reliable surface anchors (e.g., Tagalog case particles si/ni).
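Surface anchors of the kind described above translate directly into classifier features. The sketch below shows one such feature function; the feature names and the particle list are illustrative, not the exact CebuaNER feature set.

```python
# Sketch: CRF-style token features exploiting personal case particles
# (si, ni, kay, ...) that reliably precede person names in Tagalog and
# Cebuano -- the surface anchors behind strong PER transfer.

PERSON_PARTICLES = {"si", "ni", "kay", "sina", "nina"}

def token_features(tokens, i):
    feats = {
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
        feats["prev.is_person_particle"] = tokens[i - 1].lower() in PERSON_PARTICLES
    return feats

toks = "Miadto si Juan sa Sugbo".split()
print(token_features(toks, 2))  # features for "Juan": preceded by particle "si"
```

A CRF or BiLSTM tagger would consume one such feature dict (or its embedded analogue) per token.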
Meta-pretraining with first-order Model-Agnostic Meta-Learning (MAML) further boosts zero-shot NER: in compact decoder-only LMs (11M–570M parameters), MAML lifts micro-F1 by 2–6 points (head-only tuning) and cuts convergence time by up to 8%. Gains are largest for single-token person entities and diminish as model size increases, illustrating the capacity-dependent utility of meta-pretraining in low-resource adaptation (Africa et al., 2 Sep 2025).
5. Cross-lingual Lifelong Learning and Instruction Following
Continual learning paradigms evaluate models across sequentially presented language datastreams, tracking:
- Forgetting ($F$): the average drop from a language's best earlier score to its score after the final step, $F = \frac{1}{N-1}\sum_{i=1}^{N-1}\big(\max_{t<N} R_{t,i} - R_{N,i}\big)$, where $R_{t,i}$ is performance on language $i$ after training step $t$
- Zero-shot transfer ($T$): performance on languages not yet seen in the datastream
- Accumulation and utility metrics
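The forgetting computation over a sequential datastream can be sketched as follows; the score matrix and the two-language (Cebuano then Tagalog) schedule are toy assumptions.

```python
# Sketch: forgetting over a sequential language datastream. R[t][i] is
# accuracy on language i after training step t; forgetting for language i
# is the drop from its best earlier score to its final score.

def forgetting(R):
    """R: list of per-step score lists (steps x languages)."""
    n_langs = len(R[-1])
    drops = []
    for i in range(n_langs - 1):  # exclude the last-trained language
        best_earlier = max(R[t][i] for t in range(len(R) - 1))
        drops.append(best_earlier - R[-1][i])
    return sum(drops) / len(drops)

# Scores on (ceb, tgl) after training on ceb, then on tgl (toy numbers):
R = [[0.80, 0.30],
     [0.65, 0.85]]
print(round(forgetting(R), 2))  # ceb dropped 0.80 -> 0.65 → 0.15
```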
These frameworks are directly applicable for modeling realistic multi-hop adaptation scenarios, where the performance on Cebuano and Tagalog is assessed after sequential exposure (M'hamdi et al., 2022).
Instruction-following evaluation (MaXIFE) and factuality assessment (CCFQA) now encompass Filipino/Tagalog in a suite of cross-lingual instruction and QA tasks, with specialized metrics (e.g., strict and loose scores, F1, and consistency ratios) that quantify the robustness of LLMs under multiple instruction templates and modalities. The absence of Cebuano in MaXIFE, and the inclusion of both Cebuano and Tagalog in CCFQA, underlines ongoing discrepancies in multilingual coverage and the need for modular benchmarks amenable to further language extensions (Liu et al., 2 Jun 2025, Du et al., 10 Aug 2025).
6. Cultural, Linguistic, and Practical Considerations
Benchmarks such as Kalahi stress the necessity of evaluating models for cultural competence in Manila Educated Tagalog (the Filipino standard), using both multiple-choice (MC1, MC2) and open-ended tasks calibrated to Filipino values and communicative norms. Top LLMs achieve only ∼46% correctness compared to human scores of 89.10%, reinforcing the difficulty of cultural transfer and the need for instruction tuning on culturally nuanced corpora (Montalan et al., 20 Sep 2024).
Cross-lingual readability assessment, document retrieval, and NER all benefit from leveraging the linguistic proximity among central Philippine languages, but also reveal structural divergences (e.g., Tagalog’s more overt voice/case marking versus Cebuano’s reduced voice system) which modulate the transfer effectiveness of surface and contextual features.
Open-source resources—whether for readability (https://github.com/imperialite/cebuano-readability) or tagged NER corpora—facilitate experimentation and progress in multilingual Philippine NLP research (Reyes et al., 2022, Pilar et al., 2023).
7. Implications and Future Directions
Cross-lingual evaluation with Cebuano and Tagalog has produced empirical insights:
- Neural and data-driven approaches can be adapted and evaluated using both traditional and modern (contextual, instruction-following) metrics.
- Architectural and preprocessing choices (e.g., subword translation, n-gram overlap, entity-aware representations, meta-pretraining) substantially affect transfer outcomes.
- Mutual intelligibility and purposeful feature engineering—e.g., using the CROSSNGO metric—amplify generalization between related languages.
- The remaining performance gaps across translation, comprehension, and generation underscore the need for enriched corpora, tailored fine-tuning, more sophisticated architectures (transformer-based, adapter-based), and culturally participatory evaluation paradigms.
- Comprehensive, modular benchmarks permit robust comparison and guide further developments in multilingual and cross-lingual models for under-resourced settings.
Ongoing research is expected to intensify efforts in corpus expansion, benchmark development, and evaluation design to ensure equitable, reliable NLP support for both Cebuano and Tagalog, and by extension for the full diversity of Philippine languages.