HiligayNER: NER Baseline for Hiligaynon
- HiligayNER is the first public baseline system for Named Entity Recognition in Hiligaynon, utilizing a carefully annotated corpus from diverse sources.
- The system employs mBERT and XLM-RoBERTa with token-level softmax classification, achieving macro F1 scores around 0.86 through refined fine-tuning strategies.
- Cross-lingual evaluations on Cebuano and Tagalog demonstrate HiligayNER’s potential for low-resource transfer, paving the way for equitable regional NLP advancements.
HiligayNER is the first publicly available baseline system for Named Entity Recognition (NER) in Hiligaynon, an Austronesian language spoken primarily in the Western Visayas and Soccsksargen regions of the Philippines. Developed in response to the lack of annotated corpora and computational models for low-resource Philippine languages, HiligayNER leverages multilingual Transformer architectures and a carefully curated dataset to provide robust NER capabilities. The system is designed to support practical applications in regional language processing and to facilitate research into cross-lingual and low-resource NER.
1. Corpus Collection and Annotation Standards
The HiligayNER dataset comprises over 8,000 annotated sentences drawn from diverse sources, including news media (“Ang Pulong Sang Dios,” “Ilonggo News Live,” “Hiligaynon News and Features,” “Bombo Radyo Bacolod,” and “Ilonggo Balita sa Uma”), social media posts, and literary texts (Teves et al., 12 Oct 2025). Sentences were cleaned and selected to ensure both linguistic authenticity and coverage across genres. Annotation was performed by native speakers of Hiligaynon, primarily linguistics students with extensive training; inter-annotator agreement reached a Cohen’s Kappa of 0.81, indicating substantial agreement.
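For reference, agreement of this kind is computed from two annotators' parallel label sequences over the same tokens. The snippet below is an illustrative sketch using scikit-learn; the label sequences are invented for demonstration and are not from the HiligayNER corpus.

```python
from sklearn.metrics import cohen_kappa_score

# Invented parallel annotations from two annotators over the same tokens.
annotator_a = ["B-PER", "I-PER", "OTH", "B-LOC", "OTH", "B-ORG", "OTH"]
annotator_b = ["B-PER", "I-PER", "OTH", "B-LOC", "OTH", "OTH",   "OTH"]

# Cohen's Kappa corrects raw agreement for chance; HiligayNER reports 0.81
# on the full corpus.
print(f"Cohen's Kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```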
The annotation scheme follows the industry-standard BIO (Beginning, Inside, Outside) protocol, with four entity categories: Person (B-PER, I-PER), Organization (B-ORG, I-ORG), Location (B-LOC, I-LOC), and Other (OTH). Annotation was conducted at the sentence level to facilitate fine-grained token-wise NER and structured downstream processing.
| Entity Type | BIO Tagging | Frequency (relative) |
|---|---|---|
| Person | B-PER, I-PER | Highest |
| Organization | B-ORG, I-ORG | Moderate |
| Location | B-LOC, I-LOC | Moderate |
| Other | OTH | Variable |
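A constructed example (not drawn from the corpus) illustrates how a sentence is tokenized and tagged under this inventory, reading OTH as the tag for tokens outside the three entity types:

```python
# Constructed illustration only; not a sentence from the HiligayNER corpus.
tokens = ["Nagkadto", "si",  "Maria", "sa",  "Iloilo", "City"]
tags   = ["OTH",      "OTH", "B-PER", "OTH", "B-LOC",  "I-LOC"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```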
The annotation was scrutinized for consistency, with error patterns analyzed to ensure entity boundary adherence and to identify confusion points—most notably at organization-location transitions.
2. Model Architecture and Fine-Tuning
HiligayNER employs two leading multilingual Transformer models: Multilingual BERT (mBERT) and XLM-RoBERTa (XLM-R). Both models are pretrained on large multilingual corpora but differ in depth, hidden size, and corpus scale.
mBERT is a 12-layer Transformer model with 768-dimensional hidden states and 12 attention heads, pretrained on Wikipedia dumps from 104 languages. XLM-RoBERTa extends this paradigm to 24 layers, 1024-dimensional hidden states, and 16 attention heads, with pretraining on 2.5 TB of Common Crawl data and SentencePiece tokenization covering 100 languages.
For NER, both models append a token-level softmax classification head to the final-layer representations. Specifically, for each token representation $h_i$, the entity label distribution is computed as:

$$P(y_i \mid h_i) = \mathrm{softmax}(W h_i + b)$$

where $W$ and $b$ are the weight matrix and bias of the classification layer and $y_i$ ranges over the BIO label set.
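A minimal sketch of such a head, assuming the seven-tag inventory above (with OTH as the outside tag) and hidden sizes of 768 for mBERT or 1024 for XLM-R; this is illustrative rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

NUM_LABELS = 7  # B/I for PER, ORG, LOC plus OTH (assumed inventory)

class TokenClassificationHead(nn.Module):
    """Linear projection plus softmax over entity labels for each token."""
    def __init__(self, hidden_size: int, num_labels: int = NUM_LABELS):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the final encoder layer
        logits = self.classifier(hidden_states)
        # During training the unnormalized logits feed a cross-entropy loss;
        # the softmax here yields the per-token label distribution P(y_i | h_i).
        return torch.softmax(logits, dim=-1)
```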
Fine-tuning aligns the predicted label distributions with the annotated entities via cross-entropy loss and is carried out with the Hugging Face Transformers framework; a reduced learning rate is used for XLM-R to account for its larger capacity and to stabilize gradient flow under cross-lingual transfer.
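A sketch of this setup with the Transformers Trainer API follows. The checkpoint names are the public mBERT and XLM-R identifiers; the dataset objects, label inventory, and hyperparameter values (including the reduced XLM-R learning rate) are placeholders, since the exact settings are not spelled out here.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Assumed label inventory, following the annotation scheme above.
LABELS = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "OTH"]

def build_trainer(train_dataset, eval_dataset,
                  model_name="xlm-roberta-large", learning_rate=1e-5):
    """Assemble a token-classification fine-tuning run.

    learning_rate=1e-5 is a stand-in for the reduced XLM-R rate; for mBERT,
    pass model_name="bert-base-multilingual-cased" and a somewhat higher rate.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(LABELS)
    )
    args = TrainingArguments(
        output_dir="hiligayner-checkpoints",
        learning_rate=learning_rate,
        per_device_train_batch_size=16,
        num_train_epochs=5,
    )
    return Trainer(
        model=model,
        args=args,
        data_collator=DataCollatorForTokenClassification(tokenizer),
        train_dataset=train_dataset,   # tokenized, BIO-aligned splits (not shown)
        eval_dataset=eval_dataset,
    )

# trainer = build_trainer(train_dataset, eval_dataset)
# trainer.train()
```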
3. Evaluation Metrics and Empirical Results
Performance evaluation is conducted using precision, recall, and F1-score, all at the token level, across individual entity types and as a macro average. Both mBERT and XLM-RoBERTa achieve strong metrics, with macro F1-scores of approximately 0.86 for mBERT and nearly identical results for XLM-R.
| Model | Macro F1 | B-PER Precision | B-ORG F1 | B-LOC F1 |
|---|---|---|---|---|
| mBERT | ≈0.86 | 0.96 | ≈0.82 | ≈0.83 |
| XLM-R | ≈0.86 | 0.96 | ≈0.81 | ≈0.82 |
Person entity recognition exhibits the highest precision, correlating with category frequency and the presence of clear lexical anchors in Hiligaynon. Organization boundaries are the most challenging; error analysis reveals confusion with location boundaries and cases of entity nesting. Training stability is demonstrated by the validation loss decaying rapidly and plateauing below 0.05, with final F1 scores reaching 0.87–0.88 and no sign of overfitting.
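Token-level precision, recall, and F1 with a macro average can be computed from flattened gold and predicted tag sequences, for example with scikit-learn; the snippet below is illustrative and uses invented predictions rather than actual model output.

```python
from sklearn.metrics import classification_report

# Flattened gold and predicted tags over evaluation tokens (invented values).
gold = ["B-PER", "I-PER", "OTH", "B-LOC", "OTH", "B-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "OTH", "B-LOC", "OTH", "B-LOC", "I-ORG"]

# Per-label precision/recall/F1 plus the macro average used in the paper.
print(classification_report(gold, pred, digits=2, zero_division=0))
```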
4. Cross-Lingual Transfer and Comparative Analysis
To assess generalizability, both models were evaluated in zero-shot transfer mode on Cebuano and Tagalog datasets—languages aligned with Hiligaynon within the Central Philippine subgroup (Teves et al., 12 Oct 2025). No further model updates were applied; predictions relied solely on features learned from Hiligaynon training.
| Target Language | Macro F1 | Precision (relative) |
|---|---|---|
| Cebuano | 0.44–0.46 | Slightly higher |
| Tagalog | 0.44–0.46 | Slightly lower |
The results show a macro F1 of approximately 0.44–0.46 for both languages. Higher precision on Cebuano is attributed to its closer lexical and morphosyntactic affinity with Hiligaynon. While transfer scores are lower than in-language performance, the findings demonstrate promising baseline transfer for related low-resource languages, validating both the dataset and methodology for broader regional application.
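Operationally, zero-shot transfer amounts to running the Hiligaynon-fine-tuned checkpoint on Cebuano or Tagalog text with no further parameter updates. A hedged sketch with the Transformers pipeline API is shown below; the checkpoint path and example sentence are placeholders.

```python
from transformers import pipeline

# Hypothetical path to a checkpoint fine-tuned only on Hiligaynon data.
ner = pipeline(
    "token-classification",
    model="hiligayner-checkpoints/best",
    aggregation_strategy="simple",  # merge B-/I- pieces into entity spans
)

# Placeholder Tagalog sentence; no Tagalog data was seen during fine-tuning.
print(ner("Pumunta si Jose sa Maynila kahapon."))
```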
5. NER Methodologies in the Low-Resource Philippine Context
The development of HiligayNER fits within a broader research thrust examining how meta-learning and multilingual pretraining can address NER challenges under resource-constrained conditions (Africa et al., 2 Sep 2025). Contemporary meta-pretraining objectives—such as hybrid schemes combining autoregressive LM with first-order MAML—are shown to further enhance zero-shot transfer, especially for single-token person entities characterized by prominent surface cues (e.g., the si/ni case particles in Tagalog).
A plausible implication is that deploying meta-pretrained models or integrating episodic meta-learning steps during fine-tuning can sharpen token-level representations and accelerate convergence. For Hiligaynon, such approaches may strengthen entity boundary detection in contexts with clear lexical anchors and may be particularly effective for person name recognition.
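As a concrete illustration of what such an episodic step could look like, the sketch below implements one generic first-order MAML episode around a token-classification model; it is a hedged sketch under stated assumptions, not the procedure from Africa et al. (2 Sep 2025).

```python
import copy
import torch

def fomaml_episode(model, support_batch, query_batch, inner_lr=1e-4, inner_steps=1):
    """One first-order MAML episode for a token-classification model.

    Assumes the model returns an object with a .loss attribute when called
    with labeled batches, as Hugging Face token-classification models do.
    """
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)

    # Inner loop: adapt a clone of the model on the support set.
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        adapted(**support_batch).loss.backward()
        inner_opt.step()

    # Outer loss: evaluate the adapted weights on the query set.
    query_loss = adapted(**query_batch).loss
    query_loss.backward()

    # First-order approximation: use the query-set gradients of the adapted
    # weights as the meta-gradient for the original parameters; the caller's
    # outer optimizer then takes a step over model.parameters().
    for p, p_adapted in zip(model.parameters(), adapted.parameters()):
        if p_adapted.grad is not None:
            p.grad = p_adapted.grad.detach().clone()
    return query_loss.item()
```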
6. Implications, Limitations, and Future Directions
HiligayNER constitutes a foundational resource for Hiligaynon computational linguistics, offering an annotated corpus, robust baseline models, and open-source infrastructure. The paper demonstrates that modest dataset size combined with state-of-the-art multilingual Transformers can yield strong NER performance for an underrepresented language.
Key future directions include:
- Dataset Expansion: Increase volume and genre diversity; incorporate additional entity types (e.g., Event, Date).
- Span-level Modeling: Adopt finer-grained objectives and evaluation (span-level), potentially improving boundary detection.
- Domain-Adaptive Pretraining: Leverage region-specific texts for in-domain adaptation.
- Gazetteer Augmentation: Combine neural models with structured entity lists to aid rare or ambiguous entity resolution (a minimal sketch follows this list).
- Transfer Learning Optimization: Refine methodologies for leveraging syntactic and lexical similarities in zero-shot and few-shot settings, targeting improved cross-lingual robustness.
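To make the gazetteer item concrete, the sketch below applies a simple post-processing rule with an invented location list; it is illustrative and not a method described in the paper.

```python
# Hypothetical gazetteer of Western Visayas place names (illustrative only).
LOCATION_GAZETTEER = {"iloilo", "bacolod", "roxas", "kabankalan"}

def augment_with_gazetteer(tokens, predicted_tags):
    """Promote gazetteer hits the model left untagged to B-LOC.

    A deliberately simple rule; a real system would handle multi-word entries
    and weigh gazetteer matches against model confidence scores.
    """
    augmented = list(predicted_tags)
    for i, token in enumerate(tokens):
        if token.lower() in LOCATION_GAZETTEER and augmented[i] == "OTH":
            augmented[i] = "B-LOC"
    return augmented

print(augment_with_gazetteer(["Nagkadto", "kami", "sa", "Bacolod"],
                             ["OTH", "OTH", "OTH", "OTH"]))
# -> ['OTH', 'OTH', 'OTH', 'B-LOC']
```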
The ongoing development of cross-lingual NER systems for Philippine languages, coupled with advances in meta-learning objectives and model architecture optimization, lays groundwork for multilingual language technology in low-resource environments and supports equitable access to NLP tools across the region.