FilBench: Filipino NLP Benchmark for LLMs

Updated 13 August 2025
  • FilBench is a Filipino-centric benchmark evaluating LLMs on culturally and linguistically relevant tasks in Filipino, Tagalog, and Cebuano.
  • It measures performance across cultural knowledge, classical NLP, reading comprehension, and translation, scoring multiple-choice tasks by accuracy and generation tasks by ROUGE-L.
  • Empirical results highlight LLM strengths in classification yet reveal deficits in reading and generation, guiding future region-specific model improvements.

FilBench is a Filipino-centric benchmark designed to measure the capabilities of LLMs on tasks relevant to Filipino, Tagalog, and Cebuano, with emphasis on reflecting research priorities and trends in NLP within the Philippines. Originating in response to the lack of representation of Philippine languages in existing evaluation suites, FilBench rigorously assesses LLM performance over a diverse set of linguistic and cultural dimensions, enabling targeted diagnosis of model weaknesses and informing directions for regionally appropriate model development (Miranda et al., 5 Aug 2025).

1. Motivation and Definition

FilBench addresses the underrepresentation and poor assessment of Filipino and related languages in mainstream NLP benchmarks, where most existing frameworks and datasets focus on high-resource languages such as English. The benchmark is structured to capture not only language proficiency but also culturally situated knowledge and task-specific pragmatic nuances essential in Philippine linguistic contexts. The significance of this approach lies in its dual role: diagnosing model shortcomings (e.g., reading comprehension, translation fidelity) and driving the post-training collection and adaptation efforts required for effective Filipino-centric model advancement.

2. Benchmark Structure and Task Categories

FilBench comprises four major categories, each mapped to central priorities in the Filipino NLP research landscape:

Category | Target Competence | Example Datasets
Cultural Knowledge (CK) | Regional facts, word senses, values | INCLUDE, KALAHI
Classical NLP (CN) | Classification, NER, sentiment (MCF format) | Dengue Filipino, BalitaNLP, CebuaNER
Reading Comprehension (RC) | Text understanding, readability, NLI | Cebuano Readability Corpus, Belebele
Generation (GN) | Machine translation, natural-language output | Gold-standard translation challenges

Multiple-choice formulations (MCF) are standard for CK, CN, and RC, enabling direct comparison with gold labels. Generative tasks focus on translation due to dataset constraints, with evaluation based on overlap-oriented metrics.
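
To make the category structure concrete, the following is a minimal sketch of how the benchmark's task layout could be represented in code. The dataset names follow the table above, but the registry layout, class, and field names are illustrative assumptions rather than FilBench's actual implementation.

```python
# Hypothetical registry of FilBench's four categories. Dataset names come from
# the paper's description; the TaskSpec structure and field names are
# illustrative, not part of the official FilBench codebase.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    category: str        # "CK", "CN", "RC", or "GN"
    datasets: list[str]  # component datasets for the category
    answer_format: str   # "MCF" (multiple-choice) or "generation"
    metric: str          # "accuracy" or "rouge_l"

FILBENCH_TASKS = {
    "cultural_knowledge":    TaskSpec("CK", ["INCLUDE", "KALAHI"], "MCF", "accuracy"),
    "classical_nlp":         TaskSpec("CN", ["Dengue Filipino", "BalitaNLP", "CebuaNER"], "MCF", "accuracy"),
    "reading_comprehension": TaskSpec("RC", ["Cebuano Readability Corpus", "Belebele"], "MCF", "accuracy"),
    "generation":            TaskSpec("GN", ["gold-standard translation sets"], "generation", "rouge_l"),
}
```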

3. Evaluation Methodology and Scoring

FilBench adopts a unified strategy to aggregate model performance across categories:

  • Accuracy, $\text{Accuracy} = \frac{\text{Correct Responses}}{\text{Total Examples}}$, is used for MCF tasks (CK, CN, RC).
  • The ROUGE-L metric, which measures the longest common subsequence between model output and reference texts, is used for Generation tasks.
  • The overall FilBench Score is computed as a weighted average across all categories, proportional to their respective dataset sizes:

$$\text{FilBench Score} = 100 \times \frac{\sum_{i \in \{\text{CN},\,\text{CK},\,\text{RC},\,\text{GN}\}} n_i \cdot S_i}{\sum_{i \in \{\text{CN},\,\text{CK},\,\text{RC},\,\text{GN}\}} n_i}$$

where $n_i$ is the number of examples and $S_i$ the subscore for category $i$.
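
The scoring pipeline can be sketched directly from these definitions. The weighted-average function below mirrors the formula above; the ROUGE-L helper is a simple LCS-based approximation, and the function names and example numbers are illustrative assumptions, not the benchmark's reference implementation.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of MCF responses that match the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 via longest common subsequence over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    # dynamic-programming LCS table
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filbench_score(subscores: dict[str, float], sizes: dict[str, int]) -> float:
    """Weighted average of category subscores; weights are dataset sizes n_i."""
    total = sum(sizes[cat] for cat in subscores)
    return 100 * sum(sizes[cat] * subscores[cat] for cat in subscores) / total


# Example with made-up subscores (fractions in [0, 1]) and example counts:
print(filbench_score({"CK": 0.70, "CN": 0.85, "RC": 0.60, "GN": 0.35},
                     {"CK": 1200, "CN": 3000, "RC": 1500, "GN": 800}))
```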

This rigorous scoring protocol enables nuanced discrimination of model strengths and weaknesses by fine-grained linguistic skill and cultural relevance.

4. Evaluation Results and Analysis

Twenty-seven state-of-the-art LLMs were assessed using FilBench, including both commercial general-purpose models and those trained specifically for Southeast Asian language contexts. Key findings include:

  • The highest aggregate FilBench Score is achieved by GPT-4o (≈ 72.73%), with category subscores such as 89.03% for Classical NLP and 80.12% for Reading Comprehension, but only 46.48% for Generation.
  • SEA-LION v3 70B, adapted for Southeast Asian languages, reaches 61.07% overall, performing well on CK and CN (≈76.78% and ≈89.99%, respectively) but much lower on RC (53.56%) and GN (23.95%).
  • Models routinely excel on classification and knowledge recall, yet exhibit substantial deficits in reading comprehension and generation, especially in translation-oriented tasks—a result attributed to frequent hallucination, overgeneration, and unreliable instruction-following.

A figure in the source paper illustrates category-wise model performance, highlighting the marked drop in generative task results even among models optimized for Southeast Asian languages. This suggests current generative modeling techniques for Filipino remain unreliable compared to their classification-focused capabilities.

5. Implications for Filipino NLP Research and Practical Deployment

FilBench reveals that proficiency on high-resource benchmarks (e.g., English) does not transfer to low-resource Philippine languages. The observed discrepancies demonstrate several limitations:

  • LLMs' inadequate handling of complex reading comprehension and accurate text generation in Filipino restricts their applicability in domains such as education, law, and government communications.
  • The need for culturally and linguistically tailored benchmarks is substantiated, as FilBench exposes both linguistic and pragmatic gaps not captured by aggregate multilingual evaluations.
  • The trade-off between model sophistication and inference cost is highlighted, particularly because resource-constrained contexts in the Philippines demand efficient deployment. The source details this in analyses contrasting cost-effectiveness with parameter efficiency for models such as GPT-4o and SEA-LION (a toy calculation is sketched below).
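
As a purely illustrative aid, the kind of cost-effectiveness comparison mentioned above can be reduced to a "points per dollar" calculation. All model names, prices, and scores below are placeholder assumptions; the paper's actual analysis uses its own measured figures.

```python
# Toy cost-effectiveness comparison. Every name and number here is a
# placeholder assumption, not a figure from the FilBench paper.
models = {
    # name: (FilBench score, assumed USD cost per 1M output tokens)
    "large-commercial-model": (72.7, 10.00),
    "regional-70b-model": (61.1, 0.90),
}

for name, (score, cost_per_million_tokens) in models.items():
    # "FilBench points per dollar" for a nominal 1M-token evaluation budget
    print(f"{name}: {score / cost_per_million_tokens:.1f} points per dollar")
```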

6. Future Directions and Benchmark Development

Multiple pathways for expanding FilBench and enhancing LLM performance in Filipino are identified:

  • Improve generative accuracy, notably for translation tasks, by addressing failure modes such as hallucination and faulty instruction-following. Few-shot prompting has shown preliminary benefit but does not fully resolve semantic deficits.
  • Augment Filipino-centric data for fine-tuning, including expanding parallel corpora and diverse generative datasets.
  • Extend FilBench beyond Tagalog and Cebuano, contingent on data availability for additional Philippine languages.
  • Refine task templates to better disambiguate prompts and strengthen cultural fidelity.
  • Forthcoming research could explore benchmark transferability and adaptation strategies across Southeast Asian languages, given empirically measured correlations (e.g., Spearman ρ ≈ 0.57–0.76 between FilBench and other SEA benchmarks); a minimal correlation check is sketched below.
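
A minimal sketch of such a cross-benchmark correlation check, assuming per-model scores are available for FilBench and one other SEA benchmark; the score arrays are placeholders, not reported values.

```python
# Spearman rank correlation between per-model FilBench scores and scores on
# another SEA benchmark. The arrays below are hypothetical placeholders.
from scipy.stats import spearmanr

filbench_scores = [72.7, 61.1, 55.3, 48.9, 40.2]   # hypothetical per-model scores
other_sea_scores = [68.0, 63.5, 50.1, 52.4, 37.8]  # same models, another SEA benchmark

rho, p_value = spearmanr(filbench_scores, other_sea_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```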

A plausible implication is that continuous and region-specific fine-tuning, coupled with robust evaluation, will be necessary for practical deployment of LLMs in the Philippines, especially for use cases requiring trustworthy generation and text comprehension.

7. Conclusion

FilBench establishes itself as a comprehensive, authoritative benchmark for the evaluation of LLMs on Filipino-language tasks. By integrating multiple dimensions—cultural knowledge, classical NLP, reading comprehension, and natural language generation—FilBench provides a granular picture of model performance relevant for both academic research and practical innovation in Philippine language technologies. The benchmark’s aggregation formula, rigorous scoring, and empirical results collectively underscore the challenges in generative modeling for Filipino and inform the next steps in building regionally appropriate NLP solutions (Miranda et al., 5 Aug 2025). Advances will require both expansion of FilBench and targeted development of Filipino-centric training data and post-training methodologies.
