
FilBench: Can LLMs Understand and Generate Filipino? (2508.03523v1)

Published 5 Aug 2025 in cs.CL

Abstract: Despite the impressive performance of LLMs on English-based tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs suffer from reading comprehension and translation capabilities. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving only a score of 72.23%. Moreover, we also find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing model, SEA-LION v3 70B, achieving only a score of 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to aid in driving progress on Filipino NLP and increasing the inclusion of Philippine languages in LLM development.

Summary

  • The paper evaluates state-of-the-art LLMs on Filipino using a benchmark covering cultural, classical NLP, reading comprehension, and generation tasks.
  • It reveals strong performance in classical NLP tasks but significant failures in generation and translation, with GPT-4o leading at 72.23% accuracy.
  • Language-specific finetuning and few-shot prompting are shown to enhance model performance for low-resource Filipino language tasks.

"FilBench: Can LLMs Understand and Generate Filipino?" Review

Introduction

"FilBench" explores the capabilities of LLMs in understanding and generating content in Filipino, a language with unique linguistic characteristics and substantial speaker population. The paper introduces FilBench, a benchmark specifically designed to assess the proficiency of LLMs in Filipino, Tagalog, and Cebuano. Figure 1

Figure 1: Overview of FilBench illustrating four categories and twelve subtasks across major Philippine languages.

Despite considerable development in LLM technologies, their effectiveness for languages like Filipino remains under-addressed. The benchmark reflects Philippine NLP's research priorities encompassing Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. Through a detailed evaluation of 27 state-of-the-art models, it highlights existing models' limitations, notably in reading comprehension and translation, and advocates for tailored model training.

FilBench Composition

FilBench includes diverse tasks aimed at thoroughly assessing LLM performance in Philippine languages. These tasks are derived from curated datasets reflecting the aforementioned categories:

  • Cultural Knowledge examines models' grasp of culturally specific contexts, using resources like regional quizzes and word-sense disambiguation tasks.
  • Classical NLP includes traditional tasks like named entity recognition and sentiment analysis, relying on datasets such as CebuaNER and TLUnified-NER.
  • Reading Comprehension includes natural language inference and readability tests adapted from sources like Belebele and the Cebuano Readability Corpus.
  • Generation involves translation tasks from English and Cebuano to Filipino, based on datasets like NTREX-128 and Tatoeba (Figure 2).

    Figure 2: Performance trends in FilBench indicate effective performance in Classical NLP but poor execution in Generation.

This structure ensures comprehensive language evaluation, particularly focusing on low-resource scenarios prevalent with these languages. FilBench's scoring system employs a weighted average across categories, enabling clear performance comparison.
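The category-weighted scoring described above can be sketched as follows. The four category names match FilBench's, but the per-task scores and equal weights below are illustrative placeholders, not the paper's published values:

```python
# Sketch of a category-weighted benchmark score: average task scores
# within each category, then take a weighted average across categories.
# All numbers below are hypothetical, not FilBench's actual results.

def filbench_score(task_scores, category_weights):
    """Return the weighted average of per-category mean scores."""
    category_means = {
        cat: sum(scores) / len(scores)
        for cat, scores in task_scores.items()
    }
    total_weight = sum(category_weights.values())
    return sum(
        category_weights[cat] * category_means[cat]
        for cat in category_means
    ) / total_weight

task_scores = {
    "Cultural Knowledge": [0.70, 0.65],
    "Classical NLP": [0.85, 0.80, 0.90],
    "Reading Comprehension": [0.60, 0.55],
    "Generation": [0.20, 0.15],
}
category_weights = {cat: 1.0 for cat in task_scores}  # equal weights

print(filbench_score(task_scores, category_weights))
```

Averaging within categories first prevents categories with many subtasks (e.g., Classical NLP's multiple NER datasets) from dominating the overall score.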

Evaluation Results

FilBench's results reveal significant performance disparities among evaluated models. GPT-4o emerges as the leading model with 72.23% overall accuracy, but critical failures are noted in generation tasks, where scores fall significantly below expectation (average Generation score: 17.03%). SEA-specific models demonstrate parameter efficiency: models like SEA-LION v3 achieve competitive scores (61.07%) relative to their size, despite still trailing on Filipino-specific challenges (Figure 3).

Figure 3: Parameter-efficiency of LLMs indicating SEA-specific models' relative efficacy.

Notably, language-specific finetuning markedly enhances proficiency. Models finetuned on Southeast Asian languages outperform general-purpose LLMs of comparable size, implying substantial benefits from focused training datasets. Additionally, few-shot prompting mitigates Generation task failures, highlighting a practical method for improving translation performance.
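The few-shot setup could be approximated as below; the instruction wording and demonstration pairs are invented for illustration and may differ from FilBench's actual prompts:

```python
# Hypothetical sketch of a few-shot translation prompt. Demonstration
# pairs show the model the expected output format before the test
# sentence, which helps curb overgeneration and format drift.

def build_few_shot_prompt(examples, source_sentence):
    """Assemble an instruction, demonstration pairs, and the test input."""
    lines = ["Translate the following English sentences to Filipino."]
    for src, tgt in examples:
        lines.append(f"English: {src}")
        lines.append(f"Filipino: {tgt}")
    # The final line ends with the target-language cue so the model
    # continues with the translation rather than extra commentary.
    lines.append(f"English: {source_sentence}")
    lines.append("Filipino:")
    return "\n".join(lines)

examples = [
    ("Good morning.", "Magandang umaga."),
    ("Thank you very much.", "Maraming salamat."),
]
prompt = build_few_shot_prompt(examples, "Where is the library?")
print(prompt)
```

Anchoring the output format with demonstrations is one plausible reason few-shot prompting reduces the instruction-following failures noted in the Generation results.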

Analysis of Model Behavior

The paper analyzes cases where LLMs perform well or fail on Filipino-centric tasks:

  • Agreement on Correct Answers: Models often agree with human judgment on text classification and NER tasks when prompts are correctly structured.
  • Disagreement and Failure Modes: Elevated discrepancies occur in nuanced tasks such as cultural knowledge assessments, with models inadequately processing complex context, often resulting from overgeneration and incorrect instruction adherence.
  • Translation Challenges: Common failures in translation reveal instruction-following inconsistency and hallucination, especially noticeable in Tatoeba translations. Models like SEA-LION v3 70B still struggle with examples requiring semantic fidelity (Figure 4).

    Figure 4: Few-shot examples demonstrate improved generation scores mitigating instruction-following issues.

Implications and Future Developments

The insights from FilBench highlight the pressing need for targeted model development to enhance Filipino language processing. There is potential for building Filipino-specific LLMs, given the effective parameter utilization shown by SEA-specific models. FilBench underscores critical areas for computational focus, such as data augmentation for generation tasks, and emphasizes community-driven benchmarks like Batayan.

(Likely future work includes augmentation of task datasets, expansion into more Philippine languages, and leveraging cross-lingual transfer for improved model adaptability.)

Conclusion

FilBench offers a critical appraisal of LLM capabilities regarding Filipino languages, illustrating performance limits in generation tasks while providing a framework for language-focused LLM improvements. Such benchmarks are instrumental in directing AI advancements towards inclusivity and effective linguistic representation. Future developments should consider enriching the language-model space further, ensuring the advancement of AI that respects and caters to linguistic diversity.
