KatotohananQA: Filipino LLM Truthfulness Benchmark

Updated 14 September 2025

KatotohananQA is a benchmark that evaluates truthfulness in Filipino LLMs by adapting the established TruthfulQA framework.
It employs a binary-choice methodology to address semantic preservation and cultural translation challenges in low-resource settings.
Comparative analysis reveals significant performance gaps between English and Filipino outputs, guiding improvements in multilingual LLM evaluation.

KatotohananQA is a benchmark designed to evaluate truthfulness in LLMs for the Filipino language, grounded in a systematic translation and adaptation of the widely used TruthfulQA framework. As LLMs are increasingly deployed in diverse linguistic settings, KatotohananQA addresses an essential gap by quantifying truthfulness, hallucination, and multilingual robustness for Filipino—a low-resource language in the context of modern AI benchmarks. The following sections provide a detailed, technical overview of KatotohananQA, including its construction, evaluation methodology, comparative performance analysis, observed disparities, and broader implications for research on multilingual LLM truthfulness (Nery et al., 7 Sep 2025).

1. Construction and Adaptation of the Benchmark

KatotohananQA was created by translating the complete English TruthfulQA benchmark into Filipino, ensuring semantically equivalent, high-fidelity question representation. The process required addressing several linguistic and cultural challenges:

Semantic Preservation: Translators prioritized faithful mapping of concepts, carefully disambiguating polysemous terms and adapting idiomatic expressions into context-fitting Filipino constructs.
Handling Culture-Specific Content: U.S.-centric references, idioms, and named entities were transliterated or, when appropriate, annotated to preserve meaning. For debates with equally plausible answers (e.g., misconceptions prevalent in U.S. culture), the original context was maintained to preserve cross-lingual comparability.
Consistency and Parallelism: Strict guidelines ensured that each Filipino question aligns one-to-one with its English counterpart, supporting both bilingual and cross-lingual evaluations.
Proofreading and Quality Control: Multiple rounds of review were implemented to control for translation artifacts, avoid semantic drift, and validate question clarity and neutrality.

2. Evaluation Framework and Methodology

KatotohananQA adopts a binary-choice framework for assessing model outputs:

Binary-Choice Setup: For each question, models must choose between two candidate answers: the correct (truthful) one and a plausible but false counterpart, mirroring the adversarial intent of the original TruthfulQA benchmark.
Automated Evaluation: Responses are scored for truthfulness—did the model select the correct answer? The framework allows straightforward accuracy computations and aligns with standardized truthfulness evaluation procedures.
Model Selection: Seven free-tier proprietary LLMs, including notable recent generations such as GPT-5 and GPT-5 mini from OpenAI, were evaluated in both Filipino and English.
Zero-Shot and Direct Translation: To assess cross-lingual transfer, each model was prompted in both the original English and the translated Filipino, with no language-specific finetuning or adaptation.

3. Comparative Performance Analysis

Truthfulness Disparity: A significant performance gap exists between English and Filipino truthfulness scores across all evaluated models. For most, the decline is pronounced, indicating limited robustness of current commercially available LLMs in low-resource language scenarios.
Model Robustness: The latest OpenAI models (GPT-5 and GPT-5 mini) demonstrated the highest degree of multilingual robustness among the evaluated models, with notably smaller accuracy gaps between languages than prior generations.
Cross-Model Consistency: The overall trend shows that models generally perform best in English, with truthfulness degradation in Filipino paralleling observations from recent multilingual truthfulness benchmarking efforts in other language groups (Figueras et al., 13 Feb 2025, Bayes et al., 1 Dec 2024).

4. Disparities Across Question Characteristics

Analysis across question types reveals that:

Dimension	Robustness to Transfer	Noted Challenges
Knowledge Type	Universal > Time/context dependent	Socio-cultural, legal, and health questions often invoke context-specific knowledge not present in Filipino training data.
Category Sensitivity	Factual > Misconception	Misconception-laden questions exhibited greater accuracy drops, indicating difficulty in suppressing regionally specific false beliefs.
Format	Short factual > Open-ended	Open-ended or ambiguous questions suffered more from translation-induced clarity loss.

Disparities suggest that certain categories and phrasings (especially those requiring nuanced common-sense or time/contextual referencing) are less robust under language transfer. The benchmark highlights the compositional and distributional boundaries of LLM truthfulness when extending established English tasks into Filipino.

5. Implications and Recommendations for Multilingual LLM Evaluation

Benchmark Expansion Necessity: The observed variation across question types and model families indicates that truthfulness benchmarks should routinely cover multiple languages—especially low-resource languages like Filipino—to ensure equitable and reliable model deployment globally.
Multilingual Robustness as a Critical Evaluation Axis: Results confirm that even state-of-the-art LLMs, though improving, are not uniformly truthful across languages; this necessitates continued research on cross-lingual transfer, tokenization artifacts, and in-domain adaptation techniques for truthfulness.
Targeted Improvement: Patterns of reduced robustness in specific questions and categories can guide model developers toward improved multilingual alignment strategies, such as domain-adaptive training, tokenization normalization, or cultural knowledge augmentation.
Methodological Rigor: Future research should investigate the impact of translation methodology, including professional versus machine translation (Figueras et al., 13 Feb 2025), and develop tools for quantifying and minimizing translation-induced evaluation artifacts.

6. Broader Impact and Future Directions

KatotohananQA fills a critical benchmarking gap by providing the first systematic, validated truthfulness benchmark for Filipino. Its deployment supports:

Fairness in Language Technology: By revealing performance bottlenecks in low-resource settings, it draws attention to the risk of perpetuating inequality in LLM evaluation and deployment.
Model Selection and Policy: Regulators and practitioners can use KatotohananQA results as a standard for evaluating whether an LLM meets minimum truthfulness criteria for Filipino-language applications.
Research Frontiers: The benchmark is positioned to be extended to more Philippine languages and dialects, and to serve as a backbone for cross-lingual truthfulness research, a domain with emergent importance in the post-English-dominant phase of LLM evaluation (Figueras et al., 13 Feb 2025, Bayes et al., 1 Dec 2024).

In sum, KatotohananQA provides a robust, nuanced, and urgently needed framework for understanding and improving the truthfulness of LLMs in Filipino, establishing empirical foundations for multilingual fairness, accountability, and safety in AI. The evidence from initial benchmarking both motivates and grounds future research aimed at closing the gap in truthful LLM performance across the world’s languages.