
IberBench: LLM Evaluation on Iberian Languages (2504.16921v1)

Published 23 Apr 2025 in cs.CL

Abstract: LLMs remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental NLP capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.

Summary

IberBench: A Multilingual Benchmark for Evaluating LLMs on Iberian Languages

The paper presents IberBench, a benchmark designed to evaluate the performance of LLMs on languages spoken across the Iberian Peninsula and Ibero-America. The benchmark responds to the growing need for evaluation platforms that extend beyond the largely English-centric focus of existing benchmarks. IberBench targets languages with large speaker populations yet limited representation in LLM evaluations, including Spanish, Portuguese, Catalan, Basque, and Galician, as well as Spanish varieties such as Mexican and Cuban Spanish.

Core Contributions

The paper outlines several key contributions made through IberBench:

  1. Dataset Integration: IberBench integrates a total of 101 datasets collected from evaluation campaigns such as IberLEF, IberEval, TASS, and PAN. These datasets span 22 task categories, including sentiment analysis, emotion detection, and several industry-relevant tasks not commonly prioritized in conventional LLM evaluations (a sketch of the normalization step appears after this list).
  2. Scalability and Extensibility: Unlike static benchmarks, IberBench is designed for scalability and extensibility, enabling continual integration of novel datasets and model submissions. This adaptability is supported by open-source implementations and an accessible leaderboard for ongoing community interaction and contribution.
  3. Empirical Evaluation: The authors evaluate 23 LLMs ranging from 100 million to 14 billion parameters, highlighting their strengths and limitations on tasks requiring fundamental NLP capabilities as well as those of industrial relevance.
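
To make the dataset-normalization step concrete, here is a minimal sketch, not the paper's actual code, of how heterogeneous shared-task records could be mapped onto one unified schema. The `IberBenchExample` dataclass, its field names, and the `normalize` helper are illustrative assumptions; the open-source pipeline defines its own format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IberBenchExample:
    """Hypothetical unified record for one normalized benchmark example."""
    text: str                      # raw input text
    label: str                     # gold label in a shared label space
    task_category: str             # e.g. "sentiment_analysis"
    language: str                  # ISO 639-1 code, e.g. "gl" for Galician
    variety: Optional[str] = None  # e.g. "es-MX" for Mexican Spanish
    source: str = ""               # originating campaign, e.g. "TASS"

def normalize(raw: dict, task_category: str, language: str,
              source: str, variety: Optional[str] = None) -> IberBenchExample:
    """Map one raw shared-task record (assumed field names) onto the schema."""
    return IberBenchExample(
        text=raw["text"],
        label=str(raw["label"]).lower(),
        task_category=task_category,
        language=language,
        variety=variety,
        source=source,
    )

# Usage: normalizing a toy sentiment record in the style of a TASS dataset.
example = normalize(
    {"text": "¡Qué gran día!", "label": "P"},
    task_category="sentiment_analysis",
    language="es",
    source="TASS",
)
print(example)
```

Normalizing every dataset into one record shape is what makes incremental evaluation and community dataset submissions tractable: a new dataset only needs a mapping function, not changes to the evaluation loop.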

Evaluative Insights

The paper establishes several findings from the evaluations conducted using IberBench:

  • Task Performance Divergence: LLMs generally perform worse on industry-relevant tasks than on fundamental language tasks such as reading comprehension or question answering, pointing to a gap that LLM development must close to improve real-world applicability.
  • Linguistic Challenges in Lesser-Resourced Languages: Galician and Basque, despite their notable speaker populations, present substantial challenges. The scarcity of resources for these languages is reflected in the comparatively lower performance of LLMs, underscoring the necessity for continued investment in resource creation for these languages.
  • Benchmarking Against Existing Systems: In several tasks, such as sentiment analysis and irony detection, LLMs beat random baselines but still fall short of the top shared-task systems. This suggests that while LLM capabilities are advancing, substantial room for improvement remains, particularly through task-specific fine-tuning (a sketch of a random-baseline comparison follows this list).
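
As a concrete illustration of the "above random, below shared-task systems" comparison, the following is a minimal self-contained sketch, assuming macro-F1 as the task metric; it is not the paper's evaluation code. It estimates the score a uniform random classifier achieves on a labeled test set, giving a floor against which a model's score and the published shared-task ceiling can be read.

```python
import random

def macro_f1(gold, pred):
    """Macro-averaged F1 over the label set present in the gold annotations."""
    scores = []
    for lab in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

def random_baseline(gold, trials=100, seed=0):
    """Estimate a uniform random classifier's macro-F1 by repeated sampling."""
    rng = random.Random(seed)
    labels = sorted(set(gold))
    runs = [macro_f1(gold, [rng.choice(labels) for _ in gold])
            for _ in range(trials)]
    return sum(runs) / len(runs)

# Usage on a toy three-class sentiment test set.
gold = ["pos", "neg", "neu", "pos", "neg", "pos"]
print(f"random baseline macro-F1 ≈ {random_baseline(gold):.3f}")
```

In terms of the paper's findings, some tasks sit near this random floor, while others land between the floor and the best shared-task systems.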

Implications and Future Research Directions

IberBench holds significant implications for future developments in AI, particularly at the intersection of language technology and multilingual model training. As LLMs continue to evolve, benchmarks like IberBench will be integral in guiding the focus of research efforts and identifying areas requiring enhancement. The disparity observed between performance on fundamental vs. industry-relevant tasks suggests future research should prioritize holistic improvements that encompass real-world applicability. Moreover, leveraging IberBench's open-source framework can stimulate collaborative research efforts aimed at enhancing language support across lesser-resourced languages.

Looking ahead, the researchers plan to broaden IberBench’s dataset pool as more evaluation campaigns are conducted, fostering a richer linguistic diversity and broader task representation. This initiative, coupled with the active engagement of the research community, holds promise for advancing the capabilities of multilingual LLMs and ensuring more equitable language representation in AI evaluations.
