Latxa: An Open Language Model and Evaluation Suite for Basque (2403.20266v2)

Published 29 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Latxa, a family of LLMs for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,774 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.

Latxa: A New Open LLM for Basque with an Evaluation Suite

Introduction to Latxa's Contributions

The introduction of Latxa marks a significant stride towards inclusivity and diversity in AI's language capabilities, both for Basque-speaking communities and for the broader linguistic research community. This open LLM not only fills a resource gap for a low-resource language but does so effectively, outperforming all previously available open models for Basque by a large margin. The paired release of an evaluation suite tailored to Basque further underscores the holistic approach the researchers take in this endeavor.

Overview of Latxa

Latxa is a family of LLMs designed specifically for the Basque language, with sizes ranging from 7 billion to 70 billion parameters. The models are built on Llama 2, which the authors continue pretraining on an expansive new Basque corpus of 4.3 million documents and 4.2 billion tokens. To counterbalance the lack of well-established benchmarks for Basque, the work also introduces four multiple-choice evaluation datasets: EusProficiency, EusReading, EusTrivia, and EusExams. Each dataset measures a different aspect of language comprehension and proficiency, ranging from official language exams to trivia on local knowledge areas.
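As a concrete starting point, the sketch below shows how a released checkpoint could be loaded for Basque text generation with Hugging Face transformers. The repository ID is an assumption based on the release; check the official model card for the exact names.

```python
# Minimal sketch: loading a Latxa checkpoint with Hugging Face transformers.
# "HiTZ/latxa-7b-v1" is an assumed repository ID, not confirmed by the paper itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/latxa-7b-v1"  # assumed hub ID for the 7B model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Euskal Herriko hiriburuak dira"  # Basque: "The capitals of the Basque Country are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```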

Performance and Capabilities

The evaluation of Latxa demonstrates its superiority over previously available open models, with the 70-billion-parameter variant outperforming the nearest competitor by nearly 19 percentage points on average. Notably, it also achieves competitive results against GPT-4 Turbo in language proficiency and understanding, despite falling short on tasks demanding extensive reading comprehension or in-depth knowledge. This is particularly promising, as it points to Latxa's core linguistic competence in Basque and suggests a path for future gains in low-resource language modeling.
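Since all four benchmarks are multiple choice, base models like Latxa are typically scored by comparing the log-likelihood the model assigns to each candidate answer given the question. The sketch below is an illustrative reimplementation of that standard scoring scheme, not the authors' exact evaluation harness:

```python
# Illustrative log-likelihood scoring for a multiple-choice item. This mirrors the
# standard approach of common evaluation harnesses; it is not the authors' exact code.
import torch

def choice_logprob(model, tokenizer, question: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..T-1
    targets = full_ids[0, 1:]
    token_lps = log_probs[torch.arange(targets.size(0)), targets]
    # Keep only the log-probs of the continuation (answer) tokens. Note: BPE may merge
    # tokens at the question/answer boundary, so this split is approximate.
    return token_lps[prompt_ids.size(1) - 1 :].sum().item()

def predict(model, tokenizer, question: str, choices: list[str]) -> int:
    """Return the index of the highest-scoring candidate answer."""
    scores = [choice_logprob(model, tokenizer, question, c) for c in choices]
    return max(range(len(scores)), key=scores.__getitem__)
```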

The Significance of Open Resources

This work places a strong emphasis on the value of openness, offering not just the model but also the pretraining corpus and evaluation datasets under open licenses. Such an approach not only facilitates reproducible research and further model improvements but also encourages a communal effort towards advancing language technology for less-supported languages. The Latxa model sets a precedent for leveraging open ecosystems in mitigating the challenges faced by low-resource languages in the field of AI.
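In practice, the open release means the evaluation data can be pulled directly from the Hugging Face Hub. The hub ID and split name below are assumptions based on the dataset names in the paper; consult the release page for the exact identifiers.

```python
# Sketch: loading one of the released evaluation sets with the `datasets` library.
# The hub ID and split name are assumptions, not confirmed by the paper itself.
from datasets import load_dataset

eustrivia = load_dataset("HiTZ/EusTrivia", split="test")  # assumed identifier
print(eustrivia[0])  # expected fields: question text, candidate answers, gold index
```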

Future Directions and Theoretical Implications

The findings suggest significant potential for continued pretraining as a strategy for enhancing LLMs in low-resource contexts. Given Latxa's success, there is a clear pathway for incorporating stronger base models as they become available, continually elevating the performance ceiling. The research also contributes to an ongoing conversation about the transferability of language-agnostic capabilities in AI, highlighting the distinction between general knowledge and linguistic competence.
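To make the strategy concrete, the following is a minimal sketch of continued pretraining: resuming causal-language-model training of an existing base model on new-language text. It is illustrative only; the authors' actual training stack, data pipeline, and hyperparameters differ.

```python
# Minimal sketch of continued pretraining: resume causal-LM training of a base model
# on new-language text. Illustrative only; not the authors' actual training setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_id = "meta-llama/Llama-2-7b-hf"  # the base model Latxa continues from
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder corpus file; swap in the released Basque pretraining data.
corpus = load_dataset("text", data_files={"train": "basque_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="latxa-continued",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,  # assumes Ampere-or-newer GPUs
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```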

The release of Latxa and its accompanying resources not only represents a substantial technical achievement but also signals a broader commitment to linguistic diversity and accessibility within the field of AI. As research progresses, the integration of instruction-following capabilities and the acquisition of more diversified content for training are anticipated to further refine Latxa's utility and impact.

Ethical and Practical Considerations

The development of Latxa is guided by an awareness of the ethical implications associated with training LLMs, from carbon emissions associated with model training to the potential perpetuation of biases present in training data. By releasing comprehensive documentation, including a model card detailing intended uses, limitations, and ethical considerations, the researchers provide transparency and set a responsible precedent for future developments in the field.

In conclusion, the introduction of Latxa significantly contributes to the diversification of AI's linguistic capabilities, underscoring the importance of open resources and collaborative efforts in bridging the technological divide for low-resource languages.

Authors (9)
  1. Julen Etxaniz
  2. Oscar Sainz
  3. Naiara Perez
  4. Itziar Aldabe
  5. German Rigau
  6. Eneko Agirre
  7. Aitor Ormazabal
  8. Mikel Artetxe
  9. Aitor Soroa