GlórIA -- A Generative and Open Large Language Model for Portuguese (2402.12969v1)

Published 20 Feb 2024 in cs.CL and cs.AI

Abstract: Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful LLMs. These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware Language Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.
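The abstract describes CALAME-PT as a zero-shot language-modeling benchmark in which the model must predict the final word of a context, in the spirit of LAMBADA. As a rough illustration of that setup, below is a minimal evaluation sketch using Hugging Face transformers; the repository id NOVA-vision-language/GlorIA-1.3B, the greedy decoding settings, and the example sentence are assumptions for illustration, not details taken from the paper.

    # Minimal sketch of a CALAME-PT-style zero-shot check: the model reads a
    # context and must produce its final word. The model id and decoding
    # settings below are assumptions, not details from the paper.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "NOVA-vision-language/GlorIA-1.3B"  # assumed Hugging Face repo id

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    def predict_last_word(context: str) -> str:
        """Greedily generate a short continuation and return its first word."""
        inputs = tokenizer(context, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,           # a single word rarely needs more tokens
            do_sample=False,            # greedy decoding for a deterministic answer
            pad_token_id=tokenizer.eos_token_id,
        )
        # Keep only the newly generated tokens, then take the first word.
        continuation = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        ).strip()
        return continuation.split()[0] if continuation else ""

    # Hypothetical example: "Lisboa é a capital de" ("Lisbon is the capital of")
    print(predict_last_word("Lisboa é a capital de"))  # expected: "Portugal"

Per-example exact match between the predicted word and the reference last word would then be averaged into a benchmark accuracy over the full test set.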

Authors (3)
  1. Ricardo Lopes (3 papers)
  2. João Magalhães (35 papers)
  3. David Semedo (20 papers)
Citations (6)
