GlórIA -- A Generative and Open Large Language Model for Portuguese (2402.12969v1)

Published 20 Feb 2024 in cs.CL and cs.AI

Abstract: Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful LLMs. These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware Language Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.
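The abstract describes CALAME-PT as a zero-shot language-modeling benchmark in which the model must predict the final word of a context, in the spirit of LAMBADA. As a rough illustration of that setup, below is a minimal evaluation sketch using Hugging Face transformers; the repository id NOVA-vision-language/GlorIA-1.3B, the greedy decoding settings, and the example sentence are assumptions for illustration, not details taken from the paper.

    # Minimal sketch of a CALAME-PT-style zero-shot check: the model reads a
    # context and must produce its final word. The model id and decoding
    # settings below are assumptions, not details from the paper.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "NOVA-vision-language/GlorIA-1.3B"  # assumed Hugging Face repo id

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    def predict_last_word(context: str) -> str:
        """Greedily generate a short continuation and return its first word."""
        inputs = tokenizer(context, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,           # a single word rarely needs more tokens
            do_sample=False,            # greedy decoding for a deterministic answer
            pad_token_id=tokenizer.eos_token_id,
        )
        # Keep only the newly generated tokens, then take the first word.
        continuation = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        ).strip()
        return continuation.split()[0] if continuation else ""

    # Hypothetical example: "Lisboa é a capital de" ("Lisbon is the capital of")
    print(predict_last_word("Lisboa é a capital de"))  # expected: "Portugal"

Per-example exact match between the predicted word and the reference last word would then be averaged into a benchmark accuracy over the full test set.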

Authors (3)
  1. Ricardo Lopes (3 papers)
  2. João Magalhães (35 papers)
  3. David Semedo (20 papers)
Citations (6)
