Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family (2403.01897v2)

Published 4 Mar 2024 in cs.CL

Abstract: To foster the neural encoding of Portuguese, this paper contributes foundation encoder models that represent an expansion of the still very scarce ecosystem of LLMs specifically developed for this language that are fully open, in the sense that they are open source and openly distributed for free under an open license for any purpose, thus including research and commercial usages. Like most languages other than English, Portuguese is low-resourced in terms of these foundational language resources, with only the inaugural 900 million parameter Albertina and the 335 million parameter BERTimbau. Taking these two models as a starting set, we present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters. While achieving this primary goal, further results that are relevant for this ecosystem were obtained as well, namely new datasets for Portuguese based on the SuperGLUE benchmark, which we also distribute openly.
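Since the models are described as openly distributed (e.g. via the Hugging Face Hub), they can in principle be loaded with the standard transformers API. The sketch below is for illustration only; the model identifier is an assumption and should be checked against the actual released checkpoints from the authors' organization.

```python
# Minimal sketch: loading an open Portuguese encoder and extracting embeddings.
# The model identifier below is an assumption for illustration, not confirmed
# by the paper; consult the released checkpoints for the exact name.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "PORTULAN/albertina-1b5-portuguese-ptpt-encoder"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentence = "A Albertina é um modelo de língua para o português."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings from the final encoder layer.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```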
