PeLLE: Encoder-based language models for Brazilian Portuguese based on open data (2402.19204v1)
Published 29 Feb 2024 in cs.CL
Abstract: In this paper we present PeLLE, a family of LLMs based on the RoBERTa architecture for Brazilian Portuguese, trained on curated, open data from the Carolina corpus. Aiming at reproducible results, we describe details of the pretraining of the models. We also evaluate PeLLE models against a set of existing multilingual and PT-BR refined pretrained Transformer-based LLM encoders, contrasting the performance of large versus smaller-but-curated pretrained models on several downstream tasks. We conclude that several tasks perform better with larger models, but some tasks benefit from smaller but curated data in their pretraining.
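Since PeLLE models are RoBERTa-style encoders, evaluating them on a downstream task typically amounts to loading the pretrained checkpoint and fine-tuning a task head. The sketch below is not from the paper; it is a minimal, hedged example using the Hugging Face Transformers API, and the model identifier is a hypothetical placeholder for wherever the PeLLE (or another PT-BR encoder) checkpoint is published.

```python
# Minimal sketch: using a RoBERTa-style PT-BR encoder for sequence
# classification with Hugging Face Transformers. The model ID is a
# placeholder (assumption), not the paper's published checkpoint name.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "path/or/hub-id-of-pelle-checkpoint"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2  # e.g., a binary downstream task
)

# Encode one Brazilian Portuguese sentence and run a forward pass.
inputs = tokenizer(
    "Exemplo de frase em português brasileiro.",
    return_tensors="pt",
    truncation=True,
)
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```

From here, the usual recipe is to fine-tune this classification head on the task's training split (for instance with the `Trainer` API) and compare encoders of different sizes under the same setup, which is the kind of comparison the abstract describes.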