Advancing Neural Encoding of Portuguese with Transformer Albertina PT-* (2305.06721v2)
Abstract: To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, a strong model, DeBERTa, was used as a starting point, and its pre-training was carried out over Portuguese data sets, namely data sets we gathered for PT-PT and PT-BR, and the brWaC corpus for PT-BR. The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese. Both the Albertina PT-PT and PT-BR versions are distributed free of charge, under the most permissive license possible, and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese.
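Since both encoders are openly distributed, a natural way to experiment with them is through the Hugging Face Transformers library (cited below). The following is a minimal sketch of loading one of the encoders and extracting contextual embeddings for a Portuguese sentence; the model identifier used here is an assumption for illustration, and the actual identifiers should be taken from the authors' distribution page on the Hugging Face Hub.

```python
# Minimal sketch: load an Albertina checkpoint with Hugging Face Transformers
# and extract contextual embeddings for a Portuguese sentence.
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for the PT-PT encoder (illustrative only);
# check the authors' Hugging Face page for the published identifiers.
MODEL_ID = "PORTULAN/albertina-ptpt"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentence = "A Albertina é um modelo de língua para o português."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings: (batch, sequence length, hidden size)
print(outputs.last_hidden_state.shape)
```

For downstream tasks such as those used in the paper's evaluation, the same checkpoint would typically be loaded with a task-specific head (e.g., AutoModelForSequenceClassification) and fine-tuned on the Portuguese task data.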
- Towards a cleaner document-oriented multilingual crawled corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 4344–4355.
- Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4933–4946.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR).
- A neural probabilistic language model. Advances in Neural Information Processing Systems, 13.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186.
- Gomes, J. R. S. (2020). PLUE: Portuguese language understanding evaluation. https://github.com/ju-resplande/PLUE.
- Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
- MarIA: Spanish language models. Procesamiento del Lenguaje Natural, pages 39–60.
- DCEP - Digital Corpus of the European Parliament. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).
- DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.
- HuggingFace (2023). Hugging Face. https://huggingface.co/. Accessed: April 2023.
- Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
- The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Deep learning. Nature, 521(7553):436–444.
- CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Generating a European Portuguese BERT based model using content from Arquivo.pt archive. In Proceedings of the 23rd International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pages 280–288.
- A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
- To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP), pages 7–14.
- The ASSIN 2 shared task: a quick overview. In 14th International Conference on the Computational Processing of the Portuguese Language (PROPOR), pages 406–412. Springer.
- BioBERTpt—a Portuguese neural language model for clinical named entity recognition. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 65–72. Association for Computational Linguistics.
- BERTimbau: pretrained BERT models for Brazilian Portuguese. In Intelligent Systems: 9th Brazilian Conference (BRACIS), pages 403–417. Springer.
- ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137.
- Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.
- Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.
- The brWaC corpus: a new open resource for Brazilian Portuguese. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC).
- SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the EMNLP Workshop BlackboxNLP, pages 353–355.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Authors: João Rodrigues, Luís Gomes, João Silva, António Branco, Rodrigo Santos, Henrique Lopes Cardoso, Tomás Osório