Advancing Neural Encoding of Portuguese with Transformer Albertina PT-* (2305.06721v2)

Published 11 May 2023 in cs.CL

Abstract: To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, a strong model was used as a starting point, DeBERTa, and its pre-training was done over data sets of Portuguese, namely over data sets we gathered for PT-PT and PT-BR, and over the brWaC corpus for PT-BR. The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese. Both Albertina PT-PT and PT-BR versions are distributed free of charge and under the most permissive license possible and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese.
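
The abstract describes continued pre-training of a DeBERTa encoder on Portuguese data and free distribution of the resulting checkpoints. Purely as an illustration, the minimal sketch below shows how such an encoder could be loaded and used to obtain sentence embeddings with the Hugging Face transformers library; the checkpoint identifier PORTULAN/albertina-ptpt and the pooling step are assumptions for the sketch, not details taken from the paper.

# Minimal sketch: load a DeBERTa-style Portuguese encoder and extract embeddings.
# NOTE: "PORTULAN/albertina-ptpt" is an assumed identifier, not confirmed by the abstract.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "PORTULAN/albertina-ptpt"  # assumed name for the PT-PT variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentence = "A língua portuguesa é falada em vários continentes."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token representations into one sentence vector; the hidden size
# depends on the underlying DeBERTa configuration.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)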

References (34)
  1. Towards a cleaner document-oriented multilingual crawled corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 4344–4355.
  2. Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics (ACL-IJCNLP), pages 4933–4946.
  3. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR).
  4. A neural probabilistic language model. Advances in Neural Information Processing Systems, 13.
  5. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
  6. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  7. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.
  8. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186.
  9. Gomes, J. R. S. (2020). PLUE: Portuguese language understanding evaluation. https://github.com/ju-resplande/PLUE.
  10. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
  11. MarIA: Spanish language models. Procesamiento del Lenguaje Natural, pages 39–60.
  12. DCEP - Digital Corpus of the European Parliament. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).
  13. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.
  14. HuggingFace (2023). Hugging Face. https://huggingface.co/. Accessed: April 2023.
  15. Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86.
  16. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
  17. The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  18. Deep learning. Nature, 521(7553):436–444.
  19. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219.
  20. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  21. Generating a European Portuguese BERT based model using content from Arquivo.pt archive. In Proceedings of the 23rd International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pages 280–288.
  22. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
  23. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP), pages 7–14.
  24. The ASSIN 2 shared task: a quick overview. In 14th International Conference on the Computational Processing of the Portuguese Language (PROPOR), pages 406–412. Springer.
  25. BioBERTpt—a Portuguese neural language model for clinical named entity recognition. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 65–72. Association for Computational Linguistics.
  26. BERTimbau: pretrained BERT models for Brazilian Portuguese. In Intelligent Systems: 9th Brazilian Conference (BRACIS), pages 403–417. Springer.
  27. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137.
  28. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.
  29. Attention is all you need. Advances in Neural Information Processing Systems, 30.
  30. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.
  31. The brWaC corpus: a new open resource for Brazilian Portuguese. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC).
  32. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.
  33. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the EMNLP Workshop BlackboxNLP, pages 353–355.
  34. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Authors (7)
  1. João Rodrigues (17 papers)
  2. Luís Gomes (7 papers)
  3. João Silva (10 papers)
  4. António Branco (14 papers)
  5. Rodrigo Santos (10 papers)
  6. Henrique Lopes Cardoso (13 papers)
  7. Tomás Osório (2 papers)
Citations (41)
