An Analysis of PTT5: Pretraining and Validating the T5 Model on Brazilian Portuguese Data
This paper presents PTT5, an adaptation of the T5 model to Brazilian Portuguese obtained by further pretraining on the BrWac corpus. The paper demonstrates that pretrained language models perform better on tasks in a given language when they are pretrained on monolingual data in that language, rather than relying exclusively on multilingual or English-centric pretraining corpora. The research leverages a substantial corpus to refine T5's capabilities on Portuguese-specific tasks and showcases the advantages of adopting a dedicated Portuguese vocabulary alongside model pretraining.
Methodological Approach
The authors employ the BrWac corpus, a collection of roughly 3.5 million documents (about 2.7 billion tokens) crawled from Brazilian Portuguese web pages, as the core dataset for pretraining T5. Pretraining follows a denoising objective in which spans of the input are corrupted and the model learns to reconstruct them, in line with the setup of the original T5 work. The paper further introduces a custom Portuguese vocabulary, built with the SentencePiece library, intended to improve performance by reducing the subword fragmentation that Portuguese text suffers under vocabularies not built for the language.
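To make the vocabulary step concrete, the following is a minimal sketch of training a dedicated Portuguese subword vocabulary with the SentencePiece library. The corpus file path, vocabulary size, and training options shown here are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch: build a Portuguese subword vocabulary with SentencePiece.
# The corpus path, vocabulary size, and options below are assumptions for
# illustration; they are not taken from the paper.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="brwac_sample.txt",       # plain-text Portuguese corpus (hypothetical file)
    model_prefix="ptt5_pt_vocab",   # writes ptt5_pt_vocab.model and ptt5_pt_vocab.vocab
    vocab_size=32000,               # sized to mirror T5's original vocabulary (assumption)
    model_type="unigram",           # the subword model family used by T5's tokenizer
    character_coverage=0.9995,      # retain the accented characters common in Portuguese
)

# Segment a sentence with the newly trained vocabulary.
sp = spm.SentencePieceProcessor(model_file="ptt5_pt_vocab.model")
print(sp.encode("O corpus BrWac cobre páginas da web brasileiras.", out_type=str))
```

Training the vocabulary on in-language text lets frequent Portuguese words and affixes become single tokens instead of being split into many generic pieces.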
Beyond pretraining, the research explores fine-tuning on Portuguese-specific tasks. Performance is evaluated on two principal benchmarks: ASSIN 2, which involves semantic similarity and entailment prediction, and HAREM, which targets named entity recognition (NER). These evaluations provide insight into the model's ability to understand and generate Portuguese text.
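As an illustration of how the similarity task can be cast into T5's text-to-text format, here is a minimal fine-tuning step using the Hugging Face transformers library. The checkpoint name is the publicly released one and is assumed here; the example sentence pair and the choice to render the similarity score as a string are illustrative, not necessarily the paper's exact setup.

```python
# Minimal sketch: one fine-tuning step on an ASSIN 2-style similarity pair,
# treating the gold score as text to be generated (a text-to-text framing).
# The checkpoint name and example data are assumptions for illustration.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "unicamp-dl/ptt5-base-portuguese-vocab"  # assumed public checkpoint name
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

source = ("assin2 sentence1: Um homem está tocando violão. "
          "sentence2: Uma pessoa toca um instrumento.")
target = "3.5"  # gold similarity score on the 1-5 scale, rendered as a string

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
loss = model(**inputs, labels=labels).loss  # standard sequence-to-sequence cross-entropy
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```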
Experimental Results
PTT5 models consistently outperform their original T5 counterparts across benchmarks, demonstrating the merit of language-specific pretraining. Notably, on the ASSIN 2 similarity task, the variant with the custom Portuguese vocabulary achieved the best mean squared error (MSE) scores, suggesting a finely tuned semantic representation. In particular, PTT5 Base was competitive with BERTimbau, a BERT model pretrained on Brazilian Portuguese text. On the HAREM task, PTT5 again improved over the baseline T5 models, though it trailed slightly behind existing Portuguese BERT models.
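For reference, similarity predictions are scored against gold labels roughly as in the sketch below; the values are hypothetical, and Pearson correlation is shown alongside MSE because both are commonly reported for ASSIN 2 similarity.

```python
# Minimal sketch of scoring ASSIN 2-style similarity predictions.
# The gold and predicted values below are hypothetical.
import numpy as np
from scipy.stats import pearsonr

gold = np.array([4.0, 2.5, 1.0, 5.0, 3.0])   # gold similarity scores (1-5 scale)
pred = np.array([3.8, 2.9, 1.4, 4.6, 3.2])   # hypothetical model predictions

mse = float(np.mean((pred - gold) ** 2))
pearson = pearsonr(pred, gold)[0]
print(f"MSE = {mse:.3f}  Pearson = {pearson:.3f}")
```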
The research highlights the significance of a Portuguese-specific vocabulary for improved model performance. Numerical evaluations show that the improved lexical representation leads to faster convergence during fine-tuning, reinforcing how closely vocabulary specificity is tied to downstream task efficacy.
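One simple way to see the effect of the dedicated vocabulary is to compare how the original T5 tokenizer and the Portuguese tokenizer segment the same sentence; shorter, more word-like segmentations mean the model spends less capacity reassembling fragmented words. The checkpoint names below are the publicly released ones and are assumed here.

```python
# Minimal sketch: compare segmentation of a Portuguese sentence under the
# original T5 vocabulary and the custom Portuguese vocabulary.
# Checkpoint names are assumed public releases, not quoted from the paper.
from transformers import T5Tokenizer

sentence = "O modelo foi pré-treinado em páginas da web brasileiras."

tokenizers = {
    "t5-base (original vocab)": T5Tokenizer.from_pretrained("t5-base"),
    "ptt5 (Portuguese vocab)": T5Tokenizer.from_pretrained(
        "unicamp-dl/ptt5-base-portuguese-vocab"
    ),
}

for name, tok in tokenizers.items():
    pieces = tok.tokenize(sentence)
    # Fewer pieces indicate less fragmentation of Portuguese words.
    print(f"{name:28s} {len(pieces):3d} pieces: {pieces}")
```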
Discussion and Implications
This work underscores the potential of targeted pretraining strategies for improving model performance in specific languages. The introduction of a native-language vocabulary could serve as a precedent for similar efforts in other languages currently underserved by NLP resources, ensuring that linguistic nuances and contextual cues are appropriately captured.
Future work could further investigate the trade-off between model size and performance, particularly under computational resource constraints. Explorations of hybrid model architectures or alternative pretraining objectives may also yield further insights into optimizing large language models for nuanced linguistic tasks.
Conclusion
The PTT5 project exemplifies the utility of language-specific models in delivering strong performance on native-language tasks. By presenting a focused pretraining and fine-tuning strategy, the researchers make a compelling case for developing monolingual models, particularly for languages and communities with limited NLP and computational resources. The performance gains obtained by combining a dedicated vocabulary with pretraining on the BrWac corpus lay a solid foundation for continued exploration of language-specific model refinement.