Indexing Portuguese NLP Resources with PT-Pump-Up
Abstract: Recent advances in NLP rely on training processes that require large corpora. Accessing this data is often not trivial because resources are dispersed and the infrastructures that host them must be kept online and up to date. New developments in NLP are often compromised by the scarcity of data or the lack of a shared repository that serves as an entry point for the community. This is especially true for low- and mid-resource languages, such as Portuguese, which lack data and proper resource management infrastructures. In this work, we propose PT-Pump-Up, a set of tools that aims to reduce resource dispersion and improve the accessibility of Portuguese NLP resources. Our proposal is divided into four software components: a) a web platform to list the available resources; b) a client-side Python package to simplify the loading of Portuguese NLP resources; c) an administrative Python package to manage the platform; and d) a public GitHub repository to foster future collaboration and contributions. All four components are accessible at: https://linktr.ee/pt_pump_up