Indexing Portuguese NLP Resources with PT-Pump-Up

Published 27 Jan 2024 in cs.CL and cs.IR | (2401.15400v1)

Abstract: Recent advances in NLP are tied to training processes that require vast corpora. Accessing this data is often nontrivial because resources are dispersed and the infrastructures that host them must be kept online and up to date. New developments in NLP are often compromised by data scarcity or by the lack of a shared repository that serves as an entry point for the community. This is especially true for low- and mid-resource languages, such as Portuguese, which lack data and proper resource management infrastructures. In this work, we propose PT-Pump-Up, a set of tools that aim to reduce resource dispersion and improve the accessibility of Portuguese NLP resources. Our proposal is divided into four software components: a) a web platform to list the available resources; b) a client-side Python package to simplify the loading of Portuguese NLP resources; c) an administrative Python package to manage the platform; and d) a public GitHub repository to foster future collaboration and contributions. All four components are accessible at: https://linktr.ee/pt_pump_up
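
The idea of a single entry point over dispersed resources can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Resource` and `Catalogue` classes, their fields, and the `example.org` URLs are hypothetical and are not the PT-Pump-Up API, whose actual interface the abstract does not describe.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """One catalogue entry (hypothetical schema, not PT-Pump-Up's)."""
    name: str
    task: str       # e.g. "NER" or "sentiment"
    language: str   # e.g. "pt-PT" or "pt-BR"
    url: str        # where the data is actually hosted


@dataclass
class Catalogue:
    """A single lookup point over resources hosted in many places."""
    _entries: dict = field(default_factory=dict)

    def register(self, resource: Resource) -> None:
        # Reject duplicates so each name maps to exactly one location.
        if resource.name in self._entries:
            raise ValueError(f"duplicate resource: {resource.name}")
        self._entries[resource.name] = resource

    def find(self, task: str) -> list:
        # Case-insensitive match on the task label, sorted by name.
        wanted = task.lower()
        matches = [r for r in self._entries.values() if r.task.lower() == wanted]
        return sorted(matches, key=lambda r: r.name)

    def locate(self, name: str) -> str:
        # Resolve a resource name to its hosting URL.
        return self._entries[name].url


# Placeholder entries; the names and URLs are invented for this sketch.
catalogue = Catalogue()
catalogue.register(Resource("corpus-a", "NER", "pt-PT", "https://example.org/a"))
catalogue.register(Resource("corpus-b", "sentiment", "pt-BR", "https://example.org/b"))
catalogue.register(Resource("corpus-c", "ner", "pt-BR", "https://example.org/c"))

ner_names = [r.name for r in catalogue.find("NER")]
```

In this sketch the catalogue stores only metadata and locations, so the data itself can stay wherever it is hosted; a client-side loader would resolve a name through `locate` and then fetch from that URL.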
