DORE: A Dataset For Portuguese Definition Generation

Published 26 Mar 2024 in cs.CL and cs.LG | (2403.18018v2)

Abstract: Definition modelling (DM) is the task of automatically generating a dictionary definition for a specific word. Computational systems capable of DM have numerous applications that benefit a wide range of audiences. Because DM is treated as a supervised natural language generation problem, these systems require large annotated datasets to train the underlying machine learning models. Several DM datasets have been released for English and other high-resource languages. Although Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, no DM dataset is available for it. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep-learning-based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research on and study of Portuguese in wider contexts.
