
The Impact of Data Corruption on Named Entity Recognition for Low-resourced Languages (2208.04568v2)

Published 9 Aug 2022 in cs.CL and cs.AI

Abstract: Data availability and quality are major challenges in natural language processing for low-resourced languages. In particular, there is significantly less data available than for higher-resourced languages. This data is also often of low quality, rife with errors, invalid text or incorrect annotations. Many prior works focus on dealing with these problems, either by generating synthetic data, or filtering out low-quality parts of datasets. We instead investigate these factors more deeply, by systematically measuring the effect of data quantity and quality on the performance of pre-trained language models in a low-resourced setting. Our results show that having fewer completely-labelled sentences is significantly better than having more sentences with missing labels; and that models can perform remarkably well with only 10% of the training data. Importantly, these results are consistent across ten low-resource languages, English, and four pre-trained models.
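The abstract contrasts two ways NER training data can degrade: shrinking the number of fully-labelled sentences (quantity) versus keeping every sentence but losing some entity annotations (quality). The sketch below is one plausible way to simulate both conditions on BIO-tagged data. It is not the authors' released code; the function names and the span-dropping heuristic are assumptions for illustration.

```python
import random

def subsample_sentences(dataset, fraction, seed=0):
    """Quantity ablation: train on a random fraction of fully-labelled sentences."""
    rng = random.Random(seed)
    n = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, n)

def drop_entity_labels(dataset, fraction, seed=0):
    """Quality ablation (assumed procedure): keep every sentence but erase a
    random fraction of entity spans, overwriting their BIO tags with 'O' so
    the entity looks unannotated."""
    rng = random.Random(seed)
    corrupted = []
    for tokens, labels in dataset:
        new_labels = list(labels)
        i = 0
        while i < len(labels):
            if labels[i].startswith("B-"):
                # Find the end of this entity span (B- tag plus any I- tags).
                j = i + 1
                while j < len(labels) and labels[j].startswith("I-"):
                    j += 1
                if rng.random() < fraction:
                    new_labels[i:j] = ["O"] * (j - i)  # drop the whole span
                i = j
            else:
                i += 1
        corrupted.append((tokens, new_labels))
    return corrupted

# Toy (tokens, BIO labels) pairs; real experiments would use CoNLL-style corpora.
data = [
    (["Naledi", "visited", "Kampala"], ["B-PER", "O", "B-LOC"]),
    (["MTN", "opened", "in", "Lagos"], ["B-ORG", "O", "O", "B-LOC"]),
]
print(subsample_sentences(data, 0.5))  # fewer, but fully labelled, sentences
print(drop_entity_labels(data, 0.5))   # same sentences, some labels missing
```

Training an NER model on each output and comparing scores mirrors the comparison the abstract describes: fewer complete sentences against more sentences with missing labels.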

