
ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text (2401.16403v2)

Published 29 Jan 2024 in cs.CL

Abstract: Lexical normalization, a fundamental task in NLP, transforms words into their canonical forms. This process has been shown to benefit a range of downstream NLP tasks. In this work, we introduce Vietnamese Lexical Normalization (ViLexNorm), the first corpus developed for the Vietnamese lexical normalization task. The corpus comprises over 10,000 sentence pairs annotated by human annotators and sourced from public comments on Vietnam's most popular social media platforms. We evaluated several methods on the corpus; the best-performing system achieved an Error Reduction Rate (ERR; van der Goot, 2019a) of 57.74% relative to the Leave-As-Is (LAI) baseline. For extrinsic evaluation, a model trained on ViLexNorm demonstrates the positive impact of Vietnamese lexical normalization on other NLP tasks. Our corpus is publicly available exclusively for research purposes.
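
As a rough illustration of the metric named in the abstract: ERR is commonly defined in the lexical normalization literature as word-level accuracy normalized against the Leave-As-Is baseline, i.e. ERR = (accuracy_system - accuracy_LAI) / (1 - accuracy_LAI), where LAI simply copies every input token unchanged. The sketch below assumes that definition. The function names and the example tokens are hypothetical and are not taken from the paper or the corpus.

```python
from typing import Sequence


def accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of token positions where the prediction matches the gold form."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must be token-aligned")
    if not gold:
        raise ValueError("empty input")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def error_reduction_rate(
    raw: Sequence[str], pred: Sequence[str], gold: Sequence[str]
) -> float:
    """ERR under the assumed definition: accuracy normalized by the LAI baseline.

    raw  : original (noisy) tokens; the LAI baseline outputs these unchanged
    pred : tokens produced by the normalization system
    gold : human-annotated canonical tokens
    """
    baseline = accuracy(raw, gold)   # Leave-As-Is accuracy
    system = accuracy(pred, gold)    # system accuracy
    if baseline == 1.0:
        raise ValueError("nothing to normalize: LAI baseline is already perfect")
    return (system - baseline) / (1.0 - baseline)


# Illustrative tokens only (hypothetical, not drawn from ViLexNorm).
raw = ["ko", "bit", "gì"]
gold = ["không", "biết", "gì"]
pred = ["không", "bit", "gì"]   # the system fixes one of the two noisy tokens

err = error_reduction_rate(raw, pred, gold)
print(f"ERR = {err:.2f}")
```

Under this definition, a score of 0 means the system does no better than leaving the text untouched, 1 means every token that needed normalization was fixed with no new errors, and negative values mean the system introduced more errors than it corrected.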

References (41)
  1. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  2. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364, Nagoya, Japan. Asian Federation of Natural Language Processing.
  3. Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text, pages 126–135, Beijing, China. Association for Computational Linguistics.
  4. Normalization of Indonesian-English code-mixed Twitter data. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 417–424, Hong Kong, China. Association for Computational Linguistics.
  5. Universal Dependency parsing for Hindi-English code-switching. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 987–998, New Orleans, Louisiana. Association for Computational Linguistics.
  6. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.
  7. Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46.
  8. VSEC: Transformer-based model for Vietnamese spelling correction. In PRICAI 2021: Trends in Artificial Intelligence: 18th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2021, Hanoi, Vietnam, November 8–12, 2021, Proceedings, Part II 18, pages 259–272. Springer.
  9. Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 359–369, Atlanta, Georgia. Association for Computational Linguistics.
  10. CMC training corpus Janes-Tag 2.0.
  11. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 368–378, Portland, Oregon, USA. Association for Computational Linguistics.
  12. User-generated text corpus for evaluating Japanese morphological analysis and lexical normalization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5532–5541, Online. Association for Computational Linguistics.
  13. Emotion recognition for Vietnamese social media text. CoRR, abs/1911.09339.
  14. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
  15. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.
  16. Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.
  17. Vladimir I Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710. Soviet Union.
  18. Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0. Slovenian language resource repository CLARIN.SI.
  19. Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0.
  20. A large-scale dataset for hate speech detection on Vietnamese social media texts. In Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices, pages 415–426, Cham. Springer International Publishing.
  21. hinglishNorm - a corpus of Hindi-English code mixed sentences for text normalization. In Proceedings of the 28th International Conference on Computational Linguistics: Industry Track, pages 136–145, Online. International Committee on Computational Linguistics.
  22. Enhancing BERT for lexical normalization. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 297–306, Hong Kong, China. Association for Computational Linguistics.
  23. Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042, Online. Association for Computational Linguistics.
  24. On learning and representing social meaning in NLP: a sociolinguistic perspective. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 603–612, Online. Association for Computational Linguistics.
  25. A Vietnamese spelling correction system. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI ’23 Companion, pages 158–161, New York, NY, USA. Association for Computing Machinery.
  26. Normalization of Vietnamese tweets on Twitter. In Intelligent Data Analysis and Applications: Proceedings of the Second Euro-China Conference on Intelligent Data Analysis and Applications, ECC 2015, pages 179–189. Springer.
  27. Text normalization for named entity recognition in Vietnamese tweets. Computational Social Networks, 3:1–16.
  28. DaN+: Danish nested named entities and lexical normalization. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6649–6662, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  29. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  30. Wladimir Sidorenko. 2019. Sentiment analysis of German Twitter. arXiv preprint arXiv:1911.13062.
  31. BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association.
  32. Rob van der Goot. 2019a. MoNoise: A multi-lingual and easy-to-use lexical normalization tool. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 201–206, Florence, Italy. Association for Computational Linguistics.
  33. Rob van der Goot. 2019b. Normalization and parsing algorithms for uncertain input. Ph.D. thesis, University of Groningen.
  34. Norm it! lexical normalization for Italian and its downstream effects for dependency parsing. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6272–6278, Marseille, France. European Language Resources Association.
  35. MultiLexNorm: A shared task on multilingual lexical normalization. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 493–509, Online. Association for Computational Linguistics.
  36. A taxonomy for in-depth evaluation of normalization for user generated content. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  37. Detecting spam reviews on Vietnamese e-commerce websites. In Intelligent Information and Database Systems, pages 595–607, Cham. Springer International Publishing.
  38. Attention is all you need. Advances in neural information processing systems, 30.
  39. VnCoreNLP: A Vietnamese natural language processing toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 56–60, New Orleans, Louisiana. Association for Computational Linguistics.
  40. A log-linear model for unsupervised text normalization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 61–72, Seattle, Washington, USA. Association for Computational Linguistics.
  41. How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts. Natural Language Engineering, 25:651–674.