PyThaiNLP: Thai Natural Language Processing in Python (2312.04649v1)

Published 7 Dec 2023 in cs.CL

Abstract: We present PyThaiNLP, a free and open-source NLP library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.


Summary

An Overview of PyThaiNLP: Thai Natural Language Processing in Python

The paper "PyThaiNLP: Thai Natural Language Processing in Python" by Wannaphong Phatthiyaphaibun et al. introduces PyThaiNLP, an open-source NLP library built specifically for the Thai language. It outlines the library's motivation, functionality, datasets, and surrounding ecosystem, and emphasizes its role in advancing Thai NLP by gathering tools and resources in one place.

Context and Motivation

Historically, Thai language processing has been held back by limited linguistic resources. Unlike English and Chinese, which benefit from abundant datasets and tools, Thai has been underrepresented in NLP. The scarcity of open-source software and data has slowed the development of advanced applications for Thai. PyThaiNLP addresses this gap by offering a unified toolkit that integrates various models and datasets, making Thai NLP capabilities more widely accessible.

Key Functionalities

PyThaiNLP provides a suite of tools ranging from basic processing tasks to more sophisticated models:

  • Tokenization: Supports multiple algorithms for word and sentence tokenization, utilizing dictionary-based methods and conditional random fields.
  • Spell Checking: Implements several engines, including adaptations of well-known approaches such as Norvig's spelling corrector and SymSpell.
  • Transliteration and Phonetics: Offers functionalities for grapheme-to-phoneme conversion, Soundex algorithms, and transliteration systems.
  • Sequence Tagging: Includes models for named-entity recognition and part-of-speech tagging, leveraging pre-trained models such as WangchanBERTa.
  • Machine Translation and ASR: Provides machine translation models and automatic speech recognition (ASR) systems, developed in collaboration with AIResearch.in.th, with ASR systems trained on datasets such as Common Voice.
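
To make the dictionary-based tokenization idea concrete, here is a minimal sketch of greedy longest matching against a word list. It is an illustration only, not PyThaiNLP's implementation: the function name `longest_match_tokenize` and the toy vocabulary are hypothetical, and the library's own engines are more elaborate than this greedy variant.

```python
def longest_match_tokenize(text, vocabulary):
    """Greedy left-to-right longest matching against a word list.

    Hypothetical helper for illustration; not part of PyThaiNLP.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit a single character as an unknown token.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy vocabulary (an assumption for this example, not a real lexicon).
TOY_VOCABULARY = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"}

text = "ฉันรักภาษาไทย"  # "I love the Thai language"
tokens = longest_match_tokenize(text, TOY_VOCABULARY)
print(tokens)
```

Because Thai is written without spaces between words, the dictionary decides where boundaries fall: the sketch prefers the compound "ภาษาไทย" over the shorter "ภาษา" followed by "ไทย". A greedy strategy like this can pick a poor segmentation when an early long match blocks a better later one, which is one reason statistical approaches such as conditional random fields are also used.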

Development Milestones and Community Impact

Since its inception in 2016, PyThaiNLP has grown steadily, as shown by its release cycle and its expanding number of contributors. The creators emphasize user-friendliness, which is reflected in their adoption of interfaces familiar from widely used libraries such as NLTK. The collaboration with VISTEC-depa Thailand AI Research Institute has been pivotal, providing computational resources for training large-scale models and expanding the library's reach.

Practical and Theoretical Implications

PyThaiNLP has been adopted widely in both academia and industry. It supports diverse research efforts, including cross-lingual language model pretraining and universal dependency parsing. Its use in industries such as banking, telecommunications, and retail illustrates its effect on NLP capabilities in real-world settings.

The authors provide several industry use cases demonstrating PyThaiNLP's contributions to improving business outcomes through tasks like intent classification and recommendation engines.

Future Directions

The paper concludes by identifying key areas for future development:

  1. Domain-Specific Datasets/Models: Specialized models are needed to handle domain-specific tasks effectively, such as medical or legal document processing.
  2. Benchmarking: Establishing robust benchmarks for Thai NLP can enhance the evaluation and comparison of different models.
  3. Improved Consistency: Ensuring deterministic behavior in tokenization and sorting tasks is critical for keeping downstream applications reliable.
  4. Integration with Standard Libraries: The ultimate goal is seamless compatibility with language-agnostic tools to further diminish the dependency on specialized libraries.
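
As an illustration of what a benchmark for the tokenization task might measure, the sketch below scores a predicted segmentation against a reference segmentation using word-level precision, recall, and F1 over character spans. This is a common evaluation style for word segmentation in general, not a metric the paper prescribes; the function names `token_spans` and `word_f1` and the example data are hypothetical.

```python
def token_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for token in tokens:
        end = start + len(token)
        spans.add((start, end))
        start = end
    return spans


def word_f1(reference, predicted):
    """Word-level F1: a predicted word counts only if both of its boundaries match."""
    if "".join(reference) != "".join(predicted):
        raise ValueError("reference and prediction must segment the same text")
    ref_spans = token_spans(reference)
    pred_spans = token_spans(predicted)
    correct = len(ref_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction splits one compound word in two.
reference = ["ฉัน", "รัก", "ภาษาไทย"]
predicted = ["ฉัน", "รัก", "ภาษา", "ไทย"]
score = word_f1(reference, predicted)
print(round(score, 4))
```

Running the same scoring function twice on the same input also gives a simple check for the deterministic behavior mentioned in item 3: a tokenizer that returns different segmentations across runs would produce unstable scores.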

Conclusion

"PyThaiNLP: Thai Natural Language Processing in Python" not only fills a critical void in the Thai NLP landscape by providing comprehensive tools and datasets but also sets a foundation for research and industry adoption. As the library continues to evolve, it promises to catalyze advancements in NLP for low-resource languages, fostering greater inclusivity in the field of AI.
