PyThaiNLP: Thai Natural Language Processing in Python (2312.04649v1)
Published 7 Dec 2023 in cs.CL
Abstract: We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language, implemented in Python. It provides a wide range of software, models, and datasets for Thai. We first give a brief historical overview of the Thai language tools that predate PyThaiNLP. We then outline the functionality it provides, along with its datasets and pre-trained LLMs. Next, we summarize its development milestones and discuss our experience developing it. We conclude by showing how industrial and research communities use PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.
Authors: Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, Can Udomcharoenchaikit