Thai Universal Dependency Treebank (2405.07586v1)

Published 13 May 2024 in cs.CL

Abstract: Automatic dependency parsing of Thai sentences has been underexplored, as evidenced by the lack of large Thai dependency treebanks with complete dependency structures and the lack of a published systematic evaluation of state-of-the-art models, especially transformer-based parsers. In this work, we address these problems by introducing the Thai Universal Dependency Treebank (TUD), the largest Thai treebank to date, consisting of 3,627 trees annotated in accordance with the Universal Dependencies (UD) framework. We then benchmark dependency parsing models that incorporate pretrained transformers as encoders, training them on Thai-PUD and our TUD. The evaluation results show that most of our models outperform those reported in previous papers and provide insight into the optimal choice of components for Thai dependency parsers. The new treebank and the full predictions of every model in our experiments are made available in a GitHub repository for further study.
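
Since TUD follows the Universal Dependencies framework, its trees are presumably distributed in the standard CoNLL-U format, and parser quality on such data is conventionally reported as unlabeled and labeled attachment scores (UAS/LAS). The minimal sketch below illustrates that evaluation setup, assuming CoNLL-U inputs; the file names are hypothetical placeholders, not paths from the paper's repository.

```python
# Minimal UAS/LAS evaluation sketch for a CoNLL-U treebank such as TUD.
# Assumes gold and predicted files follow the standard 10-column CoNLL-U layout.

def read_conllu(path):
    """Yield sentences as lists of (head, deprel) pairs, skipping comment lines
    and multiword/empty tokens (IDs containing '-' or '.')."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
                continue
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue
            sentence.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    if sentence:
        yield sentence


def attachment_scores(gold_path, pred_path):
    """Return (UAS, LAS): the fraction of tokens with the correct head,
    and with both the correct head and dependency relation."""
    total = uas = las = 0
    for gold, pred in zip(read_conllu(gold_path), read_conllu(pred_path)):
        for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
            total += 1
            if g_head == p_head:
                uas += 1
                if g_rel == p_rel:
                    las += 1
    return uas / total, las / total


if __name__ == "__main__":
    # Hypothetical file names for a gold test split and a model's predictions.
    uas, las = attachment_scores("tud-test.conllu", "model-pred.conllu")
    print(f"UAS: {uas:.4f}  LAS: {las:.4f}")
```

Because both files are read sentence by sentence, this assumes the predicted file preserves the gold tokenization and sentence order, which is the usual setting when only heads and relations are predicted.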
