
ThaiCoref: Thai Coreference Resolution Dataset (2406.06000v1)

Published 10 Jun 2024 in cs.CL

Abstract: While coreference resolution is a well-established research area in NLP, research focusing on the Thai language remains limited due to the lack of large annotated corpora. In this work, we introduce ThaiCoref, a dataset for Thai coreference resolution. Our dataset comprises 777,271 tokens, 44,082 mentions and 10,429 entities across four text genres: university essays, newspapers, speeches, and Wikipedia. Our annotation scheme is built upon the OntoNotes benchmark with adjustments to address Thai-specific phenomena. Utilizing ThaiCoref, we train models employing a multilingual encoder and cross-lingual transfer techniques, achieving a best F1 score of 67.88% on the test set. Error analysis reveals challenges posed by Thai's unique linguistic features. To benefit the NLP community, we make the dataset and the model publicly available at http://www.github.com/nlp-chula/thai-coref .
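
To put the corpus figures in the abstract into proportion, the following minimal sketch derives two ratios from them: mentions per entity (average coreference chain length) and tokens per mention. Only the three counts come from the abstract; the function name `corpus_ratios` and the choice of ratios are illustrative assumptions, not statistics reported by the authors.

```python
# Counts quoted in the abstract.
TOKENS = 777_271
MENTIONS = 44_082
ENTITIES = 10_429


def corpus_ratios(tokens: int, mentions: int, entities: int) -> dict:
    """Return simple density ratios for a coreference corpus.

    Hypothetical helper for illustration; not part of the ThaiCoref release.
    """
    return {
        # Average number of mentions that refer to the same entity.
        "mentions_per_entity": mentions / entities,
        # Average number of tokens between the starts of successive mentions.
        "tokens_per_mention": tokens / mentions,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios(TOKENS, MENTIONS, ENTITIES).items():
        print(f"{name}: {value:.2f}")
```

These are corpus-wide averages only; the abstract does not break the counts down by genre, so per-genre ratios cannot be derived from it.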

Authors (6)
  1. Pontakorn Trakuekul (1 paper)
  2. Wei Qi Leong (7 papers)
  3. Charin Polpanumas (6 papers)
  4. Jitkapat Sawatphol (4 papers)
  5. William Chandra Tjhi (7 papers)
  6. Attapol T. Rutherford (6 papers)
