Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

Published 9 Apr 2020 in cs.CL | (2004.14134v2)

Abstract: Segmentation is a fundamental step for most Natural Language Processing tasks. The Kurdish language is a multi-dialect, under-resourced language which is written in different scripts. The lack of various segmented corpora is one of the major bottlenecks in Kurdish language processing. We used Punkt, an unsupervised machine learning method, to segment a Kurdish corpus of Sorani dialect, written in Persian-Arabic script. According to the literature, studies on using Punkt on non-Latin data are scanty. In our experiment, we achieved an F1 score of 91.10% and had an Error Rate of 16.32%. The high Error Rate is mainly due to the situation of abbreviations in Kurdish and partly because of ordinal numerals. The data is publicly available at https://github.com/KurdishBLARK/ KTC-Segmented for non-commercial use under the CC BY-NC-SA 4.0 licence.

Abstract PDF Upgrade to Chat

Citations (4)

View on Semantic Scholar

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

We haven't generated follow-up questions for this paper yet.

Generate Now

Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

Summary

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Authors (2)

Collections

Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

Summary

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Related Papers

Authors (2)

Collections