
Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging (1709.00616v1)

Published 2 Sep 2017 in cs.CL

Abstract: Word segmentation plays a pivotal role in improving any Arabic NLP application, and much research has therefore gone into improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of Machine Translation and POS tagging, we found these methods to achieve close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source to target tokens, and that a ratio close to 1 or greater gives optimal performance.
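
The source-to-target token ratio mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `token_ratio` and `char_segment`, the toy transliterated sentence pair, and the choice to count tokens over the whole corpus are all assumptions made for illustration. It only shows how changing the source segmentation (whitespace words versus characters) shifts the ratio.

```python
def char_segment(sentence):
    """Hypothetical character-level segmenter: one token per non-space character."""
    return [ch for ch in sentence if not ch.isspace()]


def token_ratio(source_sentences, target_sentences,
                segment_source=str.split, segment_target=str.split):
    """Return (total source tokens) / (total target tokens) over a parallel corpus.

    The segmenters are assumptions: by default both sides are split on whitespace.
    """
    src_tokens = sum(len(segment_source(s)) for s in source_sentences)
    tgt_tokens = sum(len(segment_target(t)) for t in target_sentences)
    if tgt_tokens == 0:
        raise ValueError("target side contains no tokens")
    return src_tokens / tgt_tokens


# Toy parallel pair (illustrative only, romanized source; not data from the paper).
source = ["wsyktbwn ktAbA"]
target = ["and they will write a book"]

word_ratio = token_ratio(source, target)                               # 2 source words / 6 target words
char_ratio = token_ratio(source, target, segment_source=char_segment)  # 13 source characters / 6 target words

print(f"word-level source ratio: {word_ratio:.2f}")
print(f"char-level source ratio: {char_ratio:.2f}")
```

In this toy case, unsegmented source words give a ratio well below 1, while character-level segmentation of the source pushes it above 1. Sub-word segmentation would fall somewhere in between, depending on the vocabulary size chosen.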

Authors (6)
  1. Hassan Sajjad
  2. Fahim Dalvi
  3. Nadir Durrani
  4. Ahmed Abdelali
  5. Yonatan Belinkov
  6. Stephan Vogel
Citations (27)
