Shamela: A Large-Scale Historical Arabic Corpus (1612.08989v1)
Published 28 Dec 2016 in cs.CL
Abstract: Arabic is a widely spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies in which we show its application to the digital humanities.
- Yonatan Belinkov
- Alexander Magidow
- Maxim Romanov
- Avi Shmidman
- Moshe Koppel