
Improving Yorùbá Diacritic Restoration (2003.10564v1)

Published 23 Mar 2020 in cs.CL

Abstract: Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. They provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any computational Speech or Natural Language Processing task. However, diacritic marks are commonly excluded from electronic texts due to limited device and application support, as well as limited education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority-Biblical corpus with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general-purpose, public-domain Yorùbá evaluation dataset of modern journalistic news text, selected to be multi-purpose and to reflect contemporary usage. All pre-trained models, datasets and source code have been released as an open-source project to advance efforts on Yorùbá language technology.
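To make the task concrete (this sketch is illustrative and not from the paper): diacritic restoration maps undiacritized text back to its fully diacritized form. The inverse operation, stripping diacritics, which produces the kind of degraded input such models must recover from, can be sketched with standard Unicode normalization:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (tonal and orthographic diacritics)
    by NFD-decomposing each character and dropping marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# "Yorùbá" becomes the ambiguous, undiacritized "Yoruba" that a
# restoration model would need to map back to its correct form.
print(strip_diacritics("Yorùbá"))
```

A restoration model learns the reverse mapping, which is nontrivial because one undiacritized word can correspond to several distinct diacritized words.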

Authors (7)
  1. Iroro Orife
  2. David I. Adelani
  3. Timi Fasubaa
  4. Victor Williamson
  5. Wuraola Fisayo Oyewusi
  6. Olamilekan Wahab
  7. Kola Tubosun
Citations (5)
