
Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text (1804.00832v2)

Published 3 Apr 2018 in cs.CL

Abstract: Yorùbá is a widely spoken West African language with a writing system rich in tonal and orthographic diacritics. With very few exceptions, diacritics are omitted from electronic texts due to limited device and application support. Diacritics provide morphological information and are crucial for lexical disambiguation and pronunciation, making them vital for any Yorùbá text-to-speech (TTS), automatic speech recognition (ASR), and NLP task. Reframing Automatic Diacritic Restoration (ADR) as a machine translation task, we experiment with two different attentive Sequence-to-Sequence neural models to process undiacritized text. On our evaluation dataset, this approach produces diacritization error rates of less than 5%. We have released pre-trained models, datasets and source-code as an open-source project to advance efforts on Yorùbá language technology.
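The abstract's key move is to treat ADR as character-level "translation" from undiacritized to diacritized text, handled by an encoder-decoder with attention. Below is a minimal sketch of such an attentive Sequence-to-Sequence model; the GRU layers, embedding and hidden sizes, vocabulary sizes, and dot-product (Luong-style) attention are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal attentive seq2seq sketch for diacritic restoration, framed as
# "translation" from undiacritized to diacritized character sequences.
# All dimensions and the dot-product attention are assumptions for
# illustration, not the authors' reported configuration.
import torch
import torch.nn as nn

class AttentiveSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=128, hid=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid * 2, tgt_vocab)  # [context; decoder state]

    def forward(self, src, tgt_in):
        enc_out, h = self.encoder(self.src_emb(src))        # (B, S, H)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)  # (B, T, H)
        # Dot-product attention: score every source position per target step.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))  # (B, T, S)
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn, enc_out)                    # (B, T, H)
        return self.out(torch.cat([context, dec_out], dim=-1))

# Toy usage with random character ids; in the real task the source would be
# undiacritized text (e.g. "bata") and the target its diacritized form
# (e.g. "bàtà", "shoe").
model = AttentiveSeq2Seq(src_vocab=64, tgt_vocab=96)
src = torch.randint(0, 64, (2, 10))     # undiacritized character ids
tgt_in = torch.randint(0, 96, (2, 10))  # shifted diacritized targets
logits = model(src, tgt_in)             # (2, 10, 96), fed to cross-entropy
```

Because undiacritized and diacritized text are monotonically aligned character for character, attention gives the decoder the local source context it needs to disambiguate each character, which is what makes the translation framing a natural fit for ADR.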

Authors (1)
  1. Iroro Orife (20 papers)
Citations (24)
