The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

Published 31 Mar 2026 in cs.CL and cs.LG | (2603.29244v2)

Abstract: We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings, collected through a dedicated community data collection platform involving over 100 contributors. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.


Authors (5)

Summary

The paper presents a community-sourced multimodal dataset that fills a critical gap for low-resource African languages.
It employs mobile-first, dual data collection methods to produce high-quality text-audio pairs for improved model training.
Baseline results, including a Swahili ASR WER of 3.24% on Common Voice, set a new state of the art for Swahili speech recognition and provide first baselines for speech synthesis in several languages.

The Thiomi Dataset: Multimodal Language Resources for Low-Resource African Languages

Introduction and Motivation

The Thiomi Dataset (2603.29244) is a comprehensive, community-sourced corpus spanning ten African languages across four language families, which collectively represent over 300 million speakers. Despite Africa's vast linguistic diversity, the majority of its languages remain underrepresented in NLP research infrastructure and products. This scarcity limits the availability and accuracy of ASR, MT, and TTS systems, ultimately reinforcing digital marginalization.

The dataset specifically targets languages with both high speaker counts and severe resource deficits: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali, Wolof, and Fulani. Because existing resources such as FLORES-200, FLEURS, MMS, and WAXAL predominantly address evaluation or West and Central African languages, Thiomi fills a critical gap by providing paired text-audio resources for East African languages and establishing baseline models for both speech and translation tasks.

Data Collection Methodology and Platform Architecture

Thiomi employs a mobile-first, web-based platform optimized for low-bandwidth environments and broad accessibility, both of which are particularly important in East African contexts. The data acquisition pipeline consists of two complementary strategies:

Translation-Based (Text-First): Curated English sentences covering ten topical domains are translated into the target language, peer-validated, and then recorded as read-aloud speech. This ensures fully aligned EN–target language pairs across diverse registers.
Audio-First (Transcription-Based): Native speakers record spontaneous speech, which is later transcribed and optionally translated. This approach captures colloquial, dialectal, and non-standardized language use often missing from scripted speech data.

The platform incorporates a multi-tier quality assurance system. Text passes through peer review and expert linguistic evaluation; audio passes through automated filters (voice activity detection, VAD, and signal-to-noise ratio, SNR) followed by triple-listener review. This combination yields high-quality annotations at scale and reliable ground-truth resources for downstream modeling.
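The audio side of this workflow can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: the function names, the thresholds (`min_speech_ratio`, `min_snr_db`), and the majority-vote rule for combining the three listener judgements are all assumptions, since the summary does not specify them.

```python
import math
from collections import Counter


def estimate_snr_db(speech_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from mean speech and noise power."""
    if speech_power <= 0:
        return -math.inf  # no speech energy at all
    if noise_power <= 0:
        return math.inf  # no measurable noise
    return 10.0 * math.log10(speech_power / noise_power)


def passes_auto_filters(speech_ratio: float, snr_db: float,
                        min_speech_ratio: float = 0.3,  # hypothetical threshold
                        min_snr_db: float = 15.0) -> bool:  # hypothetical threshold
    """Automated gate: enough detected speech (VAD) and a clean enough signal (SNR)."""
    return speech_ratio >= min_speech_ratio and snr_db >= min_snr_db


def listener_verdict(votes: list[str]) -> str:
    """Combine exactly three listener judgements by simple majority (assumed rule)."""
    if len(votes) != 3:
        raise ValueError("expected exactly three listener votes")
    label, count = Counter(votes).most_common(1)[0]
    # With three distinct labels there is no majority, so hand off to an expert.
    return label if count >= 2 else "escalate"
```

A recording would first need to clear `passes_auto_filters` and only then be routed to three listeners, which keeps human effort focused on clips that are at least technically usable.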

Dataset Composition and Quality Analysis

Thiomi comprises over 601,000 approved sentences and 385,000+ audio recordings across its language inventory. For the major languages, approval rates for text quality range from 87% to 100%, with inter-rater agreement consistently above $\kappa = 0.82$. Expert-reviewed samples show high pass rates (e.g., 99.7% for Somali, 93.4% for Luo). Audio approval rates fall within 78–86%, with noise and pronunciation being prominent rejection factors.
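For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa for two raters. The summary does not say which kappa variant the authors report (Cohen's for rater pairs, Fleiss' for larger panels, or another), so the two-rater form here is an assumption, and the labels in the usage note are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled at random with their own label rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With ten illustrative items where the raters agree on nine, `cohens_kappa(["ok"] * 8 + ["bad"] * 2, ["ok"] * 7 + ["bad"] * 3)` gives roughly 0.74: raw agreement is 0.90, but 0.62 of that would be expected by chance given how often each rater says "ok".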

While text data is fully aligned and balanced across topical domains, ongoing collection for certain languages—especially Maasai, Kipsigis, and Wolof—reflects challenges in recruiting annotators and standardizing orthography for languages with weaker written traditions. The dataset's design ensures strict train/dev/test partitioning with contributor disjointness, improving the validity of benchmarking and transfer learning evaluations.
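One common way to obtain contributor-disjoint partitions is to assign each contributor, rather than each recording, to a split. The sketch below does this with a stable hash of the pseudonymized ID. The field name `contributor_id`, the 80/10/10 proportions, and the hashing scheme are assumptions for illustration; the summary does not describe how the authors built their splits.

```python
import hashlib


def split_for_contributor(contributor_id: str,
                          dev_pct: int = 10, test_pct: int = 10) -> str:
    """Map a contributor to train/dev/test deterministically via a stable hash."""
    digest = hashlib.sha256(contributor_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def partition(records: list[dict]) -> dict[str, list[dict]]:
    """Place every record in the split of its contributor, so no speaker spans splits."""
    splits: dict[str, list[dict]] = {"train": [], "dev": [], "test": []}
    for record in records:
        splits[split_for_contributor(record["contributor_id"])].append(record)
    return splits
```

Because the assignment depends only on the contributor ID, recordings added later by an existing speaker land in the same split, which avoids leaking test speakers into training data as the corpus grows.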

Baseline Modeling Results

Automatic Speech Recognition (ASR)

Using a Wav2Vec2-BERT model (607M parameters), the authors achieve a Swahili ASR WER of 3.24% on Common Voice, down from the previous XLS-R SOTA of 8.3%, a 5.1 percentage point absolute and 61% relative reduction. Notably, this result leverages continued pretraining on a relatively small quantity of labeled data, as detailed in associated research (Mutisya et al., 11 Mar 2026). Similar pipelines extend to other languages, yielding the first published ASR results for several of them (e.g., Kikuyu 5.5%, Kamba 6.2%, Somali 4.3% WER).
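To make the metric and the reported improvement concrete, the following sketch computes word error rate as word-level edit distance divided by reference length, and then the relative reduction between two WER values. It is a plain illustration of the standard definitions, not the authors' evaluation script; text normalization (casing, punctuation), which affects real WER numbers, is deliberately left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional improvement of `new` over `baseline`."""
    return (baseline - new) / baseline
```

Applied to the figures quoted above, `8.3 - 3.24` is 5.06 points (about 5.1), and `relative_reduction(8.3, 3.24)` is about 0.61, matching the 61% relative reduction.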

Languages with lower resource availability or complex prosodic features (e.g., Maasai, Kipsigis) exhibit higher WER, highlighting the ceiling that the absence of tone-marked orthographies imposes on both ASR and TTS.

Machine Translation (MT)

MT baselines use a fine-tuned NLLB-200-distilled-600M model. BLEU scores for the Somali–English (64.2), Swahili–English (55.8), and Luo–English (44.6) directions indicate robust translation quality for the better-resourced languages within constrained domains. Scores above 50 suggest that the collected parallel data is sufficient for effective model training and downstream deployment. Translation from English into the target languages consistently yields lower BLEU, likely reflecting linguistic asymmetries and translational adequacy differences.

Text-to-Speech (TTS)

VITS models trained on Thiomi's audio achieve MOS values above 4.0 (Swahili: 4.12), interpreted as natural-sounding synthesis. For most other languages, MOS values in the 3.5–4.0 band indicate acceptable quality, albeit with perceptible artifacts. These results support the dual-modality corpus design and provide the field with first-of-their-kind TTS models for many of the covered languages.
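As a small illustration of how such scores are read, the sketch below averages listener ratings into a MOS and maps it onto the two bands described above. The 1 to 5 rating scale is the conventional one for MOS tests and is assumed here; the band labels simply restate the interpretation in this summary, and the function names are hypothetical.

```python
from statistics import mean


def mean_opinion_score(ratings: list[int]) -> float:
    """Average listener ratings, assuming the conventional 1 (bad) to 5 (excellent) scale."""
    if not ratings or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be a non-empty list of values in 1..5")
    return mean(ratings)


def quality_band(mos: float) -> str:
    """Map a MOS onto the bands used in the discussion above."""
    if mos > 4.0:
        return "natural-sounding"
    if mos >= 3.5:
        return "acceptable, with perceptible artifacts"
    return "below the bands discussed"
```

Under this reading, the reported Swahili value of 4.12 falls in the "natural-sounding" band.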

Limitations and Ethical Considerations

There are several acknowledged constraints:

Domain Coverage: Absence of legal, journalistic, and scientific content restricts cross-domain generalization.
Orthographic and Prosodic Representation: Lack of standardized tone marking in tonal languages limits phonological modeling accuracy.
Speaker/Dialect Diversity: Urban and university-centric participant recruitment produces incomplete demographic and regional linguistic representation.
Low-Resource Pipeline Stages: Collection for Maasai, Kipsigis, and Wolof remains ongoing—affecting the balance and model coverage.

From an ethical perspective, the paper documents a commitment to contributor consent, fair compensation, privacy-preserving data release (pseudonymized speaker IDs, CC BY 4.0 licensing), and explicit community engagement. The release notes the risk of potential misuse, notably ASR-powered surveillance, and highlights the primacy of local governance frameworks in downstream deployment.

Implications and Prospective Impact

Thiomi delivers robust evidence that coordinated, compensated, and community-driven collection pipelines can yield datasets adequate for state-of-the-art ASR and MT benchmarks in previously under-served African languages. The combination of paired text/audio data and released model baselines will facilitate improvements in multilingual representation, cross-lingual transfer, and inclusivity within NLP systems. In the broader context, Thiomi complements ongoing projects such as WAXAL (Diack et al., 2 Feb 2026), MasakhaNER, and FLORES, but differentiates itself via East African coverage, integrated textual and audio modalities, and transparency regarding quality and collection methodology.

Practical implications extend to the deployment of accessible speech and translation technologies, digital health applications, and inclusive education tools. The dataset's public release on HuggingFace further promotes reproducibility, community extension, and model improvement.

Looking forward, the authors identify further research directions including labeled tone annotation, expansion to additional African languages (Oromo, Amharic, Hausa, Yoruba, Igbo), and the creation of standardized evaluation benchmarks and public model releases.

Conclusion

The Thiomi Dataset (2603.29244) establishes a comprehensive multimodal corpus, sets a new performance bar with a Swahili ASR WER of 3.24%, and delivers first-ever ASR and TTS baselines for multiple East African languages. The resource is poised to accelerate both practical technology deployment and foundational research in multilingual NLP for linguistically marginalized populations.
