Building a Luganda Text-to-Speech Model From Crowdsourced Data (2405.10211v1)
Abstract: Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of the high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged 20 to 49. Although the generated speech is intelligible, its quality remains below that of models trained on studio-grade recordings. This is largely because insufficient data preprocessing was applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is harder to achieve due to varying intonations as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can be improved by training on multiple speakers of close intonation, combined with further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjective listening comparisons of their voice recordings. In addition to trimming silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to retain only recordings with an estimated MOS above 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55, compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
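The preprocessing pipeline described above (silence trimming followed by MOS-based filtering) can be sketched as follows. This is a minimal illustration, not the authors' exact tooling: the energy-based trimmer stands in for the silence trimming step, and `estimate_mos` is a hypothetical callable standing in for the pre-trained, non-intrusive MOS estimator the paper uses.

```python
import numpy as np

def trim_silence(audio: np.ndarray, frame_len: int = 512,
                 threshold: float = 1e-3) -> np.ndarray:
    """Trim low-energy frames from the start and end of a waveform.

    A simple energy-based stand-in for the silence trimming the paper
    describes; production pipelines often use a library trimmer instead.
    """
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return audio
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # per-frame mean energy
    voiced = np.flatnonzero(energy > threshold)  # frames above the floor
    if voiced.size == 0:
        return audio[:0]                         # entirely silent clip
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return audio[start:end]

def filter_by_mos(clips, estimate_mos, mos_threshold: float = 3.5):
    """Keep only clips whose estimated MOS exceeds the threshold.

    `estimate_mos` is a hypothetical stand-in for a pre-trained,
    self-supervised MOS estimation model, not a specific library API.
    """
    return [c for c in clips if estimate_mos(c) > mos_threshold]

# Example: a 440 Hz tone burst padded with silence on both sides.
sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
clip = np.concatenate([np.zeros(sr // 4), tone, np.zeros(sr // 4)])

trimmed = trim_silence(clip)  # shorter than the padded clip
```

In the actual pipeline, the enhancement model runs between these two steps, and the MOS filter is applied to the enhanced audio so that only recordings with perceived quality above 3.5 enter training.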
- HiFi++: A unified framework for neural vocoding, bandwidth extension and speech enhancement. arXiv e-prints, 2022.
- Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.
- Building text and speech datasets for low-resourced languages: A case of languages in East Africa. 2022.
- Real time speech enhancement in the waveform domain. arXiv preprint arXiv:2006.12847, 2020.
- The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
- Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33:8067–8077, 2020.
- Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pp. 5530–5540. PMLR, 2021.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020.
- Effect of data reduction on sequence-to-sequence neural TTS. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7075–7079. IEEE, 2019.
- Training multi-speaker neural text-to-speech systems using speaker-imbalanced speech corpora. arXiv preprint arXiv:1904.00771, 2019.
- Building TTS systems for low resource languages under resource constraints. In Proc. 1st Workshop on Speech for Social Good (S4SG), 2022.
- Can we use Common Voice to train a multi-speaker TTS system? In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 900–905. IEEE, 2023.
- Multilingual model and data resources for text-to-speech in ugandan languages. In 4th Workshop on African Natural Language Processing, 2023.
- Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, pp. 749–752. IEEE, 2001.
- A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE international conference on acoustics, speech and signal processing, pp. 4214–4217. IEEE, 2010.
- Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. arXiv preprint arXiv:1904.04472, 2019.
- Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE, 2020.