KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis (2404.01033v2)
Abstract: This study focuses on the creation of the KazEmoTTS dataset, designed for emotional Kazakh text-to-speech (TTS) applications. KazEmoTTS is a collection of 54,760 audio-text pairs with a total duration of 74.85 hours, comprising 34.23 hours delivered by a female narrator and 40.62 hours by two male narrators. The emotions covered are "neutral", "angry", "happy", "sad", "scared", and "surprised". We also developed a TTS model trained on the KazEmoTTS dataset. Objective and subjective evaluations were used to assess the quality of the synthesized speech, yielding MCD scores ranging from 6.02 to 7.67 and MOS ratings from 3.51 to 3.57. To facilitate reproducibility and encourage further research, we have made our code, pre-trained model, and dataset available in our GitHub repository.
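For context on the objective metric, mel-cepstral distortion (MCD) compares the mel-cepstral coefficients of synthesized and reference speech frame by frame. The sketch below shows the standard frame-averaged formulation; the function name, array shapes, and the assumptions of pre-aligned frames and a dropped 0th (energy) coefficient are illustrative choices, not details taken from the paper.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Frame-averaged mel-cepstral distortion (MCD) in dB.

    ref_mcep, syn_mcep: arrays of shape (n_frames, n_coeffs) holding
    mel-cepstral coefficients of the reference and synthesized utterance.
    Assumes the two sequences are already time-aligned (e.g. via DTW)
    and that the 0th (energy) coefficient has been removed.
    """
    assert ref_mcep.shape == syn_mcep.shape, "inputs must be aligned"
    # Per-frame Euclidean distance between coefficient vectors.
    diff = ref_mcep - syn_mcep
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Scale to decibels and average over frames.
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))
```

Lower MCD indicates mel-cepstra closer to the reference recording, so the paper's 6.02 to 7.67 dB range is read as "lower is better" alongside the subjective MOS scores.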