An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation (2402.16380v1)

Published 26 Feb 2024 in eess.AS, cs.AI, cs.CL, and cs.LG

Abstract: Data availability is crucial for advancing artificial intelligence applications, including voice-based technologies. As content creation, particularly on social media, sees increasing demand, translation and text-to-speech (TTS) technologies have become essential tools. Notably, the performance of these TTS systems depends heavily on the quality of their training data, underscoring the mutual dependence of data availability and technological progress. This paper introduces an end-to-end tool that generates high-quality datasets for TTS models, addressing this critical need. The contributions of this work are manifold and include: the integration of language-specific phoneme distributions into sample selection, automation of the recording process, automated and human-in-the-loop quality assurance of recordings, and processing of recordings to meet specified formats. Through these features, the proposed application aims to streamline the dataset creation process for TTS models, thereby facilitating advancements in voice-based technologies.
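The abstract does not specify how phoneme-aware sample selection is implemented, so the following is only a minimal illustrative sketch of the general idea: greedily choosing sentences whose phonemes bring the running selection closest to a language-specific target phoneme distribution. The function names (phoneme_counts, distribution_gap, greedy_select) and the L1 distribution gap are assumptions for illustration, not the paper's actual algorithm.

    from collections import Counter

    def phoneme_counts(phonemes):
        # Count phoneme occurrences in one phonemized sentence (a list of phoneme symbols).
        return Counter(phonemes)

    def distribution_gap(counts, target_dist):
        # L1 gap between the selection's empirical phoneme frequencies and the
        # language-specific target distribution (a dict mapping phoneme -> probability).
        total = sum(counts.values()) or 1
        return sum(abs(counts[p] / total - q) for p, q in target_dist.items())

    def greedy_select(candidates, target_dist, n_samples):
        # Greedily pick sentences whose phonemes move the running distribution
        # closest to the target. `candidates` is a list of (text, phoneme_list) pairs.
        selected, running, remaining = [], Counter(), list(candidates)
        for _ in range(min(n_samples, len(remaining))):
            best = min(remaining,
                       key=lambda c: distribution_gap(running + phoneme_counts(c[1]), target_dist))
            remaining.remove(best)
            selected.append(best[0])
            running += phoneme_counts(best[1])
        return selected

    # Toy usage with a hypothetical two-phoneme target distribution.
    target = {"a": 0.6, "b": 0.4}
    pool = [("aba", ["a", "b", "a"]), ("bb", ["b", "b"]), ("aa", ["a", "a"])]
    print(greedy_select(pool, target, 2))

In practice, such a selection step would operate on phonemized text from a grapheme-to-phoneme front end and could use a divergence measure other than the L1 gap shown here.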
