Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion
Abstract: This paper proposes two methodologies for constructing customized Common Voice datasets for low-resource languages such as Hindi. The first leverages Bark, a transformer-based text-to-audio model developed by Suno, and incorporates Meta's EnCodec and a pre-trained HuBERT model to enhance Bark's performance. The second employs Retrieval-Based Voice Conversion (RVC) and uses the Ozen toolkit for data preparation. Both methodologies contribute to ASR technology and offer insight into the challenges of constructing customized Common Voice datasets for under-resourced languages. They also provide a pathway to high-quality, personalized voice generation for a range of applications.
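To make the dataset-construction idea concrete, the following is a minimal sketch of how synthesized clips might be collected into a Common Voice-style layout (a `clips/` folder plus a tab-separated manifest). It is not the authors' pipeline. The function names, the three manifest columns, the WAV output (Common Voice itself distributes MP3 clips), and the 16 kHz sample rate are all assumptions made for illustration, and `placeholder_synthesize` is a hypothetical stand-in that emits a plain tone where a real system would call Bark or an RVC model.

```python
import csv
import math
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 16_000  # assumed sample rate for the placeholder audio


def placeholder_synthesize(text: str) -> list[int]:
    """Hypothetical stand-in for a TTS call (e.g. Bark): returns a 220 Hz tone
    of 0.1 s per whitespace-separated word as 16-bit PCM samples."""
    n_samples = SAMPLE_RATE // 10 * max(1, len(text.split()))
    return [
        int(8000 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE))
        for i in range(n_samples)
    ]


def write_wav(path: Path, samples: list[int]) -> None:
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


def build_dataset(sentences, out_dir, synthesize=placeholder_synthesize,
                  speaker_id="synthetic_speaker_0"):
    """Synthesize one clip per sentence and record it in a TSV manifest.

    The column names are an assumed subset modeled on Common Voice manifests.
    """
    out_dir = Path(out_dir)
    clips_dir = out_dir / "clips"
    clips_dir.mkdir(parents=True, exist_ok=True)
    manifest = out_dir / "validated.tsv"
    with manifest.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["client_id", "path", "sentence"])
        for idx, sentence in enumerate(sentences):
            clip_name = f"clip_{idx:05d}.wav"
            write_wav(clips_dir / clip_name, synthesize(sentence))
            writer.writerow([speaker_id, clip_name, sentence])
    return manifest
```

Passing a different callable as `synthesize` would let the same manifest logic wrap an actual text-to-audio or voice-conversion model, provided it returns 16-bit PCM samples at the assumed sample rate.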