Towards achieving robust universal neural vocoding (1811.06292v2)
Abstract: This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers from 17 languages. When the recording conditions are studio-quality, this vocoder generates speech of consistently good quality (98% relative mean MUSHRA compared to natural speech), whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario. When the recordings vary significantly in quality, or when moving towards non-speech vocalizations or singing, the vocoder still significantly outperforms speaker-dependent vocoders, but operates at a lower average relative MUSHRA of 75%. These results are consistent across languages, whether seen during training (e.g. English or Japanese) or unseen (e.g. Wolof, Swahili, Amharic).
- Jaime Lorenzo-Trueba
- Thomas Drugman
- Javier Latorre
- Thomas Merritt
- Bartosz Putrycz
- Roberto Barra-Chicote
- Alexis Moinet
- Vatsal Aggarwal