Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data (2309.02730v3)
Abstract: While many recent any-to-any voice conversion models succeed in transferring some of the target speech's style information to the converted speech, they still fall short of faithfully reproducing the target speaker's speaking style. In this work, we propose a novel method to extract rich style information from target utterances and to transfer it efficiently to the source speech content without requiring text transcriptions or speaker labels. Our approach introduces an attention mechanism built on a self-supervised learning (SSL) model that collects the target speaker's speaking styles, each corresponding to different phonetic content. These styles are represented as a set of embeddings called a stylebook. Next, the source speech's phonetic content attends over the stylebook to determine the final target style for each piece of source content. Finally, the content information extracted from the source speech and the content-dependent target style embeddings are fed into a diffusion-based decoder to generate the mel-spectrogram of the converted speech. Experimental results show that, combined with a diffusion-based generative model, our proposed method achieves better speaker similarity in any-to-any voice conversion than baseline models, while the growth in computational complexity with longer utterances is suppressed.
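The abstract describes two attention steps: first a fixed-size stylebook is collected from the target utterance's SSL features, then each source content frame attends over that stylebook to retrieve a content-dependent style vector. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the module name, slot count, feature dimension, and use of Perceiver-style learned queries are illustrative assumptions.

```python
import torch
import torch.nn as nn


class StylebookExtractor(nn.Module):
    """Sketch: collect a fixed-size set of style embeddings ("stylebook") from a
    target utterance, then let each source content frame attend over the
    stylebook to obtain its own content-dependent style vector."""

    def __init__(self, d_model: int = 768, n_slots: int = 64, n_heads: int = 8):
        super().__init__()
        # Learned queries, one per stylebook slot (Perceiver-style pooling); assumed detail.
        self.slot_queries = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        # Step 1: slots attend over the target utterance's SSL features.
        self.collect_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Step 2: source content frames attend over the stylebook.
        self.lookup_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, target_ssl: torch.Tensor, source_content: torch.Tensor):
        """target_ssl:     (B, T_tgt, d_model) SSL features of the target speech
        source_content: (B, T_src, d_model) content features of the source speech
        returns:        (B, T_src, d_model) per-frame target style embeddings
        """
        b = target_ssl.size(0)
        queries = self.slot_queries.unsqueeze(0).expand(b, -1, -1)
        # Stylebook: one embedding per slot, summarizing the target speaker's
        # style for different phonetic content.
        stylebook, _ = self.collect_attn(queries, target_ssl, target_ssl)
        # Content-dependent style: each source frame retrieves its own style vector.
        styles, _ = self.lookup_attn(source_content, stylebook, stylebook)
        return styles


if __name__ == "__main__":
    extractor = StylebookExtractor()
    tgt = torch.randn(1, 500, 768)    # e.g. SSL features of a target utterance
    src = torch.randn(1, 300, 768)    # content features of a source utterance
    print(extractor(tgt, src).shape)  # torch.Size([1, 300, 768])
```

Because the second attention runs over a fixed number of stylebook slots rather than over every target frame, its cost stays constant as the target utterance grows, which is consistent with the abstract's claim that the increase in computational complexity with longer utterances is suppressed.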
- Hyungseob Lim
- Kyungguen Byun
- Sunkuk Moon
- Erik Visser