
Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder (2211.08191v2)

Published 15 Nov 2022 in eess.AS and cs.LG

Abstract: Leveraging the fact that speaker identity and content vary on different time scales, the factorized hierarchical variational autoencoder (FHVAE) uses different latent variables to represent these two attributes. Disentanglement of the attributes is achieved through different prior settings for the corresponding latent variables. For the prior of the speaker identity variable, FHVAE assumes a Gaussian distribution with an utterance-scale varying mean and a fixed variance. By setting a small fixed variance, the training process encourages identity variables within one utterance to gather close to the mean of their prior. However, this constraint is relatively weak, as the mean of the prior changes between utterances. Therefore, we introduce contrastive learning into the FHVAE framework so that speaker identity variables representing the same speaker gather together while distancing themselves as far as possible from those of other speakers. The model structure is unchanged in this work; only the training process is modified, so no additional cost is incurred during testing. Voice conversion is chosen as the application in this paper. Latent-variable evaluations include speaker verification and identification for the speaker identity variable, and speech recognition for the content variable. Furthermore, voice conversion performance is assessed through fake speech detection experiments. Results show that the proposed method improves both speaker identity and content feature extraction compared to FHVAE, and outperforms the baseline on conversion.
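To make the mechanism described in the abstract concrete, the following is a minimal sketch of how a margin-based contrastive term over speaker-identity latents could look: same-speaker latents are pulled together while different-speaker latents are pushed apart. The function name, tensor shapes, and the exact loss form are illustrative assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def speaker_contrastive_loss(z_speaker, speaker_ids, margin=1.0):
    """Pairwise contrastive loss over speaker-identity latents.

    Pulls together latents from the same speaker and pushes latents
    from different speakers out to at least `margin`. All names and
    the precise loss form are assumptions for illustration.

    z_speaker:   (batch, dim) speaker-identity latent variables
    speaker_ids: (batch,) integer speaker labels
    """
    # Pairwise Euclidean distances between all latents in the batch.
    dist = torch.cdist(z_speaker, z_speaker)

    # Boolean masks: same-speaker pairs (excluding self-pairs) and
    # different-speaker pairs.
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    eye = torch.eye(len(speaker_ids), dtype=torch.bool, device=same.device)
    pos = dist[same & ~eye]   # same speaker, distinct samples
    neg = dist[~same]         # different speakers

    # Same-speaker pairs gather (small distance); different-speaker
    # pairs are penalized only while closer than `margin`.
    loss_pos = pos.pow(2).mean() if pos.numel() else dist.new_zeros(())
    loss_neg = F.relu(margin - neg).pow(2).mean() if neg.numel() else dist.new_zeros(())
    return loss_pos + loss_neg
```

In practice such a term would be added to the FHVAE training objective with some weighting; since it only changes training, inference-time cost is unaffected, consistent with the abstract's claim.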

