Contrastive Speaker Embedding With Sequential Disentanglement (2309.13253v1)

Published 23 Sep 2023 in eess.AS and cs.SD

Abstract: Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement that removes linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE disentangles speaker factors from content factors in an embedding space so that only the speaker factors are used to construct the contrastive loss objective. Because the content factors are excluded from contrastive learning, the resulting speaker embeddings are content-invariant. Experimental results on VoxCeleb1-test show that the proposed method consistently outperforms SimCLR. This suggests that applying sequential disentanglement is beneficial for learning speaker-discriminative embeddings.
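
To make the idea concrete, below is a minimal sketch (not the authors' code) of how a DSVAE-style encoder can split an utterance into a static speaker factor and per-frame content factors, with a SimCLR-style NT-Xent loss computed on the speaker factors only. The BiLSTM frame encoder, module names, and dimensions are illustrative assumptions; the paper's DSVAE additionally involves variational priors and reconstruction terms that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentanglingEncoder(nn.Module):
    """Split an utterance into a static (speaker) factor and per-frame (content)
    factors, in the spirit of a DSVAE encoder; hyperparameters are illustrative."""

    def __init__(self, feat_dim=80, hidden_dim=256, z_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.speaker_head = nn.Linear(2 * hidden_dim, z_dim)  # static branch (pooled over time)
        self.content_head = nn.Linear(2 * hidden_dim, z_dim)  # dynamic branch (one vector per frame)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                      # (batch, frames, 2 * hidden_dim)
        z_s = self.speaker_head(h.mean(dim=1))   # speaker factor, (batch, z_dim)
        z_c = self.content_head(h)               # content factors, (batch, frames, z_dim)
        return z_s, z_c


def nt_xent(z_a, z_b, temperature=0.1):
    """SimCLR's NT-Xent loss, applied here to the speaker factors of two views only."""
    n = z_a.size(0)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # (2n, z_dim), unit-norm
    sim = z @ z.t() / temperature                           # pairwise cosine similarities
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    # The positive for sample i is its other view: i + n for the first half, i - n for the second.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    encoder = DisentanglingEncoder()
    view_a = torch.randn(8, 200, 80)    # two "views" of the same 8 utterances,
    view_b = torch.randn(8, 200, 80)    # e.g. different segments or augmentations
    zs_a, _ = encoder(view_a)           # content factors are deliberately ignored
    zs_b, _ = encoder(view_b)           # by the contrastive objective
    loss = nt_xent(zs_a, zs_b)
    loss.backward()
```

Only z_s enters the contrastive loss, which is the sense in which the learned embeddings become content-invariant; in the full method the content factors would be consumed by the DSVAE's reconstruction and KL terms, which are not shown in this sketch.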

References (26)
  1. “The IDLAB VoxCeleb speaker recognition challenge 2020 system description,” arXiv preprint arXiv:2010.12468, 2020.
  2. “Augmentation adversarial training for self-supervised speaker recognition,” in Proc. Self-Supervised Learning for Speech and Audio Processing at NeurIPS Workshops, 2020.
  3. “An iterative framework for self-supervised deep speaker representation learning,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 6728–6732.
  4. “Contrastive self-supervised learning for text-independent speaker verification,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 6713–6717.
  5. “Self-supervised text-independent speaker verification using prototypical momentum contrastive learning,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 6723–6727.
  6. “Self-supervised speaker recognition with loss-gated learning,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 6142–6146.
  7. “Self-supervised speaker verification using dynamic loss-gate and label correction,” in Proc. Annual Conference of the International Speech Communication Association, 2022, pp. 4780–4784.
  8. “Non-contrastive self-supervised learning of utterance-level speech representations,” in Proc. Annual Conference of the International Speech Communication Association, 2022, pp. 4028–4032.
  9. “Pushing the limits of self-supervised speaker verification using regularized distillation framework,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2023, pp. 1–5.
  10. “A theoretical analysis of contrastive unsupervised representation learning,” in Proc. International Conference on Machine Learning, 2019, pp. 5628–5637.
  11. “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  12. “Probing the information encoded in x-vectors,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2019, pp. 726–733.
  13. “An empirical analysis of information encoded in disentangled neural speaker representations,” in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2020, pp. 194–201.
  14. “Disentangled sequential autoencoder,” in Proc. International Conference on Machine Learning, 2018, pp. 5670–5679.
  15. “Contrastively disentangled sequential variational autoencoder,” in Advances in Neural Information Processing Systems, 2021, pp. 10105–10118.
  16. “Shuffle is what you need,” in Proc. International Symposium on Chinese Spoken Language Processing, 2022, pp. 245–249.
  17. “A simple framework for contrastive learning of visual representations,” in Proc. International Conference on Machine Learning, 2020, pp. 1597–1607.
  18. “Momentum contrast for unsupervised visual representation learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
  19. “Emerging properties in self-supervised vision transformers,” in Proc. International Conference on Computer Vision, 2021, pp. 9650–9660.
  20. “On the duality between contrastive and non-contrastive self-supervised learning,” in Proc. International Conference on Learning Representations, 2023.
  21. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  22. D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. International Conference on Learning Representations, 2014.
  23. “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, 2020.
  24. “MUSAN: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
  25. “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Annual Conference of the International Speech Communication Association, 2020, pp. 3830–3834.
  26. “Acoustic feature shuffling network for text-independent speaker verification,” in Proc. Annual Conference of the International Speech Communication Association, 2022, pp. 4790–4794.
Authors (3)
  1. Youzhi Tu (3 papers)
  2. Man-Wai Mak (15 papers)
  3. Jen-Tzung Chien (6 papers)
Citations (4)
