Contrastive Speaker Embedding With Sequential Disentanglement (2309.13253v1)
Abstract: Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that only the speaker factors are used for constructing a contrastive loss objective. Because the content factors have been removed from contrastive learning, the resulting speaker embeddings will be content-invariant. Experimental results on VoxCeleb1-test show that the proposed method consistently outperforms SimCLR. This suggests that applying sequential disentanglement is beneficial for learning speaker-discriminative embeddings.
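The pipeline the abstract describes can be sketched in two steps: a DSVAE encoder splits each segment embedding into a speaker factor and a content factor, and SimCLR's NT-Xent contrastive loss is then computed on the speaker factors alone. The sketch below, in plain NumPy, shows only the second step; the function name, array shapes, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR's NT-Xent loss over speaker factors only.

    z1, z2: (N, D) arrays of speaker-factor embeddings from two views
    of the same N utterances; content factors are assumed to have been
    discarded by the DSVAE before this point.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = (z @ z.T) / temperature                      # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                     # drop self-pairs
    # the positive for row i is the other view of the same utterance
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # row-wise log-sum-exp with the usual max-subtraction trick
    m = sim.max(axis=1, keepdims=True)
    lse = np.log(np.exp(sim - m).sum(axis=1)) + m[:, 0]
    pos_sim = sim[np.arange(2 * n), pos]
    # mean negative log-softmax probability of the positive pair
    return float(np.mean(lse - pos_sim))
```

Because the loss only ever sees the speaker factors, any contrast explained by linguistic content is unavailable to the objective, which is the mechanism behind the claimed content invariance.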
- “The IDLAB VoxCeleb speaker recognition challenge 2020 system description,” arXiv preprint arXiv:2010.12468, 2020.
- “Augmentation adversarial training for self-supervised speaker recognition,” in Proc. Self-Supervised Learning for Speech and Audio Processing at NeurIPS Workshops, 2020.
- “An iterative framework for self-supervised deep speaker representation learning,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 6728–6732.
- “Contrastive self-supervised learning for text-independent speaker verification,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 6713–6717.
- “Self-supervised text-independent speaker verification using prototypical momentum contrastive learning,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 6723–6727.
- “Self-supervised speaker recognition with loss-gated learning,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 6142–6146.
- “Self-supervised speaker verification using dynamic loss-gate and label correction,” in Proc. Annual Conference of the International Speech Communication Association, 2022, pp. 4780–4784.
- “Non-contrastive self-supervised learning of utterance-level speech representations,” in Proc. Annual Conference of the International Speech Communication Association, 2022, pp. 4028–4032.
- “Pushing the limits of self-supervised speaker verification using regularized distillation framework,” in Proc. International Conference on Acoustics, Speech, and Signal Processing, 2023, pp. 1–5.
- “A theoretical analysis of contrastive unsupervised representation learning,” in Proc. International Conference on Machine Learning, 2019, pp. 5628–5637.
- “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
- “Probing the information encoded in x-vectors,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2019, pp. 726–733.
- “An empirical analysis of information encoded in disentangled neural speaker representations,” in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2020, pp. 194–201.
- “Disentangled sequential autoencoder,” in Proc. International Conference on Machine Learning, 2018, pp. 5670–5679.
- “Contrastively disentangled sequential variational autoencoder,” in Advances in Neural Information Processing Systems, 2021, pp. 10105–10118.
- “Shuffle is what you need,” in Proc. International Symposium on Chinese Spoken Language Processing, 2022, pp. 245–249.
- “A simple framework for contrastive learning of visual representations,” in Proc. International Conference on Machine Learning, 2020, pp. 1597–1607.
- “Momentum contrast for unsupervised visual representation learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
- “Emerging properties in self-supervised vision transformers,” in Proc. International Conference on Computer Vision, 2021, pp. 9650–9660.
- “On the duality between contrastive and non-contrastive self-supervised learning,” in Proc. International Conference on Learning Representations, 2023.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. International Conference on Learning Representations, 2014.
- “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, 2020.
- “MUSAN: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
- “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Annual Conference of the International Speech Communication Association, 2020, pp. 3830–3834.
- “Acoustic feature shuffling network for text-independent speaker verification,” in Proc. Annual Conference of the International Speech Communication Association, 2022, pp. 4790–4794.
- Youzhi Tu
- Man-Wai Mak
- Jen-Tzung Chien