Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification (2309.04265v2)
Abstract: Contrastive self-supervised learning (CSL) for speaker verification (SV) has drawn increasing interest recently due to its ability to exploit unlabeled data. Performing data augmentation on raw waveforms, such as adding noise or reverberation, plays a pivotal role in achieving promising results in SV. Data augmentation, however, demands meticulous calibration to ensure that speaker-specific information remains intact, which is difficult to achieve without speaker labels. To address this issue, we introduce a novel framework that incorporates clean and augmented segments into the contrastive training pipeline. The clean segments are repurposed to pair with noisy segments to form additional positive and negative pairs. Moreover, the contrastive loss is weighted to increase the difference between the clean and augmented embeddings of different speakers. Experimental results on VoxCeleb1 suggest that the proposed framework achieves a remarkable 19% improvement over conventional methods and surpasses many existing state-of-the-art techniques.
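To make the core idea concrete, below is a minimal sketch (not the authors' released code) of a contrastive loss in which clean embeddings are paired with augmented embeddings of the same utterance as positives, while clean/augmented pairs across different utterances act as negatives whose contribution is up-weighted. The NT-Xent-style formulation, the function name, and the weight `w_neg` are assumptions for illustration, not details taken from the paper.

```python
# Sketch of the abstract's idea, assuming an NT-Xent-style base loss.
# The weighting scheme and all names here are illustrative assumptions.
import torch
import torch.nn.functional as F


def weighted_clean_aug_contrastive_loss(z_clean, z_aug,
                                         temperature=0.07, w_neg=1.5):
    """Contrastive loss over a batch of N utterances.

    z_clean: (N, D) embeddings of clean segments.
    z_aug:   (N, D) embeddings of augmented segments of the same utterances.
    Positives: (clean_i, aug_i). Negatives: (clean_i, aug_j) with i != j,
    scaled by `w_neg` so that clean and augmented embeddings of different
    speakers are pushed further apart.
    """
    z_clean = F.normalize(z_clean, dim=1)
    z_aug = F.normalize(z_aug, dim=1)

    # Cosine similarity between every clean and every augmented embedding.
    sim = z_clean @ z_aug.t() / temperature  # (N, N)

    n = sim.size(0)
    pos_mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Up-weight the off-diagonal (negative) logits: their gradient, and hence
    # the repulsive force on mismatched clean/augmented pairs, is scaled by w_neg.
    weights = torch.where(pos_mask, torch.ones_like(sim),
                          torch.full_like(sim, w_neg))
    logits = sim * weights

    # Cross-entropy with the matching augmented segment as the target class.
    targets = torch.arange(n, device=sim.device)
    return F.cross_entropy(logits, targets)


# Toy usage: 8 utterances, 192-dim embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    z_c = torch.randn(8, 192)
    z_a = z_c + 0.1 * torch.randn(8, 192)  # stand-in for augmented views
    print(weighted_clean_aug_contrastive_loss(z_c, z_a).item())
```

Setting `w_neg = 1` recovers a plain clean-versus-augmented NT-Xent loss; values above 1 emphasize the cross-speaker separation that the abstract describes.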